I don't know what to do.
Tor:

IT'S ALL BOTS BOTS BOTS EVERYWHERE!

A significant share of traffic (not necessarily unique users) from Tor is abusive or automated. Cloudflare's often-cited 2016 analysis found ~94% of the Tor requests they saw were "per se malicious": comment spam, vulnerability scanning, ad fraud, content scraping, and login brute-forcing. This doesn't mean 94% of users are bad actors; a small number of automated scripts and bots can generate enormous request volumes.
blog.cloudflare.com
Many organizations (security firms, governments like Australia's) describe Tor exit node traffic as "overwhelmingly malicious" in practice, leading them to block or heavily scrutinize it.
cyber.gov.au
Bots and malware sometimes use Tor for command-and-control (C&C), data exfiltration, or anonymity in attacks. Malicious relays/exit nodes have been a recurring issue too.
The current filter at Cloudflare:
Code: Select all
(ip.src.country in {"XX"}) or (ip.src.country in {"T1"}) or (not ip.src.country in {"AX" "AQ" "AT" "BY" "BE" "CA" "HR" "CZ" "DK" "EE" "FI" "FR" "DE" "GL" "HU" "IS" "IE" "IT" "JP" "LV" "LI" "LT" "LU" "MK" "MT" "MC" "NL" "NO" "PL" "PT" "RO" "SK" "SI" "ES" "SE" "CH" "GB" "UM" "US" "VA"})
It has started to look more normal now. I'm mixing Cloudflare with AbuseIPDB.
First I do Cloudflare filtering before requests even reach the website.
Next step, I check their IPs against AbuseIPDB.
If an IP is found there = BLOCK, no matter the confidence score or the age of the reports. Anything listed in AbuseIPDB gets access fully denied automatically.
If it's not in their database, I get the IP's usage type from AbuseIPDB, which can be "Commercial, Organization, Government, Military, University/College/School, Library, Content Delivery Network, Fixed Line ISP, Mobile ISP, Data Center/Web Hosting/Transit, Search Engine Spider, Reserved". I fully deny access to "Data Center/Web Hosting/Transit", "Content Delivery Network" and "Commercial".
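The AbuseIPDB step can be sketched roughly like this, assuming their v2 `/check` endpoint. The decision logic is just the policy above; the API key, the `maxAgeInDays` value and the field handling are my assumptions, so verify against their docs:

```python
import json
import urllib.request

# Usage types that get a full block even with zero abuse reports (per the policy above).
BLOCKED_USAGE_TYPES = {"Data Center/Web Hosting/Transit", "Content Delivery Network", "Commercial"}

def decide(total_reports: int, usage_type: str) -> str:
    """Apply the policy: any report at all = block, then block by usage type."""
    if total_reports > 0:
        return "block"
    if usage_type in BLOCKED_USAGE_TYPES:
        return "block"
    return "allow"

def check_ip(ip: str, api_key: str) -> str:
    """Look an IP up in AbuseIPDB v2 and return 'block' or 'allow'."""
    req = urllib.request.Request(
        f"https://api.abuseipdb.com/api/v2/check?ipAddress={ip}&maxAgeInDays=365",
        headers={"Key": api_key, "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)["data"]
    return decide(data.get("totalReports", 0), data.get("usageType") or "")
```

Keeping `decide()` separate from the HTTP call makes the policy easy to test and tweak without burning API quota.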
It seems to clear most junk out now. Turns out, we DO have some actual human visitors. Well, hopefully. The chance of these being bots is pretty low; not zero, but acceptable. Based on that, we have maybe 10-20 actual daily visitors.
Of course, this approach fully blocks Google and the other search engine bots, so it's not great. I need to edit it to allow those in somehow, but I'm not sure how, because Applebot (or whatever the hell it is) starts thousands of crawls on the website at once and never stops. How many damn times are you going to crawl the same old content? I would like to allow the Google, Bing and OpenAI bots somehow. Grok/X doesn't identify itself, so I couldn't allow it even if I wanted to. *lol*
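The standard way to let the real Googlebot/Bingbot through without trusting the user agent: reverse-DNS the client IP, check the PTR hostname suffix, then forward-resolve the hostname to confirm it maps back to the same IP. A rough sketch; the suffix list is my guess at the usual published domains, not exhaustive:

```python
import socket

# PTR suffixes the big crawlers publish for their fetchers.
# Treat this tuple as an assumption to verify against each vendor's docs.
BOT_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com", ".applebot.apple.com")

def hostname_is_known_bot(hostname: str) -> bool:
    """Pure suffix check on a PTR hostname (trailing dot tolerated)."""
    return hostname.rstrip(".").endswith(BOT_SUFFIXES)

def verify_bot(ip: str) -> bool:
    """Reverse lookup, suffix check, then forward-confirm hostname -> same IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname_is_known_bot(hostname):
        return False
    try:
        # Forward-confirm: the claimed hostname must resolve back to the client IP,
        # otherwise anyone controlling their own PTR record could spoof it.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

This is too slow to run per request, so cache the verdict per IP; a verified-bot IP list can then be fed back into the Cloudflare allow rule.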
Sorry Africa, but I have to fully block the entire continent.
BLOCK: (ip.src.continent in {"AF"})
This one is the nasty-af check: anything weird, Tor, or not in Europe / USA gets an interactive challenge.
INTERACTIVE, ALL EXCEPT: (ip.src.country in {"XX"}) or (ip.src.country in {"T1"}) or (not ip.src.country in {"AX" "AQ" "AT" "BY" "BE" "CA" "HR" "CZ" "DK" "EE" "FI" "FR" "DE" "GL" "HU" "IS" "IE" "IT" "JP" "LV" "LI" "LT" "LU" "MK" "MT" "MC" "NL" "NO" "PL" "PT" "RO" "SK" "SI" "ES" "SE" "CH" "GB" "UM" "US" "VA"})
This one is a smoother, automatic check for Europe / USA.
NON-INTERACTIVE: (ip.src.country in {"AX" "AQ" "AT" "BY" "BE" "CA" "HR" "CZ" "DK" "EE" "FI" "FR" "DE" "GL" "HU" "IS" "IE" "IT" "JP" "LV" "LI" "LT" "LU" "MK" "MT" "MC" "NL" "NO" "PL" "PT" "RO" "SK" "SI" "ES" "SE" "CH" "GB" "UM" "US" "VA"})
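As a sanity check on the ordering, the three rules above boil down to this little decision function (country list abbreviated here; action names mirror Cloudflare's block / interactive challenge / non-interactive challenge):

```python
# Abbreviated allowlist; the real rule carries ~40 Europe/North America codes.
ALLOWED = {"SE", "FI", "US", "GB", "DE", "CA", "JP"}  # ...etc.

def action(continent: str, country: str) -> str:
    """Evaluate the three firewall rules in order, first match wins."""
    if continent == "AF":
        return "block"                      # rule 1: block the whole continent
    if country in {"XX", "T1"} or country not in ALLOWED:
        return "interactive_challenge"      # rule 2: unknown geo / Tor / outside allowlist
    return "non_interactive_challenge"      # rule 3: allowlisted countries get the soft check
```

Note that rule order matters: an African IP never reaches rule 2, and Tor ("T1") is caught by the interactive challenge even though it isn't a real country code.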
For search bots:
(lower(http.user_agent) contains "google")
or (lower(http.user_agent) contains "bingbot")
or (lower(http.user_agent) contains "applebot")
or (lower(http.user_agent) contains "openai")
or (lower(http.user_agent) contains "anthropic")
or (lower(http.user_agent) contains "facebook")
or (lower(http.user_agent) contains "twitterbot")
...seems to work. (I dropped the separate "facebookexternalhit" check since the "facebook" substring already matches it. These are plain user-agent matches though, so anything can spoof them.)
Current allow rule, Sweden plus the bot user agents:
(ip.src.country in {"SE"})
or (lower(http.user_agent) contains "google")
or (lower(http.user_agent) contains "claude")
or (lower(http.user_agent) contains "amazon")
or (lower(http.user_agent) contains "bing")
or (lower(http.user_agent) contains "applebot")
or (lower(http.user_agent) contains "openai")
or (lower(http.user_agent) contains "anthropic")
or (lower(http.user_agent) contains "facebook")
or (lower(http.user_agent) contains "gptbot")
or (lower(http.user_agent) contains "twitterbot")

