Friday, March 21, 2025
A different approach to blocking bad webbots by IP address
Web crawlers from LLM-based companies, as well as some specific solutions for blocking them, have been making the rounds in the past few days. I was curious to see just how many were hitting my web site, so I ran a few queries over the log files. To ensure consistent results, I queried the log file for last month:
total requests | 468439 |
unique IPs | 24654 |
IP | Requests |
---|---|
4.231.104.62 | 43242 |
198.100.155.33 | 26650 |
66.55.200.246 | 9057 |
74.80.208.170 | 8631 |
74.80.208.59 | 8407 |
216.244.66.239 | 5998 |
4.227.36.126 | 5832 |
20.171.207.130 | 5817 |
8.29.198.26 | 4946 |
8.29.198.25 | 4807 |
(Note: I'm not concerned about protecting any privacy here—given the number of requests, there is no way these are individuals. These are all companies hitting my site, and if companies are mining my data for their information, I'm going to do the same to them. So there.)
But it became apparent that it's hard to determine which requests come from a single entity—a company can employ a large pool of IP addresses to crawl the web, and it's not obvious which IPs are under the control of which company.
Or is it?
An idea suddenly hit me—a stray thought from the days when I wore a network admin hat. I recalled that BGP routing basically knows the network boundaries, since it's based on policy routing via ASNs. Could I map IP addresses to ASNs? A quick search and I found my answer—yes! Within a few minutes, I had converted the list of 24,654 unique IP addresses into 1,490 unique networks. I was then able to rework my initial query to include the ASN (or rather, the human-readable version instead of just the number):
IP | Requests | AS |
---|---|---|
4.231.104.62 | 43242 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
198.100.155.33 | 26650 | OVH, FR |
66.55.200.246 | 9057 | BIDDEFORD1, US |
74.80.208.170 | 8631 | CSTL, US |
74.80.208.59 | 8407 | CSTL, US |
216.244.66.239 | 5998 | WOW, US |
4.227.36.126 | 5832 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
20.171.207.130 | 5817 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
8.29.198.26 | 4946 | FEEDLY-DEVHD, US |
8.29.198.25 | 4807 | FEEDLY-DEVHD, US |
Now, I was curious as to how they identified themselves, so I reran the query to include the user agent string. The top eight identified themselves consistently:
Agent | Requests |
---|---|
Go-http-client/2.0 | 43236 |
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/132.0.0.0 Safari/537.36 | 26650 |
WF search/Nutch-1.12 | 9057 |
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | 8631 |
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | 8407 |
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) | 5998 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 5832 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 5817 |
The last two, however, had changing user agent strings (one table per IP, in the order listed above):
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1667 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1419 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 938 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 811 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 94 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 17 |
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1579 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1481 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 905 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 741 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 90 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 11 |
I'm not sure what the difference is between polling and fetching (checking the URLs shows two identical pages, differing only in “Poller” and “Fetcher”). But looking deeper into that is for another post.
The next query I ran was to see how many IPs (that hit my site in February) map to a particular ASN. The top 10 are:
AS | Count |
---|---|
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN | 4034 |
AMAZON-02, US | 1733 |
HWCLOUDS-AS-AP HUAWEI CLOUDS, HK | 1527 |
GOOGLE-CLOUD-PLATFORM, US | 996 |
COMCAST-7922, US | 895 |
AMAZON-AES, US | 719 |
TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN | 635 |
MICROSOFT-CORP-MSN-AS-BLOCK, US | 615 |
AS-VULTR, US | 599 |
ATT-INTERNET4, US | 472 |
So Alibaba US crawled my site from 4,034 different IP addresses. I haven't run the query to figure out how many requests each ASN made, but it should be straightforward to replace the IP address with the ASN to get a better count of which company is crawling my site the hardest.
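Conceptually, that query is just a group-by over the log file. Here's a minimal Python sketch; the ip_to_asn dictionary is hypothetical, standing in for whatever IP-to-ASN mapping has been built, and the filename is a placeholder:

```python
# Sketch: tally requests per ASN instead of per IP. Assumes an
# ip_to_asn dictionary mapping each client IP to an AS name, and a log
# file whose first field is the client IP (Apache common/combined format).
from collections import Counter

def requests_per_asn(logfile, ip_to_asn):
    counts = Counter()
    with open(logfile) as f:
        for line in f:
            ip = line.split(' ', 1)[0]
            counts[ip_to_asn.get(ip, 'UNKNOWN')] += 1
    return counts

# usage (filename is a placeholder):
# counts = requests_per_asn('access-2025-02.log', ip_to_asn)
# for asn, total in counts.most_common(10):
#     print(total, asn)
```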
And now I'm thinking: instead of ad-hoc banning of single IP addresses, or blocking huge swaths of IP addresses (like 47.0.0.0/8), might it not be better to block per ASN? The IP-to-ASN mapping service I found makes it quite easy to get the ASN of an IP address (and to map the ASN to a human-readable name). Instead of, for example, blocking 101.32.0.0/16, 119.28.0.0/16, 43.128.0.0/14, 43.153.0.0/16 and 49.51.0.0/16 (which isn't an exhaustive list by any means), just block IPs belonging to ASN 132203, otherwise known as “TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN.”
I don't know how effective that idea is, but the IP-to-ASN site I found does offer the information via DNS, so it shouldn't be that hard to do.
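For illustration, here's a minimal Python sketch against Team Cymru's IP-to-ASN DNS interface, one service that offers this kind of lookup (though not necessarily the one I found). It uses the third-party dnspython library:

```python
# Sketch: map an IP address to its ASN, and the ASN to a name, over DNS.
# Uses Team Cymru's IP-to-ASN interface as an example service, via the
# third-party dnspython library.
import dns.resolver

def ip_to_asn(ip):
    # Reverse the octets: 4.231.104.62 -> 62.104.231.4
    rev = '.'.join(reversed(ip.split('.')))
    answer = dns.resolver.resolve(rev + '.origin.asn.cymru.com', 'TXT')
    # TXT data looks like: "8075 | 4.224.0.0/12 | US | arin | 1997-02-20"
    # (multi-homed prefixes may list several ASNs; take the first)
    return answer[0].to_text().strip('"').split('|')[0].split()[0]

def asn_to_name(asn):
    answer = dns.resolver.resolve('AS' + asn + '.asn.cymru.com', 'TXT')
    # TXT data looks like: "8075 | US | arin | 2000-03-10 | MICROSOFT-..., US"
    return answer[0].to_text().strip('"').split('|')[-1].strip()

asn = ip_to_asn('4.231.104.62')
print(asn, asn_to_name(asn))
```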
A deeper dive into mapping web requests by ASN, not by IP address
I went ahead and replaced IP addresses with ASNs in the log file to find the network that sent the most requests to my blog for the month of February.
AS | Requests |
---|---|
MICROSOFT-CORP-MSN-AS-BLOCK, US | 78889 |
OVH, FR | 31837 |
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN | 25019 |
HETZNER-AS, DE | 23840 |
GOOGLE-CLOUD-PLATFORM, US | 21431 |
CSTL, US | 17225 |
HURRICANE, US | 15495 |
AMAZON-AES, US | 14430 |
FACEBOOK, US | 13736 |
AKAMAI-LINODE-AP Akamai Connected Cloud, SG | 12673 |
Even though Alibaba US has the most unique IPs hitting my blog, Microsoft is still the network making the most requests. So let's see how Microsoft presents itself to my web server. Here are the user agents it sends:
Agent | Requests |
---|---|
Go-http-client/2.0 | 43236 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 23978 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36 | 7953 |
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 | 2955 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot | 210 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | 161 |
DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html) | 123 |
'DuckDuckBot-Https/1.1; (+https://duckduckgo.com/duckduckbot)' | 122 |
Python/3.9 aiohttp/3.10.6 | 28 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.36 Safari/537.36 | 14 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.114 Safari/537.36 | 14 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.68 | 10 |
DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html) | 10 |
DuckAssistBot/1.1; (+http://duckduckgo.com/duckassistbot.html) | 10 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 | 6 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.143 Safari/537.36 | 6 |
python-requests/2.32.3 | 5 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.142 Safari/537.36 | 5 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 | 4 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0 | 4 |
DuckDuckBot-Https/1.1; (+https://duckduckgo.com/duckduckbot) | 4 |
Twingly Recon | 3 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot) | 3 |
Mozilla/5.0 (compatible; Twingly Recon; twingly.com) | 3 |
python-requests/2.28.2 | 2 |
newspaper/0.9.1 | 2 |
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 | 2 |
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b | 2 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36 | 2 |
http.rb/5.1.1 (Mastodon/4.2.10; +https://trystero.social/) Bot | 1 |
http.rb/5.1.1 (Mastodon/4.2.10; +https://trystero.social/) | 1 |
Mozilla/5.0 (Windows NT 6.1; WOW64) SkypeUriPreview Preview/0.5 skype-url-preview@microsoft.com | 1 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 1 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36 | 1 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48 | 1 |
Mastodon/4.4.0-alpha.2 (http.rb/5.2.0; +https://sns.mszpro.com/) Bot | 1 |
Mastodon/4.4.0-alpha.2 (http.rb/5.2.0; +https://sns.mszpro.com/) | 1 |
Mastodon/4.3.3 (http.rb/5.2.0; +https://the.voiceover.bar/) Bot | 1 |
Mastodon/4.3.3 (http.rb/5.2.0; +https://the.voiceover.bar/) | 1 |
Mastodon/4.3.3 (http.rb/5.2.0; +https://discuss.systems/) Bot | 1 |
Mastodon/4.3.3 (http.rb/5.2.0; +https://discuss.systems/) | 1 |
The top result comes from a single IP address and probably requires a separate post, since it's weird and annoying. But the rest—you've got Bing, you've got OpenAI, you've got several Mastodon instances—it seems most of these are from Microsoft's cloud offering. A mixture of things.
What about Facebook?
Agent | Requests |
---|---|
meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) | 13497 |
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) | 207 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 | 12 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 | 4 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 | 4 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 | 4 |
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/59.0 | 4 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 Edg/132.0.0.0 | 2 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 | 2 |
Hmm … looks like I have a few readers at Facebook, but other than that, nothing terribly interesting.
Alibaba, on the other hand, is frightening. Out of 25,019 requests, it presented 581 different user agents. From looking at what was requested, I don't think it's 500 Chinese people reading my blog—it's definitely bots crawling my site (and amusingly, there are requests for the /robots.txt file, but without a proper user agent to go by, it's hard to block them via that file).
I can think of one conclusion here—filtering by ASN can help tremendously, but it comes with the risk of blocking legitimate traffic.
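If I were to try it, the mechanics might look like this: a sketch that pulls an ASN's announced prefixes from RIPEstat's public API (one of several sources for that data) and prints firewall rules.

```python
# Sketch: turn an ASN into firewall "drop" rules. Uses RIPEstat's
# announced-prefixes API as one source of prefix data; the ASN is
# Tencent's, from the table above.
import json
import urllib.request

ASN = 'AS132203'   # TENCENT-NET-AP-CN
url = 'https://stat.ripe.net/data/announced-prefixes/data.json?resource=' + ASN

with urllib.request.urlopen(url) as response:
    data = json.load(response)

for entry in data['data']['prefixes']:
    prefix = entry['prefix']
    if ':' in prefix:
        continue   # skip IPv6 prefixes for this example
    print('iptables -A INPUT -s ' + prefix + ' -j DROP')
```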
Still no information on who “The Knowledge AI” is or was
Back in July 2019, I was investigating some bad bots on my website when I came across a bot that identified itself simply as “The Knowledge AI”—and it was the number one robot hitting my site. Most bots that identify themselves will give a URL to a page that describes their usage, like Barkrowler (to pick one that recently crawled my site). But not so “The Knowledge AI”. That was all it said: “The Knowledge AI”. It was very hard to Google, but I wouldn’t be surprised if it was OpenAI.
The earliest I can find “The Knowledge AI” crawling my site was April of 2018, and despite starting on April 16th, it was the second most active robot that month. In May it was the number one bot, and it stayed there through October of 2022 (about 4½ years), after which it pretty much dropped—from 32,000+ requests in October of 2022 to 85 in November of 2022. After that it was sporadic, showing up in single-digit hits until January of 2024. It may still be crawling my site, but if it is, it is no longer identifying itself.
I don’t know if “The Knowledge AI” was an LLM company’s crawler, but if it was, not giving a link explaining the bot is suspicious. It’s the rare crawler that doesn’t identify itself with at least a URL describing it. The fact that it held the number one crawling spot on my site for 4½ years is also suspicious. As robots go, it didn’t affect the web server all that much (I’ve come across worse), and well over 90% of its requests were valid (unlike MJ12, which had a 75% failure rate). And my /robots.txt file doesn’t exclude any robot from scanning, so I can’t really complain about it.
My comment on “Mitigating SourceHut's partial outage caused by aggressive crawlers | Lobsters”
Even though the log data is a few years old, I don't think that IPs change from ASN to ASN all that much (but I could be wrong on that). I checked the IPs used by “The Knowledge AI” in May 2018, and in October 2022, and they didn't change that much. They were still the same /24 networks across that time.
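That check is easy to script. A sketch, assuming hypothetical files with one IP address per line for each month:

```python
# Sketch: compare the /24 networks seen in two months. The filenames
# are placeholders; each file holds one IP address per line.
def networks(filename):
    with open(filename) as f:
        return {'.'.join(line.split('.')[:3]) + '.0/24'
                for line in f if line.strip()}

may_2018 = networks('knowledge-ai-2018-05.txt')
oct_2022 = networks('knowledge-ai-2022-10.txt')
print('shared /24s: ', sorted(may_2018 & oct_2022))
print('only in 2018:', sorted(may_2018 - oct_2022))
print('only in 2022:', sorted(oct_2022 - may_2018))
```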
Looking up the information today is very disappointing—the IPs now map to Hurricane Electric LLC, a backbone provider.
So no real information about who “The Knowledge AI” might have been.
Sigh.
Now a bit about feed readers
There are a few bots acting less than optimally that aren't some LLM-based company scraping my site. I think. Anyway, the first one I mentioned earlier (again, one table per Feedly IP):
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1667 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1419 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 938 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 811 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 94 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 17 |
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1579 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1481 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 905 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 741 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 90 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 11 |
This is feedly, a company that offers a news reader (and I'd like to thank the 67 subscribers I have—thank you). The first issue I have with this client is the apparently redundant requests from six different clients. An issue because I only have three different feeds: the Atom feed, the RSS feed and the JSON feed. The poller seems to be acting correctly—16 subscribers to my Atom feed and 6 to the RSS feed. The other four? The fetchers? I'm not sure what's going on there. There's one for the RSS feed, and three for the Atom feed. And one of them is a typo—it's requesting “//index.atom” instead of the proper “/index.atom” (but apparently Apache allows it). How do I have 16 subscribers to “/index.atom” and another 37 for “//index.atom”? What, exactly, is the difference between the two? And can't you fix the “//index.atom” reference? To me, that's an obvious typo, one that could be verified by retrieving both “/index.atom” and “//index.atom” and seeing they're the same.
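That verification is nearly a one-liner. A sketch, with a placeholder host standing in for my server:

```python
# Sketch: verify that "//index.atom" and "/index.atom" serve the same
# document. The host below is a placeholder for my server.
import urllib.request

def fetch(path):
    with urllib.request.urlopen('https://example.com' + path) as response:
        return response.read()

print(fetch('/index.atom') == fetch('//index.atom'))   # expect: True
```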
Anyway, the second issue I have with feedly is the apparent lack of caching on their end. They do not make conditional requests, and while they aren't exactly slamming my server, they are making multiple requests per hour for a resource that doesn't change all that often (excluding today, that is).
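For comparison, here's what a polite poller might do. A sketch of a conditional (and compression-friendly) fetch, with a placeholder feed URL:

```python
# Sketch: a polite feed poller doing a conditional GET with compression.
# The URL is a placeholder; etag/last_modified would be saved from the
# previous fetch rather than hard-coded to None.
import urllib.error
import urllib.request

url = 'https://example.com/index.atom'
etag = None
last_modified = None

request = urllib.request.Request(url)
request.add_header('Accept-Encoding', 'gzip')
if etag:
    request.add_header('If-None-Match', etag)
if last_modified:
    request.add_header('If-Modified-Since', last_modified)

try:
    with urllib.request.urlopen(request) as response:
        etag = response.headers.get('ETag')
        last_modified = response.headers.get('Last-Modified')
        body = response.read()   # may be gzipped; check Content-Encoding
        print('feed changed,', len(body), 'bytes')
except urllib.error.HTTPError as error:
    if error.code == 304:
        print('304 Not Modified—nothing to transfer')
    else:
        raise
```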
Then there's the bot at IP address 4.231.104.62. It made 43,236 requests for “/index.atom”, 5 invalid requests of the form “/gopher://gopher.conman.org/0Phlog:2025/02/…” and one other valid request for this page. It's not the 5 invalid requests or the 1 valid request that has me weirded out—it's the 43,236 requests to my Atom feed. That's one request roughly every 56 seconds (2,419,200 seconds in February divided by 43,236 requests)! And even worse—they aren't conditional requests! Of all the bots, this is the one I feel most like blocking at the firewall level—just have it drop the packets entirely.
At least it supports compressed results.
Sheesh.
As for the rest—of the 109 bots that fetched the Atom feed at least once per day (I put the cutoff at 28 or more requests during February), only 31 did so conditionally. That's a horrible rate. And of the 31 that did so conditionally, most don't support compression. So on the one hand, the majority of bots that fetch the Atom feed do so compressed; on the other hand, most of the bots that do fetch conditionally don't support compression.
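(How do I know? A conditional fetch of a mostly unchanging resource shows up in the logs as a 304 response. Here's a rough sketch of that tally, assuming Apache's combined log format; detecting compression support would additionally require logging the Accept-Encoding request header, which this format doesn't include.)

```python
# Rough sketch: flag feed fetchers as "conditional" if they ever received
# a 304 response. Assumes Apache's combined log format; the filename is
# a placeholder, and bots are keyed by user agent.
import re
from collections import Counter

LOG = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

fetches = Counter()
conditional = set()

with open('access-2025-02.log') as f:
    for line in f:
        m = LOG.match(line)
        if m is None:
            continue
        ip, method, path, status, agent = m.groups()
        if path != '/index.atom':
            continue
        fetches[agent] += 1
        if status == '304':
            conditional.add(agent)

daily = [a for a, n in fetches.items() if n >= 28]   # at least once a day
print(len(daily), 'bots fetched daily;',
      sum(1 for a in daily if a in conditional), 'of them conditionally')
```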
Sigh.