Friday, March 21, 2025
A different approach to blocking bad webbots by IP address
Web crawlers for LLM-based companies, as well as some specific solutions to blocking them, have been making the rounds in the past few days. I was curious to see just how many were hitting my web site, so I ran a few queries over the log files. To ensure consistent results, I decided to query the log file for last month:
Metric | Count |
---|---|
total requests | 468439 |
unique IPs | 24654 |
IP | Requests |
---|---|
4.231.104.62 | 43242 |
198.100.155.33 | 26650 |
66.55.200.246 | 9057 |
74.80.208.170 | 8631 |
74.80.208.59 | 8407 |
216.244.66.239 | 5998 |
4.227.36.126 | 5832 |
20.171.207.130 | 5817 |
8.29.198.26 | 4946 |
8.29.198.25 | 4807 |
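Counts like these fall out of a few lines of Python over the raw log. A sketch, assuming a common/combined log format where the client IP is the first field of each line (the sample lines below are made up for illustration):

```python
from collections import Counter

def tally(log_lines):
    """Count requests per client IP, assuming the IP is the first
    whitespace-delimited field of each log line."""
    per_ip = Counter()
    for line in log_lines:
        ip = line.split(maxsplit=1)[0]
        per_ip[ip] += 1
    return per_ip

# Hypothetical sample lines in common log format:
sample = [
    '4.231.104.62 - - [01/Feb/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '4.231.104.62 - - [01/Feb/2025:00:00:02 +0000] "GET /a HTTP/1.1" 200 512',
    '198.100.155.33 - - [01/Feb/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 512',
]
counts = tally(sample)
print(sum(counts.values()), len(counts))  # total requests, unique IPs
```

The real queries would read the month's log file instead of a hard-coded list, but the aggregation is the same.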
(Note: I'm not concerned about protecting any privacy here; given the number of requests, there is no way these are individuals. These are all companies hitting my site, and if companies are mining their data for my information, I'm going to do the same to them. So there.)
But it became apparent that it's hard to determine which requests come from a single entity: a company can employ a large pool of IP addresses to crawl the web, and it's hard to figure out which IPs are under the control of which company.
Or is it?
An idea suddenly hit me, a stray thought from my days wearing a network-admin hat: BGP routing basically knows network boundaries, because it does policy routing based on ASNs (autonomous system numbers). Could I map IP addresses to ASNs? A quick search found my answer: yes! Within a few minutes, I had converted the list of 24,654 unique IP addresses to 1,490 unique networks. I was then able to rework my initial query to include the AS (or rather, the human-readable name instead of just the number):
IP | Requests | AS |
---|---|---|
4.231.104.62 | 43242 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
198.100.155.33 | 26650 | OVH, FR |
66.55.200.246 | 9057 | BIDDEFORD1, US |
74.80.208.170 | 8631 | CSTL, US |
74.80.208.59 | 8407 | CSTL, US |
216.244.66.239 | 5998 | WOW, US |
4.227.36.126 | 5832 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
20.171.207.130 | 5817 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
8.29.198.26 | 4946 | FEEDLY-DEVHD, US |
8.29.198.25 | 4807 | FEEDLY-DEVHD, US |
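The mapping itself is easy to automate. As an illustration (I'm not saying this is the service I used), Team Cymru runs a well-known IP-to-ASN DNS interface: reverse the IPv4 octets, query a TXT record under origin.asn.cymru.com, and parse the pipe-delimited answer. A sketch of the plumbing, minus the actual DNS call:

```python
def cymru_query_name(ip):
    """Build the TXT query name for Team Cymru's IP-to-ASN DNS service:
    the IPv4 octets reversed, under origin.asn.cymru.com."""
    octets = ip.split('.')
    return '.'.join(reversed(octets)) + '.origin.asn.cymru.com'

def parse_cymru_txt(txt):
    """Parse a Team Cymru-style TXT answer such as
    '15169 | 8.8.8.0/24 | US | arin | 2000-03-30'
    into (asn, prefix)."""
    fields = [f.strip() for f in txt.split('|')]
    return int(fields[0]), fields[1]

print(cymru_query_name('4.231.104.62'))
# prints: 62.104.231.4.origin.asn.cymru.com
# the actual lookup would be e.g.:
#   dig +short TXT 62.104.231.4.origin.asn.cymru.com
```

Batch the unique IPs through something like this and the 24,654-to-1,490 reduction takes only minutes.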
Now, I was curious as to how they identified themselves, so I reran the query to include the user agent string. The top eight identified themselves consistently:
Agent | Requests |
---|---|
Go-http-client/2.0 | 43236 |
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/132.0.0.0 Safari/537.36 | 26650 |
WF search/Nutch-1.12 | 9057 |
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | 8631 |
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | 8407 |
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) | 5998 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 5832 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 5817 |
The last two (the FEEDLY-DEVHD addresses), however, each had a changing user agent string:
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1667 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1419 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 938 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 811 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 94 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 17 |
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1579 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1481 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 905 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 741 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 90 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 11 |
I'm not sure what the difference is between polling and fetching (checking the URLs shows two identical pages, differing only in “Poller” and “Fetcher”). But looking deeper into that is for another post.
The next query I ran was to see how many IPs (that hit my site in February) map to a particular ASN; the top 10 are:
AS | Count |
---|---|
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN | 4034 |
AMAZON-02, US | 1733 |
HWCLOUDS-AS-AP HUAWEI CLOUDS, HK | 1527 |
GOOGLE-CLOUD-PLATFORM, US | 996 |
COMCAST-7922, US | 895 |
AMAZON-AES, US | 719 |
TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN | 635 |
MICROSOFT-CORP-MSN-AS-BLOCK, US | 615 |
AS-VULTR, US | 599 |
ATT-INTERNET4, US | 472 |
So Alibaba US crawled my site from 4,034 different IP addresses. I haven't yet run the query to count how many requests each ASN made, but it should be straightforward: replace the IP address with its ASN to get a better picture of which company is crawling my site the hardest.
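That per-ASN tally is a one-liner once each request has been annotated with its AS name. A sketch over a few hypothetical, already-mapped requests:

```python
from collections import Counter

# Hypothetical sample: one (ip, as_name) pair per request, already mapped.
requests_by_as = Counter(as_name for _ip, as_name in [
    ('4.231.104.62',   'MICROSOFT-CORP-MSN-AS-BLOCK, US'),
    ('4.227.36.126',   'MICROSOFT-CORP-MSN-AS-BLOCK, US'),
    ('198.100.155.33', 'OVH, FR'),
])
for as_name, n in requests_by_as.most_common():
    print(as_name, n)
```

The same `Counter` keyed on AS name instead of IP collapses thousands of addresses into a per-company total.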
And now I'm thinking: instead of ad-hoc banning of single IP addresses, or blocking huge swaths of address space (like 47.0.0.0/8), might it not be better to block per ASN? The IP-to-ASN mapping service I found makes it quite easy to get the ASN of an IP address (and to map the ASN to a human-readable name). Instead of, for example, blocking 101.32.0.0/16, 119.28.0.0/16, 43.128.0.0/14, 43.153.0.0/16 and 49.51.0.0/16 (which isn't an exhaustive list by any means), just block IPs belonging to ASN 132203, otherwise known as “TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN.”
I don't know how effective that idea is, but the IP-to-ASN site I found does offer the information via DNS, so it shouldn't be that hard to do.
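To actually block per ASN, you'd need the list of prefixes the ASN announces (route-object databases such as RADb can supply those) and then a membership test per incoming address. A minimal sketch of the matching side using Python's ipaddress module, reusing the Tencent prefixes above as a hypothetical blocklist:

```python
import ipaddress

# Hypothetical blocklist: prefixes attributed to AS132203 (Tencent),
# per the partial list above.
BLOCKED_PREFIXES = [ipaddress.ip_network(p) for p in (
    '101.32.0.0/16', '119.28.0.0/16', '43.128.0.0/14',
    '43.153.0.0/16', '49.51.0.0/16',
)]

def is_blocked(ip):
    """True if the address falls inside any blocked prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_PREFIXES)

print(is_blocked('43.129.1.1'))   # True: inside 43.128.0.0/14
print(is_blocked('8.8.8.8'))      # False
```

In practice the prefix list would be refreshed periodically, since ASN-to-prefix mappings change as routes are announced and withdrawn.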