Friday, March 21, 2025
A different approach to blocking bad webbots by IP address
Web crawlers for LLM-based companies, as well as some specific solutions to blocking them, have been making the rounds in the past few days. I was curious to see just how many were hitting my web site, so I ran a few queries over the log files. To ensure consistent results, I decided to query the log file for last month:
Metric | Count |
---|---|
total requests | 468439 |
unique IPs | 24654 |
IP | Requests |
---|---|
4.231.104.62 | 43242 |
198.100.155.33 | 26650 |
66.55.200.246 | 9057 |
74.80.208.170 | 8631 |
74.80.208.59 | 8407 |
216.244.66.239 | 5998 |
4.227.36.126 | 5832 |
20.171.207.130 | 5817 |
8.29.198.26 | 4946 |
8.29.198.25 | 4807 |
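Counts like these fall out of a few lines of Python over the raw log. A sketch, assuming a common/combined log format where the client IP is the first field of each line (the sample lines below are made up for illustration):

```python
from collections import Counter

def tally(log_lines):
    """Count requests per client IP, assuming the IP is the first
    whitespace-delimited field of each log line."""
    per_ip = Counter()
    for line in log_lines:
        ip = line.split(maxsplit=1)[0]
        per_ip[ip] += 1
    return per_ip

# Hypothetical sample lines in common log format:
sample = [
    '4.231.104.62 - - [01/Feb/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '4.231.104.62 - - [01/Feb/2025:00:00:02 +0000] "GET /a HTTP/1.1" 200 512',
    '198.100.155.33 - - [01/Feb/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 512',
]
counts = tally(sample)
print(sum(counts.values()), len(counts))  # total requests, unique IPs
```

The real queries would read the month's log file instead of a hard-coded list, but the aggregation is the same.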
(Note: I'm not concerned about protecting any privacy here; given the number of requests, there is no way these are individuals. These are all companies hitting my site, and if companies are mining their data for my information, I'm going to do the same to them. So there.)
But it became apparent that it's hard to determine which requests come from a single entity: a company can employ a large pool of IP addresses to crawl the web, and it's hard to figure out which IPs are under the control of which company.
Or is it?
An idea suddenly hit me, a stray thought from my days wearing a network-admin hat: BGP routing basically knows network boundaries, because it does policy routing based on ASNs (autonomous system numbers). Could I map IP addresses to ASNs? A quick search found my answer: yes! Within a few minutes, I had converted the list of 24,654 unique IP addresses to 1,490 unique networks. I was then able to rework my initial query to include the AS (or rather, the human-readable name instead of just the number):
IP | Requests | AS |
---|---|---|
4.231.104.62 | 43242 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
198.100.155.33 | 26650 | OVH, FR |
66.55.200.246 | 9057 | BIDDEFORD1, US |
74.80.208.170 | 8631 | CSTL, US |
74.80.208.59 | 8407 | CSTL, US |
216.244.66.239 | 5998 | WOW, US |
4.227.36.126 | 5832 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
20.171.207.130 | 5817 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
8.29.198.26 | 4946 | FEEDLY-DEVHD, US |
8.29.198.25 | 4807 | FEEDLY-DEVHD, US |
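The mapping itself is easy to automate. As an illustration (I'm not saying this is the service I used), Team Cymru runs a well-known IP-to-ASN DNS interface: reverse the IPv4 octets, query a TXT record under origin.asn.cymru.com, and parse the pipe-delimited answer. A sketch of the plumbing, minus the actual DNS call:

```python
def cymru_query_name(ip):
    """Build the TXT query name for Team Cymru's IP-to-ASN DNS service:
    the IPv4 octets reversed, under origin.asn.cymru.com."""
    octets = ip.split('.')
    return '.'.join(reversed(octets)) + '.origin.asn.cymru.com'

def parse_cymru_txt(txt):
    """Parse a Team Cymru-style TXT answer such as
    '15169 | 8.8.8.0/24 | US | arin | 2000-03-30'
    into (asn, prefix)."""
    fields = [f.strip() for f in txt.split('|')]
    return int(fields[0]), fields[1]

print(cymru_query_name('4.231.104.62'))
# prints: 62.104.231.4.origin.asn.cymru.com
# the actual lookup would be e.g.:
#   dig +short TXT 62.104.231.4.origin.asn.cymru.com
```

Batch the unique IPs through something like this and the 24,654-to-1,490 reduction takes only minutes.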
Now, I was curious as to how they identified themselves, so I reran the query to include the user agent string. The top eight identified themselves consistently:
Agent | Requests |
---|---|
Go-http-client/2.0 | 43236 |
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/132.0.0.0 Safari/537.36 | 26650 |
WF search/Nutch-1.12 | 9057 |
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | 8631 |
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | 8407 |
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) | 5998 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 5832 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 5817 |
The last two (the FEEDLY-DEVHD addresses), however, each had a changing user agent string:
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1667 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1419 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 938 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 811 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 94 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 17 |
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1579 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1481 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 905 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 741 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 90 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 11 |
I'm not sure what the difference is between polling and fetching (checking the URLs shows two identical pages, differing only in “Poller” and “Fetcher”). But looking deeper into that is for another post.
The next query I ran was to see how many IPs (that hit my site in February) map to a particular ASN; the top 10 are:
AS | Count |
---|---|
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN | 4034 |
AMAZON-02, US | 1733 |
HWCLOUDS-AS-AP HUAWEI CLOUDS, HK | 1527 |
GOOGLE-CLOUD-PLATFORM, US | 996 |
COMCAST-7922, US | 895 |
AMAZON-AES, US | 719 |
TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN | 635 |
MICROSOFT-CORP-MSN-AS-BLOCK, US | 615 |
AS-VULTR, US | 599 |
ATT-INTERNET4, US | 472 |
So Alibaba US crawled my site from 4,034 different IP addresses. I haven't yet run the query to count how many requests each ASN made, but it should be straightforward: replace the IP address with its ASN to get a better picture of which company is crawling my site the hardest.
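That per-ASN tally is a one-liner once each request has been annotated with its AS name. A sketch over a few hypothetical, already-mapped requests:

```python
from collections import Counter

# Hypothetical sample: one (ip, as_name) pair per request, already mapped.
requests_by_as = Counter(as_name for _ip, as_name in [
    ('4.231.104.62',   'MICROSOFT-CORP-MSN-AS-BLOCK, US'),
    ('4.227.36.126',   'MICROSOFT-CORP-MSN-AS-BLOCK, US'),
    ('198.100.155.33', 'OVH, FR'),
])
for as_name, n in requests_by_as.most_common():
    print(as_name, n)
```

The same `Counter` keyed on AS name instead of IP collapses thousands of addresses into a per-company total.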
And now I'm thinking: instead of ad-hoc banning of single IP addresses, or blocking huge swaths of address space (like 47.0.0.0/8), might it not be better to block per ASN? The IP-to-ASN mapping service I found makes it quite easy to get the ASN of an IP address (and to map the ASN to a human-readable name). Instead of, for example, blocking 101.32.0.0/16, 119.28.0.0/16, 43.128.0.0/14, 43.153.0.0/16 and 49.51.0.0/16 (which isn't an exhaustive list by any means), just block IPs belonging to ASN 132203, otherwise known as “TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN.”
I don't know how effective that idea is, but the IP-to-ASN site I found does offer the information via DNS, so it shouldn't be that hard to do.
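To actually block per ASN, you'd need the list of prefixes the ASN announces (route-object databases such as RADb can supply those) and then a membership test per incoming address. A minimal sketch of the matching side using Python's ipaddress module, reusing the Tencent prefixes above as a hypothetical blocklist:

```python
import ipaddress

# Hypothetical blocklist: prefixes attributed to AS132203 (Tencent),
# per the partial list above.
BLOCKED_PREFIXES = [ipaddress.ip_network(p) for p in (
    '101.32.0.0/16', '119.28.0.0/16', '43.128.0.0/14',
    '43.153.0.0/16', '49.51.0.0/16',
)]

def is_blocked(ip):
    """True if the address falls inside any blocked prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_PREFIXES)

print(is_blocked('43.129.1.1'))   # True: inside 43.128.0.0/14
print(is_blocked('8.8.8.8'))      # False
```

In practice the prefix list would be refreshed periodically, since ASN-to-prefix mappings change as routes are announced and withdrawn.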