The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Friday, March 21, 2025

A different approach to blocking bad webbots by IP address

Web crawlers for LLM-based companies, as well as some specific solutions to blocking them, have been making the rounds in the past few days. I was curious to see just how many were hitting my web site, so I ran a few queries over the log files. To ensure consistent results, I decided to query the log file for last month:

Quick summary of results for February 2025
total requests 468439
unique IPs 24654

Top 10 requests per IP
IP Requests
4.231.104.62 43242
198.100.155.33 26650
66.55.200.246 9057
74.80.208.170 8631
74.80.208.59 8407
216.244.66.239 5998
4.227.36.126 5832
20.171.207.130 5817
8.29.198.26 4946
8.29.198.25 4807
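
The queries themselves aren't shown, so here is a minimal sketch of how counts like these might be produced. It assumes the log is in Apache-style Combined Log Format, where the client IP is the first whitespace-separated field; the function names and the sample lines (which use documentation addresses, not real traffic) are made up for illustration.

```python
from collections import Counter

def requests_per_ip(lines):
    """Count requests per client IP.

    Assumption: the IP is the first whitespace-separated field of each
    line, as in Combined Log Format.
    """
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)
        if fields:                      # skip blank lines
            counts[fields[0]] += 1
    return counts

def summarize(lines, top=10):
    """Return total requests, unique IP count, and the busiest IPs."""
    counts = requests_per_ip(lines)
    return {
        "total_requests": sum(counts.values()),
        "unique_ips": len(counts),
        "top": counts.most_common(top),
    }

# Made-up sample lines (documentation addresses), not real log data.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Feb/2025:00:00:01 -0500] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [01/Feb/2025:00:00:02 -0500] "GET /a HTTP/1.1" 200 128 "-" "OtherBot/2.0"',
    '192.0.2.1 - - [01/Feb/2025:00:00:03 -0500] "GET /b HTTP/1.1" 200 256 "-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

For a real log, passing an open file object instead of the sample list should work the same way, since the functions only iterate over lines.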

(Note: I'm not concerned about protecting any privacy here. Given the number of results, there is no way any of these is an individual. These are all companies hitting my site, and if companies are mining their data for my information, I'm going to do the same to them. So there.)

But it became apparent that it's hard to determine which requests are coming from a single entity—it's clear that a company can employ a large pool of IP addresses to crawl the web, and it's hard to figure out what IPs are under control of which company.

Or is it?

An idea suddenly hit me, a stray thought from the days when I was wearing a network admin hat: I recalled that BGP routing basically knows the boundaries between networks, as it's based on policy routing via ASNs. I wondered if I could map IP addresses to ASNs. A quick search and I found my answer: yes! Within a few minutes, I had converted a list of 24,654 unique IP addresses to 1,490 unique networks. I was then able to rework my initial query to include the ASN (or rather, the human-readable name instead of just the number):

Requests per IP/ASN
IP Requests AS
4.231.104.62 43242 MICROSOFT-CORP-MSN-AS-BLOCK, US
198.100.155.33 26650 OVH, FR
66.55.200.246 9057 BIDDEFORD1, US
74.80.208.170 8631 CSTL, US
74.80.208.59 8407 CSTL, US
216.244.66.239 5998 WOW, US
4.227.36.126 5832 MICROSOFT-CORP-MSN-AS-BLOCK, US
20.171.207.130 5817 MICROSOFT-CORP-MSN-AS-BLOCK, US
8.29.198.26 4946 FEEDLY-DEVHD, US
8.29.198.25 4807 FEEDLY-DEVHD, US
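
The mapping above came from an existing lookup service, not from a hand-built table. Purely to illustrate the underlying idea (each announced prefix belongs to an origin ASN, and the most specific matching prefix wins), here is a small sketch using Python's standard `ipaddress` module. The prefix table and function names are hypothetical; the prefixes are documentation ranges and the ASNs come from the range reserved for documentation.

```python
import ipaddress

# Hypothetical prefix-to-ASN table for illustration only.
# Prefixes are documentation ranges; ASNs 64496-64511 are reserved
# for documentation use.
EXAMPLE_PREFIXES = [
    ("192.0.2.0/24", 64496),
    ("198.51.100.0/24", 64497),
    ("198.51.100.128/25", 64498),   # more specific than the /24 above
]

def build_table(entries):
    """Convert (prefix string, ASN) pairs into (network object, ASN) pairs."""
    return [(ipaddress.ip_network(prefix), asn) for prefix, asn in entries]

def asn_for_ip(ip, table):
    """Return the ASN of the longest matching prefix, or None if no match."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, asn in table:
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, asn)
    return best[1] if best else None

if __name__ == "__main__":
    table = build_table(EXAMPLE_PREFIXES)
    for ip in ("198.51.100.5", "198.51.100.200", "203.0.113.9"):
        print(ip, asn_for_ip(ip, table))
```

A linear scan is fine for a handful of prefixes; a full routing table would probably call for a trie or a sorted structure instead.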

Now, I was curious as to how they identified themselves, so I reran the query to include the user agent string. The top eight identified themselves consistently:

Requests per Agent
Agent Requests
Go-http-client/2.0 43236
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/132.0.0.0 Safari/537.36 26650
WF search/Nutch-1.12 9057
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) 8631
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) 8407
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) 5998
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) 5832
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) 5817

The last two, however, had a changing user agent string:

Identifiers for 8.29.198.26
Agent Requests
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) 1667
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) 1419
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) 938
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) 811
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) 94
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) 17

Identifiers for 8.29.198.25
Agent Requests
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) 1579
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) 1481
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) 905
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) 741
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) 90
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) 11
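
A breakdown like the two tables above could be sketched as follows. This is not the query actually used; it assumes Combined Log Format with the user agent as the final double-quoted field, and it does not handle agents that contain escaped quotes. The function names and sample lines are invented for illustration.

```python
from collections import Counter, defaultdict

def agents_per_ip(lines):
    """Map each client IP to a Counter of the user agents it sent.

    Assumption: the user agent is the last double-quoted field on the
    line and contains no embedded double quotes.
    """
    result = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line.endswith('"'):
            continue                    # not in the expected format
        parts = line.rsplit('"', 2)     # [prefix, agent, '']
        if len(parts) != 3:
            continue
        ip = line.split(None, 1)[0]
        result[ip][parts[1]] += 1
    return result

def ips_with_changing_agent(lines):
    """Return the IPs that used more than one user agent string."""
    return sorted(ip for ip, agents in agents_per_ip(lines).items()
                  if len(agents) > 1)

# Made-up sample lines (documentation addresses), not real log data.
SAMPLE_AGENT_LOG = [
    '192.0.2.1 - - [01/Feb/2025:00:00:01 -0500] "GET /feed HTTP/1.1" 200 512 "-" "ExampleBot/1.0 (6 subscribers)"',
    '192.0.2.1 - - [01/Feb/2025:00:10:01 -0500] "GET /feed HTTP/1.1" 200 512 "-" "ExampleBot/1.0 (16 subscribers)"',
    '198.51.100.7 - - [01/Feb/2025:00:00:02 -0500] "GET / HTTP/1.1" 200 128 "-" "OtherBot/2.0"',
]

if __name__ == "__main__":
    print(ips_with_changing_agent(SAMPLE_AGENT_LOG))
```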

I'm not sure what the difference is between polling and fetching (checking the URLs shows two identical pages, only differing in “Poller” and “Fetcher”). But looking deeper into that is for another post.

The next query I ran was to see how many IPs (that hit my site in February) map to a particular ASN, and the top 10 are:

IPs per AS
AS Count
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN 4034
AMAZON-02, US 1733
HWCLOUDS-AS-AP HUAWEI CLOUDS, HK 1527
GOOGLE-CLOUD-PLATFORM, US 996
COMCAST-7922, US 895
AMAZON-AES, US 719
TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN 635
MICROSOFT-CORP-MSN-AS-BLOCK, US 615
AS-VULTR, US 599
ATT-INTERNET4, US 472

So Alibaba US crawled my site from 4,034 different IP addresses. I haven't yet run the query to figure out how many requests each ASN made, but it should be straightforward to replace the IP address with the ASN and get a better count of which company is crawling my site the hardest.
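
That per-ASN query hasn't been run, so the following is only a sketch of what "replace IP address with the ASN" might look like once per-IP counts and an IP-to-AS mapping are in hand. The function names are hypothetical. The data is just the ten rows from the "Requests per IP/ASN" table, so the totals it produces are partial and say nothing about the full month.

```python
from collections import Counter

def requests_per_asn(ip_counts, ip_to_as):
    """Sum per-IP request counts into per-AS totals."""
    totals = Counter()
    for ip, n in ip_counts.items():
        totals[ip_to_as.get(ip, "UNKNOWN")] += n
    return totals

def ips_per_asn(ip_counts, ip_to_as):
    """Count how many distinct IPs were seen for each AS."""
    seen = Counter()
    for ip in ip_counts:
        seen[ip_to_as.get(ip, "UNKNOWN")] += 1
    return seen

# The ten rows from the "Requests per IP/ASN" table (partial data).
TOP10_COUNTS = {
    "4.231.104.62": 43242, "198.100.155.33": 26650, "66.55.200.246": 9057,
    "74.80.208.170": 8631, "74.80.208.59": 8407, "216.244.66.239": 5998,
    "4.227.36.126": 5832, "20.171.207.130": 5817, "8.29.198.26": 4946,
    "8.29.198.25": 4807,
}
TOP10_AS = {
    "4.231.104.62": "MICROSOFT-CORP-MSN-AS-BLOCK, US",
    "198.100.155.33": "OVH, FR",
    "66.55.200.246": "BIDDEFORD1, US",
    "74.80.208.170": "CSTL, US",
    "74.80.208.59": "CSTL, US",
    "216.244.66.239": "WOW, US",
    "4.227.36.126": "MICROSOFT-CORP-MSN-AS-BLOCK, US",
    "20.171.207.130": "MICROSOFT-CORP-MSN-AS-BLOCK, US",
    "8.29.198.26": "FEEDLY-DEVHD, US",
    "8.29.198.25": "FEEDLY-DEVHD, US",
}

if __name__ == "__main__":
    for name, total in requests_per_asn(TOP10_COUNTS, TOP10_AS).most_common():
        print(f"{total:6d}  {name}")
```

Even on these ten rows the picture shifts: the three Microsoft addresses combine to 54,891 requests, and the two CSTL addresses combine to 17,038.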

And now I'm thinking: instead of ad-hoc banning of single IP addresses, or blocking huge swaths of IP addresses (like 47.0.0.0/8), might it not be better to block per ASN? The IP-to-ASN mapping service I found makes it quite easy to get the ASN of an IP address (and to map the ASN to a human-readable name). Instead of, for example, blocking 101.32.0.0/16, 119.28.0.0/16, 43.128.0.0/14, 43.153.0.0/16 and 49.51.0.0/16 (which isn't an exhaustive list by any means), just block IPs belonging to ASN 132203, otherwise known as “TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN.”

I don't know how effective that idea is, but the IP-to-ASN site I found does offer the information via DNS, so it shouldn't be that hard to do.
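
The service isn't named here, so the sketch below makes a loud assumption: that it behaves like Team Cymru's DNS interface, where a TXT query for the reversed IPv4 octets under `origin.asn.cymru.com` returns a pipe-separated string that begins with the origin ASN. The function names are hypothetical, and the actual DNS lookup is left to a caller-supplied function (`lookup_txt`), since Python's standard library has no TXT resolver.

```python
import ipaddress

# Assumption: a Team Cymru-style zone; the post does not name its service.
ORIGIN_ZONE = "origin.asn.cymru.com"

def origin_query_name(ip):
    """Build the DNS name to query for an IPv4 address (octets reversed)."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4")
    return ".".join(reversed(str(addr).split("."))) + "." + ORIGIN_ZONE

def parse_origin_txt(txt):
    """Parse a TXT answer of the assumed form 'ASN | prefix | CC | ...'.

    The first field may hold several space-separated ASNs.
    """
    fields = [f.strip() for f in txt.strip().strip('"').split("|")]
    return {
        "asns": [int(a) for a in fields[0].split()],
        "prefix": fields[1] if len(fields) > 1 else None,
    }

def is_blocked(ip, blocked_asns, lookup_txt):
    """Return True if any origin ASN for `ip` is in `blocked_asns`.

    `lookup_txt` is a hypothetical callable: DNS name -> TXT string or None.
    """
    txt = lookup_txt(origin_query_name(ip))
    if not txt:
        return False                    # no answer: fail open in this sketch
    return any(asn in blocked_asns for asn in parse_origin_txt(txt)["asns"])

if __name__ == "__main__":
    # Only prints the name that would be queried; no network access.
    print(origin_query_name("4.231.104.62"))
```

Failing open on a missing answer is a design choice made only for this sketch; a real deployment would also want to cache results so that it doesn't issue a DNS query for every request.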

Obligatory AI Disclaimer

No AI was used in the making of this site, unless otherwise noted.

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2025 by Sean Conner. All Rights Reserved.