Tuesday, April 07, 2026
Observations on blocking various webbots
Going through the logs from my web server for March, I noticed that 26% of all requests resulted in a client error (stuff like “404 Not Found” or “429 Too Many Requests”). These requests are more annoying than they are debilitating, but ideally, I would love a way to crash these bots, as they're mostly scanning my site for exploits; fully 50% are just scanning for various PHP-based scripts (which I don't use at all) and the rest for a variety of other files that can lead to exploits. Short of that, the only real option is to block such requests at the firewall, as there's no real point in switching a response from “404 Not Found” to “403 Forbidden”; the bot authors won't change their methods just because the status changes. Such scanning is fully automated and as stateless as possible (given modern infrastructure, a complete scan of the Internet can easily be done within a week).
Identifying such bad bots wasn't hard. One simple method I tried was to track all the requests made last month: if a unique IP address made at least five requests, and there were more client errors (statuses 400‥499) than good responses (200‥399), it was counted as “blockable.” That easily caught the most egregious bots, with no false positives as far as I could see.
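The counting rule above can be sketched in a few lines of Python. This is only an illustration, not the script actually used: the function name `blockable_ips`, the `min_requests` parameter, and the input shape (an iterable of `(ip, status)` pairs already parsed out of the log) are all assumptions made for the example.

```python
from collections import defaultdict


def blockable_ips(requests, min_requests=5):
    """Return the set of IPs that look like bad bots.

    `requests` is assumed to be an iterable of (ip, status) pairs,
    one per logged request. An IP is "blockable" when it made at
    least `min_requests` requests and had more client errors
    (400..499) than good responses (200..399).
    """
    # Per-IP tally: [total requests, client errors, good responses]
    counts = defaultdict(lambda: [0, 0, 0])
    for ip, status in requests:
        tally = counts[ip]
        tally[0] += 1
        if 400 <= status <= 499:
            tally[1] += 1
        elif 200 <= status <= 399:
            tally[2] += 1
    return {
        ip
        for ip, (total, errors, good) in counts.items()
        if total >= min_requests and errors > good
    }
```

Server errors (500 and up) count toward the total but toward neither side of the comparison, which is one reading of the rule as stated.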
But such a method would require tracking around 100,000 to 200,000 unique IPs per month in some way and then blocking the bad ones
(about 10% of all unique IPs).
I've learned over the years that iptables,
the firewall system I use,
has some hard limits on the number of rules in a given chain
(which I found out the hard way when blocking ssh attempts;
I gave up and now restrict ssh to only a few hosts).
And like I said, this is just an annoyance and not an existential threat, so setting up such a system to track IPs and block a certain subset while at the same time rotating out old blocks is just not worth the resulting Rube Goldbergesque machinery required to handle it. Been there, done that, not worth the tee shirt.
The next thought I had was maybe I could identify bad bots that don't properly identify themselves with the new hot header courtesy of Google: Sec-CH-UA.
Google's Chrome browser
(which has, I think, an 80% or more market share)
will send this header.
So the thought is that if the User-Agent header mentions “Chrome” then check to see if the request also includes the Sec-CH-UA header and if not,
then it's a bot so send back a “403 Forbidden” result.
It won't necessarily stop the bots,
especially the ones feeding AI,
but it does send a signal.
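As a rough sketch of that check, again in Python purely for illustration: the function name `looks_like_fake_chrome` and the idea of receiving the request headers as a plain dictionary are assumptions for the example, not how my web server actually represents requests.

```python
def looks_like_fake_chrome(headers):
    """Return True if a request claims to be Chrome but lacks Sec-CH-UA.

    `headers` is assumed to be a dict mapping header names to values.
    Header names are case-insensitive, so they are lowercased first.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    return "Chrome" in user_agent and "sec-ch-ua" not in normalized
```

A request for which this returns true would get the “403 Forbidden” response. One caveat worth checking before relying on it: browsers send client hint headers such as Sec-CH-UA only over HTTPS, so a check like this would misfire on plain HTTP requests.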
So I added support to my web server to record and log any request that claims to be “Chrome” and does not include a Sec-CH-UA header,
and let it run for several days to see if it might be worth it.
The results were very disappointing: 85% of such requests were from feed readers. Well … so much for that idea.
![Oh Christmas Tree! My Christmas Tree! Rise up and hear the bells! [Self-portrait with a Christmas Tree] Oh Christmas Tree! My Christmas Tree! Rise up and hear the bells!](https://www.conman.org/people/spc/about/2025/1203.t.jpg)