Friday, September 26, 2025
Yet more notes on web bot activity
For the past few months, every other week my server (which hosts this blog) would just go crazy for a day and require a full reboot to get back to normal. I haven't tracked down a root cause, but I do suspect it has to do with web bot activity, which has been increasing steadily. I ran a query over the logs for August, generating the number of requests per second, and here are the top ten results (a sketch of the query follows the table):
| timestamp | host | RPS |
|---|---|---|
| 26/Aug/2025:03:26:36 -0400 | 76.14.125.194 | 740 |
| 26/Aug/2025:03:26:29 -0400 | 76.14.125.194 | 735 |
| 26/Aug/2025:03:26:35 -0400 | 76.14.125.194 | 697 |
| 26/Aug/2025:03:26:37 -0400 | 76.14.125.194 | 693 |
| 26/Aug/2025:03:25:54 -0400 | 76.14.125.194 | 666 |
| 26/Aug/2025:03:25:53 -0400 | 76.14.125.194 | 607 |
| 26/Aug/2025:03:26:28 -0400 | 76.14.125.194 | 589 |
| 26/Aug/2025:03:26:38 -0400 | 76.14.125.194 | 576 |
| 26/Aug/2025:03:26:17 -0400 | 76.14.125.194 | 574 |
| 26/Aug/2025:03:25:49 -0400 | 76.14.125.194 | 539 |
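For the curious, a minimal sketch of such a query in Python, assuming Apache's combined log format (the log path here is a placeholder, not my actual setup):

```python
# A sketch only: counts requests per (second, client) pair from an
# Apache combined-format log and prints the ten busiest entries.
from collections import Counter

counts = Counter()
with open("/var/log/apache2/access.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) < 5:
            continue
        host = fields[0]
        # fields[3] and fields[4] hold "[26/Aug/2025:03:26:36 -0400]"
        timestamp = fields[3].lstrip("[") + " " + fields[4].rstrip("]")
        counts[(timestamp, host)] += 1

for (timestamp, host), rps in counts.most_common(10):
    print(f"{timestamp} | {host} | {rps}")
```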
Websites like Google or MyLinkedFaceTikInstaPinMeTokBookTrestSpaceGramInWe might be able to handle loads like this, but I'm running a blog on a single server. These numbers are insane! Fortunately, this level of activity didn't last long, but it certainly made things “interesting” on my server for a few minutes:
| timestamp | RPM |
|---|---|
| 03:23 | 27 |
| 03:24 | 72 |
| 03:25 | 4752 |
| 03:26 | 11131 |
| 03:27 | 1185 |
| 03:28 | 58 |
| 03:29 | 26 |
It's looking like spikes in activity might be a reason for my server freaking out. Apache doesn't come with a built-in way to limit connections per IP address, but a search led me to mod_limitipconn, a simple module that limits an IP address to a maximum number of concurrent connections. It does nothing about rate limiting per se, but it can't hurt, and it's simple enough to install.
So earlier this week, I installed it. I set a maximum connection limit of 30; that is, no single IP address can connect more than 30 times concurrently. I just picked a number high enough (possibly too high) to still allow legitimate traffic through while keeping the worst abuse away. The code as downloaded returns a “503 Service Unavailable” when it kicks in, but I changed it to return a “429 Too Many Requests,” which better reflects the actual situation (I think the code was originally written before 429 was a valid response code).
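The configuration is short. A sketch, using the module's MaxConnPerIP directive (mod_limitipconn relies on mod_status and requires ExtendedStatus to be turned on):

```
# mod_limitipconn depends on mod_status with extended status enabled
ExtendedStatus On

<IfModule mod_limitipconn.c>
  <Location />
    # No single IP address may hold more than 30 simultaneous connections
    MaxConnPerIP 30
  </Location>
</IfModule>
```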
And it's working. It has already caught 18 bots (or rather, bots with 18 distinct IP addresses), and they are all from the same ASN: GOOGLE-CLOUD-PLATFORM, US (and the user agents are all obviously forged). But what's curious about these is that a subset of the requests include a referrer URL. Most browsers these days restrict sending the referring link, or outright don't send it at all (to respect privacy), so seeing one sent by a web bot is unusual.
Even more curious is that these referring links have nothing to do with the page being requested. There are, so far this month, 147 requests from the GOOGLE-CLOUD-PLATFORM ASN sending Slashdot as the referrer. And I don't mean a page on Slashdot, but the main page of Slashdot. There are also referrers to Cisco (on 201 requests), Petrobras (on 581 requests), and NBC News (on 221 requests), along with 435 other websites showing up as referrers on requests. I don't understand the reasoning here. It's not as if I'll let a request through just because it came from Slashdot. I don't publish referring links. I know sites used to publish referring links back in the day, and spammers used this to gain PageRank for their own pages (or for their clients), but that can't be worth it these days? Can it? Are these old bots still running but long forgotten? What is the angle here?
Anyway, I'll have to wait and see if limiting IP connections will solve my server issues. I do hope that's all it is.
Oh, it's a bug on my side that prevents full conditional requests
I'm still poring over web server log files, and I'm noticing that many of the feed readers fetching my various feed files aren't using conditional requests.
I was in the process of writing to the author of one of them describing the oversight when I noticed that particular feed reader using both methods of conditional requests: the If-Modified-Since header and the If-None-Match header, in conjunction with a HEAD request.
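On the wire, such a request looks something like this (the hostname, path, and validator values are made up for illustration):

```
HEAD /index.atom HTTP/1.1
Host: example.org
If-None-Match: "2d81-63f1a2"
If-Modified-Since: Fri, 19 Sep 2025 14:02:11 GMT

HTTP/1.1 304 Not Modified
```

If the feed hasn't changed, the server answers 304 Not Modified, and the reader knows to skip the follow-up GET.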
I thought I should test that with my web server, just to make sure it was not a bug on my side. Specifically, there's an Apache bug where compressed output interferes with the If-None-Match method: mod_deflate appends “-gzip” to the ETag it sends out, but the comparison on the next request is made against the original, unsuffixed value, so the two never match and the server keeps sending full responses.
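The failure mode, with the same made-up ETag: the server advertises the suffixed validator on a compressed response,

```
HTTP/1.1 200 OK
Content-Encoding: gzip
ETag: "2d81-63f1a2-gzip"
```

but then never recognizes it when it comes back:

```
HEAD /index.atom HTTP/1.1
Host: example.org
If-None-Match: "2d81-63f1a2-gzip"

HTTP/1.1 200 OK
```

A 200 where a 304 should have been.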
There is a workaround though:

```
RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"'
```

That rewrites the incoming If-None-Match header to work around the bug.
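So an incoming header like

```
If-None-Match: "2d81-63f1a2-gzip"
```

gets rewritten to

```
If-None-Match: "2d81-63f1a2-gzip", "2d81-63f1a2"
```

Since If-None-Match can carry a list of validators, offering both forms means one of them matches Apache's internal ETag, and the 304 responses start flowing again.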
Now maybe that whole conditional request thang with my webserver will work properly.
Sigh.