Tuesday, January 11, 2022
Let's look at some bots that aren't the MJ12Bot
I think it's time I stop blogging about work after my previous post. Work is getting a tad too depressing to think about, and my cynical side is saying that it won't matter where I go; it'd be more or less the same, with a higher probability of forced Microsoft Windows use. So instead of that depressing topic, let's take a look at something much lighter: the current state of Internet robots crawling my various sites!
Two weeks later and there are still bots attempting to follow endless redirections. I thought maybe I could attempt to figure out a contact, but alas, they're coming from all over the place (and yes, I'm finally naming IP addresses):
| IP address | # requests |
They're pretty much all from Amazon Web Services, so who knows who is running these bots. Just blocking them is too easy a solution. At this point, I'd like to do something to get their attention (as if thousands of the links they are crawling suddenly being listed as “gone” weren't enough of a clue). I don't necessarily mind bots crawling my sites, unless they're doing stupid things. I shall have to think on this a bit more.
I also had high hopes that I could stop empty requests to my Gemini server (which aren't allowed at all by the specification) by returning a non-standard response code with the text “Not a gopher server,” but alas, they are still happening. Does nobody bother checking the results of their bots' runs? I guess not.
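For the curious, here is a minimal sketch of what rejecting an empty Gemini request might look like. It is not my server's actual code: the function name and the status value `99` are placeholders made up for illustration (I'm not saying here which non-standard code I really return), and the only things taken from the Gemini specification are the CRLF-terminated request line, the 1024-byte URL limit, status `59` for a bad request, and status `20` for success. An empty line is a perfectly valid gopher request (it asks for the root menu), which is presumably why these show up at all.

```python
# Placeholder value: the post does not say which non-standard code is used.
NONSTANDARD_STATUS = "99"
BAD_REQUEST = "59"        # Gemini "bad request" status
MAX_URL_BYTES = 1024      # URL length limit from the Gemini specification


def respond_to_request(raw: bytes) -> str:
    """Return a Gemini response header for one raw request line."""
    if not raw.endswith(b"\r\n"):
        return f"{BAD_REQUEST} Request must end with CRLF\r\n"
    url = raw[:-2]
    if len(url) == 0:
        # An empty line is a gopher-style request, not a Gemini one.
        return f"{NONSTANDARD_STATUS} Not a gopher server\r\n"
    if len(url) > MAX_URL_BYTES:
        return f"{BAD_REQUEST} URL too long\r\n"
    # A real server would parse the URL and look up content here.
    return "20 text/gemini\r\n"


if __name__ == "__main__":
    print(repr(respond_to_request(b"\r\n")))
    print(repr(respond_to_request(b"gemini://example.com/\r\n")))
```

A bot that bothered to read the response header would see the odd status code and the message straight away, which is rather the point.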
And speaking of gopher, it's better there than on Gemini. Yes, there are a few agents that are attempting to use TLS, but fortunately, they cache previous failures so it's not every request. There are a few bots out there trying to exploit RDP (not much I can do about those) and a few that are confused into thinking my gopher site is actually my Gemini site sans TLS (What?). But I can live with 155 failed gopher requests out of 10,423 (about 1.5%) over the past month.
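The three kinds of stray gopher traffic above can be told apart from the first few bytes a client sends. The following is a rough sketch, not what my gopher server actually does; the function names are hypothetical. It relies on a TLS handshake record starting with the bytes `0x16 0x03`, an RDP connection request starting with a TPKT header (`0x03 0x00`), and a confused Gemini client sending a `gemini://` URL as its selector.

```python
def classify_gopher_request(first_bytes: bytes) -> str:
    """Guess what kind of client sent these opening bytes (hypothetical helper)."""
    if first_bytes[:2] == b"\x16\x03":
        return "tls"      # TLS handshake record (ClientHello)
    if first_bytes[:2] == b"\x03\x00":
        return "rdp"      # TPKT header used by RDP connection requests
    if first_bytes.lower().startswith(b"gemini://"):
        return "gemini"   # Gemini URL sent in the clear to the gopher port
    return "gopher"       # anything else is treated as an ordinary selector


def failure_percentage(failed: int, total: int) -> float:
    """Share of failed requests, as a percentage."""
    return 100.0 * failed / total


if __name__ == "__main__":
    print(classify_gopher_request(b"\x16\x03\x01\x02\x00"))
    print(classify_gopher_request(b"gemini://example.com/\r\n"))
    print(round(failure_percentage(155, 10423), 2))
```

The heuristics are deliberately crude; a selector that happened to begin with those bytes would be misclassified, but for a monthly tally that is close enough.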
And while I'm checking bots, I can't forget the web crawlers. Not much has changed on that front since July 2019, except that MJ12Bot has kept their promise never to crawl my site again. The Knowledge AI (which I cannot find any information on) is still the number one agent, with 68,000 requests in Debtember 2021, followed by 21,000 requests from Amazonbot. And it seems that the bots in general are making fewer requests to nonexistent pages (I mean, back in June 2019, The Knowledge AI made 170 bad requests; last month, 1).
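Numbers like these come from tallying the web server log by user agent. Here is a small sketch of one way to do that, assuming the log is in the common "combined" format and that a "bad request" means a 404 response; both are assumptions for the example, and the function name and the sample lines (using documentation-only IP addresses and a made-up `ExampleBot`) are invented.

```python
import re
from collections import Counter

# Combined log format: ip, ident, user, [time], "request", status, size,
# "referer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def tally_agents(lines):
    """Count total requests and 404 responses per user agent."""
    requests = Counter()
    not_found = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent")
        requests[agent] += 1
        if match.group("status") == "404":
            not_found[agent] += 1
    return requests, not_found


if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [01/Dec/2021:00:00:01 -0500] "GET / HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"',
        '192.0.2.1 - - [01/Dec/2021:00:00:02 -0500] "GET /nope HTTP/1.1" 404 321 "-" "ExampleBot/1.0"',
        '192.0.2.2 - - [01/Dec/2021:00:00:03 -0500] "GET / HTTP/1.1" 200 1234 "-" "OtherBot/2.0"',
    ]
    requests, not_found = tally_agents(sample)
    for agent, count in requests.most_common():
        print(agent, count, not_found[agent])
```

Feeding it a real log is a matter of passing an open file instead of the sample list.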
So, with the exception of bots stuck in redirection Hell in Gemini, things on the crawler front are looking pretty good.