The Boston Diaries

Tuesday, January 11, 2022

Let's look at some bots that aren't the MJ12Bot

I think it's time I stop blogging about work after my previous post. Work is getting a tad too depressing to think about and my cynical side is saying that it won't matter where I go, it'd be more or less the same with a higher probability of forced Microsoft Windows use. So instead of that depressing topic, let's take a look at something much lighter and less depressing—the current state of Internet robots crawling my various sites!

Two weeks later and there are still bots attempting to follow endless redirections. I thought maybe I could attempt to figure out a contact, but alas, they're coming from all over the place (and yes, I'm finally naming IP addresses):

Top 20 Gemini-based bots caught in redirection Hell
IP address	# requests
18.134.208.136	933
18.132.248.127	850
3.8.92.131	817
18.169.194.52	745
3.8.210.87	728
13.40.97.54	715
18.170.56.106	713
3.8.134.65	682
35.176.22.93	681
18.130.231.183	681
13.40.67.85	667
13.40.137.233	666
18.132.46.166	659
3.8.24.209	641
18.170.107.207	637
13.40.155.157	634
35.178.170.215	577
18.130.216.34	573
13.40.145.207	572
35.179.76.79	564

They're all pretty much from Amazon Web Services, so who knows who is running these bots. Just blocking them is too easy a solution; at this point, I'd like to do something to get their attention (as if thousands of the links they are crawling suddenly being listed as “gone” isn't enough of a clue). I don't necessarily mind bots crawling my sites, unless they're doing stupid things. I shall have to think on this a bit more.
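For the curious, a tally like the table above takes only a few lines to produce. What follows is a minimal sketch in Python, not the script actually used for this post. It assumes a hypothetical log format in which the client address is the first whitespace-separated field of each line; the function name `top_clients` and the sample lines are made up for illustration.

```python
from collections import Counter

def top_clients(log_lines, limit=20):
    """Return the `limit` busiest client addresses as (address, count) pairs."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:                  # skip blank lines
            counts[fields[0]] += 1  # assumption: the address is the first field
    return counts.most_common(limit)

# Made-up sample lines using documentation-range addresses, not real log data.
sample = [
    "192.0.2.1 gemini://example.com/a 31",
    "192.0.2.2 gemini://example.com/b 31",
    "192.0.2.1 gemini://example.com/c 31",
]

for address, count in top_clients(sample):
    print(f"{address}\t{count}")
```

Filtering the input down to only the requests that ended in a redirect would have to happen before the tally, and how to do that depends entirely on what the server actually logs.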

I also had high hopes that I could stop empty requests to my Gemini server (which aren't allowed at all by the specification) by returning a non-standard response code with the text “Not a gopher server” but alas, they are still happening. Does nobody bother checking the results of their bots running? I guess not.
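To make the empty-request case concrete: a Gemini request is a single URL followed by CRLF, so a request that is nothing but a line ending carries no URL at all (an empty line is how a gopher client asks for the top-level menu, hence the message). Here is a rough sketch in Python of the check, not the server's real code. The non-standard status code mentioned above isn't given here, so the sketch substitutes 59, the specification's "bad request" status; the function name `respond` is likewise hypothetical.

```python
def respond(raw_request: bytes) -> bytes:
    """Build a Gemini response header for one raw request line."""
    url = raw_request.rstrip(b"\r\n").strip()
    if not url:
        # Stand-in status: 59 is the spec's "bad request" code. The
        # non-standard code described in the post is not known here.
        return b"59 Not a gopher server\r\n"
    # Placeholder success header; a real server would parse the URL,
    # look up the resource, and send the body after this line.
    return b"20 text/gemini\r\n"

print(respond(b"\r\n"))
print(respond(b"gemini://example.com/\r\n"))
```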

And speaking of gopher, it's better there than Gemini. Yes, there are a few agents that are attempting to use TLS, but fortunately, they cache previous failures so it's not every request. There are a few bots out there trying to exploit RDP (not much I can do about those) and a few that are confused into thinking my gopher site is actually my Gemini site sans TLS (What?). But I can live with 155 failed gopher requests out of 10,423 over the past month.
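As a quick sanity check on those numbers, here is the failure rate worked out in Python (the variable names are just for illustration):

```python
failed, total = 155, 10_423

# Share of gopher requests that failed over the month, as a percentage.
rate = failed / total * 100
print(f"{failed} of {total} requests failed ({rate:.2f}%)")
```

That comes to roughly 1.5 percent of all gopher requests.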

And while I'm checking bots, I can't forget the web crawlers. And not much has changed on that front since July 2019 except that MJ12Bot has kept their promise never to crawl my site again. The Knowledge AI (which I cannot find any information on) is still the number one agent, with 68,000 requests in Debtember 2021, followed by 21,000 requests from Amazonbot. And it seems that the bots in general are making fewer requests to nonexistent pages (I mean, back in June 2019, The Knowledge AI made 170 bad requests; last month, 1).

So, with the exception of bots stuck in redirection Hell in Gemini, things on the crawler front are looking pretty good.
