The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, January 11, 2022

Let's look at some bots that aren't the MJ12Bot

I think it's time I stop blogging about work after my previous post. Work is getting a tad too depressing to think about and my cynical side is saying that it won't matter where I go, it'd be more or less the same with a higher probability of forced Microsoft Windows use. So instead of that depressing topic, let's take a look at something much lighter: the current state of Internet robots crawling my various sites!

Two weeks later and there are still bots attempting to follow endless redirections. I thought maybe I could attempt to figure out a contact, but alas, they're coming from all over the place (and yes, I'm finally naming IP addresses):

Top 20 Gemini based bots caught in redirection Hell
IP address # requests
18.134.208.136 933
18.132.248.127 850
3.8.92.131 817
18.169.194.52 745
3.8.210.87 728
13.40.97.54 715
18.170.56.106 713
3.8.134.65 682
35.176.22.93 681
18.130.231.183 681
13.40.67.85 667
13.40.137.233 666
18.132.46.166 659
3.8.24.209 641
18.170.107.207 637
13.40.155.157 634
35.178.170.215 577
18.130.216.34 573
13.40.145.207 572
35.179.76.79 564
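
For anyone wanting to build a similar table from their own logs, here is a minimal sketch. The log layout (`<ip> <status> <path>`, whitespace separated) and the choice of status `31` (Gemini's permanent redirect) are assumptions for illustration, not the format or status this server actually uses.

```python
from collections import Counter

def top_clients(log_lines, status="31", limit=20):
    """Count requests per IP for lines carrying the given status code."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        # Assumed layout: <ip> <status> <path>; skip anything malformed.
        if len(fields) >= 2 and fields[1] == status:
            counts[fields[0]] += 1
    return counts.most_common(limit)

# Hypothetical sample lines, not real log data.
sample = [
    "18.134.208.136 31 /a",
    "3.8.92.131 31 /b",
    "18.134.208.136 31 /c",
    "3.8.92.131 20 /index.gmi",
]
for ip, n in top_clients(sample):
    print(ip, n)
```

With a real log, the same function could be fed an open file object, since it only needs an iterable of lines.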

They're all pretty much from Amazon Web Services so who knows who is running these bots. Just blocking them is too easy a solution. At this point, I'd like to do something to get their attention (as if thousands of the links they are crawling suddenly being listed as “gone” isn't enough of a clue). I don't necessarily mind bots crawling my sites, unless they're doing stupid things. I shall have to think on this a bit more.
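
For contrast, this is roughly what a crawler would need to do to stay out of redirection Hell. It is only a sketch: `fetch` is a hypothetical callable returning a `(status, meta)` pair, the cap of 5 is an assumed default, and redirect targets are assumed to be absolute URLs.

```python
def follow(url, fetch, max_redirects=5):
    """Follow redirects from `fetch`, giving up on loops or long chains."""
    seen = set()
    for _ in range(max_redirects + 1):
        if url in seen:
            raise RuntimeError(f"redirect loop at {url}")
        seen.add(url)
        status, meta = fetch(url)
        if not status.startswith("3"):
            # Not a redirect: hand back the final URL and response.
            return url, status, meta
        url = meta  # for 3x responses the meta field carries the new URL
    raise RuntimeError("too many redirects")
```

A chain that never ends, with every hop pointing at a fresh URL, trips the "too many redirects" error instead of generating hundreds of requests.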

I also had high hopes that I could stop empty requests to my Gemini server (which aren't allowed at all by the specification) by returning a non-standard response code with the text “Not a gopher server” but alas, that is still happening. Does nobody bother checking the results of their bots running? I guess not.
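
As an illustration of the check involved: a Gemini request is a single URL of at most 1024 bytes followed by CRLF, so an empty request is just a bare CRLF. The sketch below uses the standard status 59 (bad request) as a stand-in, since the actual non-standard code returned here isn't stated; the function name is hypothetical.

```python
MAX_REQUEST_BYTES = 1024  # URL length limit in the Gemini specification

def check_request(raw: bytes) -> str:
    """Return a Gemini response header for a bad request, or "" if acceptable."""
    if not raw.endswith(b"\r\n"):
        return "59 Request must end with CRLF\r\n"
    url = raw[:-2]
    if not url:
        # Empty request: what a gopher client sends to ask for the root menu.
        return "59 Not a gopher server\r\n"
    if len(url) > MAX_REQUEST_BYTES:
        return "59 Request too long\r\n"
    return ""

print(repr(check_request(b"\r\n")))
```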

And speaking of gopher, it's better there than Gemini. Yes, there are a few agents that are attempting to use TLS, but fortunately, they cache previous failures so it's not every request. There are a few bots out there trying to exploit RDP (not much I can do about those) and a few that are confused into thinking my gopher site is actually my Gemini site sans TLS (What?). But I can live with 155 failed gopher requests out of 10,423 over the past month.

And while I'm checking bots, I can't forget the web crawlers. And not much has changed on that front since July 2019 except that MJ12Bot has kept their promise never to crawl my site again. The Knowledge AI (which I cannot find any information on) is still the number one agent, with 68,000 requests in Debtember 2021, followed by 21,000 requests from Amazonbot. And it seems that the bots in general are making fewer requests to nonexistent pages (I mean, back in June 2019, The Knowledge AI made 170 bad requests; last month, 1).

So, with the exception of bots stuck in redirection Hell in Gemini, things on the crawler front are looking pretty good.

Obligatory Picture

[The future's so bright, I gotta wear shades]

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.