The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Thursday, July 11, 2019

Yet more observations about the MJ12Bot

I received a reply about MJ12Bot! Let's see …

From: Majestic <XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX>
To: Sean Conner <sean@conman.org>
Subject: [Majestic] Re: Your robot is making bogus requests to my webserver
Date: Thu, 11 Jul 2019 08:34:13 +0000

##- Please type your reply above this line -##

Oh … really? Sigh.

Anyway, the only questionable bit in the email was this line:

The prefix // in a link of course refers to the same site as the current page, over the same protocol, so this is why these URLs are being requested back from your server.

which is … somewhat correct. The double slash does mean “use the same protocol,” but it does not mean “the same site.” It denotes a “network-path reference” (RFC 3986, section 4.2) where, at a minimum, a hostname is required after the slashes. If this is just a misunderstanding on the developers' part, it could explain the behavior I'm seeing.
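To make the distinction concrete, here is a minimal sketch using `urljoin` from Python's standard library, which resolves references along the lines of RFC 3986. The page URL and both link targets are made-up examples, not links taken from my site or from the email.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical page URL, used only as the base for resolution.
BASE = "http://boston.conman.org/2019/07/11.1"

def resolve(reference, base=BASE):
    """Resolve a link found on the page at `base` into an absolute URL."""
    return urljoin(base, reference)

# One leading slash: an absolute-path reference.
# Same scheme AND same host as the current page.
same_site = resolve("/about/")

# Two leading slashes: a network-path reference.
# Same scheme only; what follows the slashes is a host, not a path.
other_site = resolve("//www.example.com/about/")

if __name__ == "__main__":
    print(same_site)    # http://boston.conman.org/about/
    print(other_site)   # http://www.example.com/about/
    print(urlsplit("//www.example.com/about/").netloc)  # www.example.com
```

A crawler that treats the second form like the first would end up asking the current server for a path that never existed there.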

And speaking of behavior, I decided to check the logs (again, using last month) one last time for two reports.

User Agents, sorted by most requests, for June 2019

404 (not found)  200 (okay)  Total requests  User agent
            170       42676           46334  The Knowledge AI
             21       36088           38097  Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
             46       16633           17130  Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
              5       15840           15928  Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)
              3       12304           12353  Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
             36        8412            8929  Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
              7        8428            8908  Gigabot
           5680        2015            7872  Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
             28        6604            6942  Barkrowler/0.9 (+http://www.exensa.com/crawl)
              0        4705            4737  istellabot/t.1.13

User Agents, sorted by most bad requests (404), for June 2019

404 (not found)  200 (okay)  Total requests  User agent
           5680        2015            7872  Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
            656         109             768  Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)
            177          45             553  Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2)
            170       42676           46334  The Knowledge AI
            120           0             120  Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)

(Note: The number of 404s and 200s might not add up to the total—there might be other requests that returned a different status not reported here.)
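For anyone wanting to produce similar reports, here is a rough sketch of one way to do the tally. It is not the script behind the numbers above. It assumes the logs are in Apache's “combined” format (an assumption; the post doesn't say), and the sample lines and bot names are invented.

```python
import re
from collections import Counter, defaultdict

# Assumption: Apache "combined" log format. Requests containing escaped
# quotes are not handled; this is only a sketch.
LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def tally(lines):
    """Count responses per user agent, keyed by HTTP status code."""
    counts = defaultdict(Counter)
    for line in lines:
        m = LINE.match(line)
        if m:
            counts[m.group("agent")][int(m.group("status"))] += 1
    return counts

def report(counts, column):
    """Return (404s, 200s, total, agent) rows, sorted descending on `column`."""
    rows = [(c[404], c[200], sum(c.values()), agent)
            for agent, c in counts.items()]
    return sorted(rows, key=lambda row: row[column], reverse=True)

# Invented sample data: two hypothetical bots.
SAMPLE = [
    '192.0.2.1 - - [01/Jun/2019:00:00:01 -0400] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [01/Jun/2019:00:00:02 -0400] "GET /x HTTP/1.1" 404 0 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [01/Jun/2019:00:00:03 -0400] "GET /y HTTP/1.1" 404 0 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [01/Jun/2019:00:00:04 -0400] "GET /z HTTP/1.1" 301 0 "-" "ExampleBot/1.0"',
    '192.0.2.2 - - [01/Jun/2019:00:00:05 -0400] "GET / HTTP/1.1" 200 512 "-" "OtherBot/2.0"',
    '192.0.2.2 - - [01/Jun/2019:00:00:06 -0400] "GET /a HTTP/1.1" 200 512 "-" "OtherBot/2.0"',
    '192.0.2.2 - - [01/Jun/2019:00:00:07 -0400] "GET /b HTTP/1.1" 404 0 "-" "OtherBot/2.0"',
]

by_total = report(tally(SAMPLE), column=2)  # most requests first
by_404 = report(tally(SAMPLE), column=0)    # most bad requests first
```

The total is the sum over every status seen, which is why (as in the note above) the 404 and 200 columns need not add up to it: the 301 in the sample counts toward ExampleBot's total but appears in neither column.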

MJ12Bot is the 8th most active client on my site, yet it has the top two spots for bad requests, beating out #3 by over an order of magnitude (the two versions combined made 6336 bad requests, over 35 times the 177 from #3).

But I don't have to worry about it since the email also stated they removed my site from their crawl list. Okay … I guess?

Obligatory Picture

[It's the most wonderful time of the year!]

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, which makes the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.
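As a small illustration of the day and month forms, here is a sketch of a helper that builds such links. The function name `permalink` is made up for this example, and it deliberately does not cover arbitrary portions of time, since that format isn't spelled out here.

```python
BASE = "http://boston.conman.org/"

def permalink(year, month, day=None):
    """Build a link to one day's entries, or to a whole month if `day` is omitted."""
    path = f"{year:04d}/{month:02d}"
    if day is not None:
        path += f"/{day:02d}"
    return BASE + path

day_link = permalink(2000, 8, 1)   # http://boston.conman.org/2000/08/01
month_link = permalink(2000, 8)    # http://boston.conman.org/2000/08
```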

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2019 by Sean Conner. All Rights Reserved.