Thursday, July 11, 2019

Yet more observations about the MJ12Bot

I received a reply about MJ12Bot! Let's see …

Sean Conner <>
[Majestic] Re: Your robot is making bogus requests to my webserver
Thu, 11 Jul 2019 08:34:13 +0000

Oh … really? Sigh.

Anyway, the only questionable bit in the email was this line:

The prefix // in a link of course refers to the same site as the current page, over the same protocol, so this is why these URLs are being requested back from your server.

which is … somewhat correct. It does mean “use the same protocol,” but the double slash denotes a “network-path reference” (RFC 3986, section 4.2), where, at a minimum, a hostname is required; it does not mean “the same site.” If this is just a misunderstanding on the developers' part, it could explain the behavior I'm seeing.
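
To make the distinction concrete, here is a minimal sketch using Python's standard urllib.parse. The page URL and the link are made-up examples, and the "misreading" at the end is only my guess at what a crawler could be doing, not something confirmed by the email.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical page and link, used only for illustration.
base = "https://blog.example.com/2019/07/11.1"
link = "//cdn.example.net/img/logo.png"

# The text right after "//" is parsed as the authority (host), not as a path.
parts = urlsplit(link)
print(parts.netloc)  # cdn.example.net
print(parts.path)    # /img/logo.png

# Resolving against the base keeps only the scheme of the current page.
resolved = urljoin(base, link)
print(resolved)      # https://cdn.example.net/img/logo.png

# One possible misreading (an assumption, not confirmed behavior):
# treat the whole reference as a path on the current host.
base_parts = urlsplit(base)
misread = f"{base_parts.scheme}://{base_parts.netloc}{link}"
print(misread)       # https://blog.example.com//cdn.example.net/img/logo.png
```

A request built the second way lands back on the originating server with a path that starts with a double slash, which would show up there as a 404.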

And speaking of behavior, I decided to check the logs one last time (again, using last month's) and pull two reports.

User Agents, sorted by most requests, for June 2019
404 (not found)  200 (okay)  Total requests  User agent
            170       42676           46334  The Knowledge AI
             21       36088           38097  Mozilla/5.0 (compatible; SemrushBot/3~bl; +
             46       16633           17130  Mozilla/5.0 (compatible; BLEXBot/1.0; +
              5       15840           15928  Mozilla/5.0 (compatible; AhrefsBot/6.1; +
              3       12304           12353  Mozilla/5.0 (compatible; bingbot/2.0; +
             36        8412            8929  Mozilla/5.0 (compatible;; +
              7        8428            8908  Gigabot
           5680        2015            7872  Mozilla/5.0 (compatible; MJ12bot/v1.4.8;
             28        6604            6942  Barkrowler/0.9 (+
              0        4705            4737  istellabot/t.1.13

User Agents, sorted by most bad requests (404), for June 2019
404 (not found)  200 (okay)  Total requests  User agent
           5680        2015            7872  Mozilla/5.0 (compatible; MJ12bot/v1.4.8;
            656         109             768  Mozilla/5.0 (compatible; MJ12bot/v1.4.7;
            177          45             553  Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2)
            170       42676           46334  The Knowledge AI
            120           0             120  Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)

(Note: The number of 404s and 200s might not add up to the total—there might be other requests that returned a different status not reported here.)
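
For anyone who wants similar reports from their own logs, here is a rough sketch of one way to tally them. It is not the script used for the numbers above. It assumes the web server writes the common Apache "combined" log format, and the names tally and report, along with the sample lines, are invented for the example.

```python
import re
from collections import defaultdict

# Assumed "combined" format: ... "request" status size "referer" "user agent"
LINE_RE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')

def tally(lines):
    """Return {user_agent: {"404": n, "200": n, "total": n}}."""
    counts = defaultdict(lambda: {"404": 0, "200": 0, "total": 0})
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines that do not match the assumed format
        status, agent = m.groups()
        row = counts[agent]
        row["total"] += 1          # every status counts toward the total
        if status in ("404", "200"):
            row[status] += 1       # only these two get their own column
    return dict(counts)

def report(counts, key, limit=10):
    """Return the top `limit` (agent, row) pairs sorted by the given column."""
    rows = sorted(counts.items(), key=lambda kv: kv[1][key], reverse=True)
    return rows[:limit]

# Made-up sample lines; a real run would iterate over the log file instead.
sample = [
    '192.0.2.1 - - [01/Jun/2019:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [01/Jun/2019:00:00:02 +0000] "GET //x HTTP/1.1" 404 128 "-" "ExampleBot/1.0"',
    '192.0.2.2 - - [01/Jun/2019:00:00:03 +0000] "GET /a HTTP/1.1" 301 0 "-" "OtherBot/2.0"',
]
counts = tally(sample)
for agent, row in report(counts, "404"):
    print(row["404"], row["200"], row["total"], agent)
```

Calling report(counts, "total") gives the first report and report(counts, "404") gives the second. The 301 in the sample shows why the 404 and 200 columns need not add up to the total.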

MJ12Bot is the 8th most active client on my site, yet it holds the top two spots for bad requests. Its two versions together made 6,336 bad requests, beating out #3 (177) by over an order of magnitude (more than 35 times the amount, in fact).

But I don't have to worry about it since the email also stated they removed my site from their crawl list. Okay … I guess?

