The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, July 09, 2019

How can a “commercial grade” web robot be so badly written?

Alex Schroeder was checking the status of web requests, and it made me wonder about the stats on my own server. One quick script later and I had some numbers:

Status of requests for boston.conman.org so far this month
Status  result               requests  percent
Total   -                       64542   100.01
200     OKAY                    53457    82.83
206     PARTIAL_CONTENT            12     0.02
301     MOVE_PERM                2421     3.75
304     NOT_MODIFIED             6185     9.58
400     BAD_REQUEST               101     0.16
401     UNAUTHORIZED              147     0.23
404     NOT_FOUND                2000     3.10
405     METHOD_NOT_ALLOWED         41     0.06
410     GONE                        5     0.01
500     INTERNAL_ERROR            173     0.27
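
A tally like this takes only a few lines. What follows is a minimal Python sketch, not the script that produced the table above. It assumes an access log in the Apache common or combined format, where the status code comes right after the quoted request line; the names `tally_statuses` and `report` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache-style "common" or "combined" log lines, where the
# three-digit status code directly follows the quoted request line.
STATUS_RE = re.compile(r'"[^"]*" (\d{3}) ')

def tally_statuses(lines):
    """Count HTTP status codes across an iterable of access-log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:  # lines that do not fit the assumed format are skipped
            counts[match.group(1)] += 1
    return counts

def report(counts):
    """Return (status, requests, percent) rows sorted by status code."""
    total = sum(counts.values())
    return [(status, n, round(100 * n / total, 2))
            for status, n in sorted(counts.items())]

if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python tally.py < access.log
    for status, n, pct in report(tally_statuses(sys.stdin)):
        print(status, n, pct)
```

Each percentage is rounded on its own, so the rows need not add up to exactly 100. That is one way a total such as 100.01 can come about.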

I'll have to check the INTERNAL_ERRORs and look into those 12 PARTIAL_CONTENT responses, but the rest seem okay. I was curious to see what was being requested that I didn't have, when I noticed that MJ12Bot was producing the majority of the NOT_FOUND responses.

Yes, sadly, most of the traffic around here is from bots. Lots and lots of bots.

Top agents requesting pages
requests percentage user agent
47721 74 Total (out of 64542)
16952 26 The Knowledge AI
9159 14 Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
5633 9 Mozilla/5.0 (compatible; VelenPublicWebCrawler/1.0; +https://velen.io)
4272 7 Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)
4046 6 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
3170 5 Mozilla/5.0 (compatible; Go-http-client/1.1; +centurybot9@gmail.com)
2146 3 Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
1197 2 Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)
1146 2 istellabot/t.1.13
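
A ranking like the one above can be sketched the same way. This is only an illustration under an assumption: the log is in the "combined" format, where the user agent is the last double-quoted field on each line. The function name `top_agents` is hypothetical.

```python
import re
from collections import Counter

# Assumption: "combined" log format, with the user agent as the last
# double-quoted field on the line.
AGENT_RE = re.compile(r'"([^"]*)"\s*$')

def top_agents(lines, n=10):
    """Return the n most frequent agents as (agent, requests, percent)."""
    counts = Counter()
    total = 0
    for line in lines:
        total += 1
        match = AGENT_RE.search(line)
        # Lines without a trailing quoted field are counted separately.
        counts[match.group(1) if match else "(unparsed)"] += 1
    # Percentages are relative to all requests, rounded to whole numbers.
    return [(agent, hits, round(100 * hits / total))
            for agent, hits in counts.most_common(n)]
```

The percentages are taken against every request in the log, not just the ones that make the top list, which matches how the "Total (out of 64542)" row reads.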

But it's been that way for years now. C'est la vie.

So I started looking closer at MJ12Bot and the requests it was generating, and … they were odd:

And so on. As they describe it:

Why do you keep crawling 404 or 301 pages?

We have a long memory and want to ensure that temporary errors, website down pages or other temporary changes to sites do not cause irreparable changes to your site profile when they shouldn't. Also if there are still links to these pages they will continue to be found and followed. Google have published a statement since they are also asked this question, their reason is of course the same as ours and their answer can be found here: Google 404 policy.

But those requests? They have a real issue with their bot. Looking over the requests, I see that they're pages I've linked to, but for whatever reason, their bot is making requests for remote pages on my server. Worse yet, they're quoted! The %22 parts—that's an encoded double quote. It's as if their bot saw “<A HREF="http://www.thomasedison.com">” and treated it as not only a link on my server, but escaped the quotes when making the request!
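
To see what the bot is actually sending, undo the percent-encoding. The short Python sketch below uses the standard library's `urllib.parse.unquote` on three of the request paths reported to the bot's operators later in this entry; the helper name `decode` is just for illustration.

```python
from urllib.parse import unquote

# Request paths as they appear in the email quoted later in this entry.
requests = [
    "//%22https://www.youtube.com/watch?v=LnxSTShwDdQ%5C%22",
    "//%22https://www.zaxbys.com//%22",
    "//%22/2003/11/%22",
]

def decode(path):
    """Undo percent-encoding so the characters the bot sent are visible."""
    return unquote(path)

for path in requests:
    print(decode(path))
```

Decoded, each path starts with a literal double quote (%22) and ends with one, and the first also carries a backslash (%5C) before its closing quote.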

Pssst! MJ12Bot! Quotes are optional! Both “<A HREF="http://www.thomasedison.com">” and “<A HREF=http://www.thomasedison.com>” are equivalent!
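
For comparison, here is a minimal sketch of link extraction that handles both forms. It is not how MJ12Bot works (that code isn't public here), just Python's standard `html.parser` and `urllib.parse.urljoin` doing the job; the names `LinkCollector` and `extract_links` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and hands over
        # attribute values with any surrounding quotes already removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return every link in html as an absolute URL."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    # urljoin leaves absolute URLs alone and resolves relative ones
    # against the page they were found on.
    return [urljoin(base_url, href) for href in parser.links]

BASE = "https://boston.conman.org/"
quoted = '<A HREF="http://www.thomasedison.com">Edison</A>'
unquoted = '<A HREF=http://www.thomasedison.com>Edison</A>'
print(extract_links(BASE, quoted))
print(extract_links(BASE, unquoted))
```

Both inputs yield the same remote URL, and neither turns into a request against this server. Only a relative link such as /2003/11/ gets resolved against the base.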

Sigh.

Annoyed, I sent them the following email:

From: Sean Conner <sean@conman.org>
To: bot@majestic12.co.uk
Subject: Your robot is making bogus requests to my webserver
Date: Tue, 9 Jul 2019 17:49:02 -0400

I've read your page on the mj12 bot, and I don't necessarily mind the 404s your bot generates, but I think there's a problem with your bot making totally bogus requests, such as:

//%22https://www.youtube.com/watch?v=LnxSTShwDdQ%5C%22
//%22https://www.zaxbys.com//%22
//%22/2003/11/%22
//%22gopher://auzymoto.net/0/glog/post0011/%22
//%22https://github.com/spc476/NaNoGenMo-2018/blob/master/valley.l/%22

I'm not a proxy server, so requesting a URL will not work, and even if I was a proxy server, the request itself is malformed so badly that I have to conclude your programmers are incompetent and don't care.

Could you at the very least fix your robot so it makes proper requests?

I then received a canned reply saying that they have, in fact, received my email and are looking into it.

Nice.

But I did a bit more investigation, and the results aren't pretty:

Requests and results for MJ12Bot
Status  result     number  percentage
Total   -            2164      100.00
200     OKAY          505       23.34
301     MOVE_PERM       4        0.18
404     NOT_FOUND    1655       76.48

So not only are they responsible for 83% of the bad requests I've seen, but nearly 77% of the requests they make are bad!

Just amazing programmers they have!
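
The two percentages can be checked with a few lines of Python. This sketch assumes "bad requests" means NOT_FOUND responses, and takes its figures from the two status tables in this entry (2000 NOT_FOUND responses site-wide; 1655 of MJ12Bot's 2164 requests). The helper `percent` is hypothetical.

```python
def percent(part, whole):
    """Return part as a percentage of whole, rounded to two places."""
    return round(100 * part / whole, 2)

# Assumption: "bad requests" are the 404 NOT_FOUND responses.
mj12_not_found = 1655   # NOT_FOUND responses sent to MJ12Bot
site_not_found = 2000   # NOT_FOUND responses for the whole site
mj12_requests = 2164    # all requests made by MJ12Bot

# Share of the site's 404s caused by MJ12Bot (the "83%" figure).
print(percent(mj12_not_found, site_not_found))

# Share of MJ12Bot's own requests that ended in a 404 (the "nearly 77%" figure).
print(percent(mj12_not_found, mj12_requests))
```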

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, https://boston.conman.org/, then add the date you are interested in, say 2000/08/01. That makes the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.