Tuesday, July 09, 2019
How can a “commercial grade” web robot be so badly written?
Alex Schroeder was checking the status of web requests, and it made me wonder about the stats on my own server. One quick script later and I had some numbers:
Status | result | requests | percent |
---|---|---|---|
Total | - | 64542 | 100.01 |
200 | OKAY | 53457 | 82.83 |
206 | PARTIAL_CONTENT | 12 | 0.02 |
301 | MOVE_PERM | 2421 | 3.75 |
304 | NOT_MODIFIED | 6185 | 9.58 |
400 | BAD_REQUEST | 101 | 0.16 |
401 | UNAUTHORIZED | 147 | 0.23 |
404 | NOT_FOUND | 2000 | 3.10 |
405 | METHOD_NOT_ALLOWED | 41 | 0.06 |
410 | GONE | 5 | 0.01 |
500 | INTERNAL_ERROR | 173 | 0.27 |
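For the curious, a tally like the one above can be pulled out of an Apache-style access log with a few lines of Python. This is a sketch, not my actual script; it assumes the common/combined log format, where the three-digit status code immediately follows the quoted request string:

```python
import re
from collections import Counter

# In the common/combined log formats, the status code is the
# field right after the quoted request string.
STATUS = re.compile(r'" (\d{3}) ')

def status_counts(lines):
    counts = Counter()
    for line in lines:
        m = STATUS.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Two made-up log lines for demonstration.
sample = [
    '127.0.0.1 - - [09/Jul/2019:00:00:01 -0400] "GET / HTTP/1.1" 200 1234',
    '127.0.0.1 - - [09/Jul/2019:00:00:02 -0400] "GET /nope HTTP/1.1" 404 0',
]
counts = status_counts(sample)
total = sum(counts.values())
for status, n in sorted(counts.items()):
    print(f"{status} {n} {100 * n / total:.2f}%")
```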
I'll have to check the INTERNAL_ERRORs and look into those 12 PARTIAL_CONTENT responses, but the rest seem okay. I was curious to see what was being requested that I didn't have, when I noticed that MJ12Bot was producing the majority of the NOT_FOUND responses.
Yes, sadly, most of the traffic around here is from bots. Lots and lots of bots.
requests | percentage | user agent |
---|---|---|
47721 | 74 | Total (out of 64542) |
16952 | 26 | The Knowledge AI |
9159 | 14 | Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html) |
5633 | 9 | Mozilla/5.0 (compatible; VelenPublicWebCrawler/1.0; +https://velen.io) |
4272 | 7 | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/) |
4046 | 6 | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) |
3170 | 5 | Mozilla/5.0 (compatible; Go-http-client/1.1; +centurybot9@gmail.com) |
2146 | 3 | Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) |
1197 | 2 | Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) |
1146 | 2 | istellabot/t.1.13 |
But it's been that way for years now. C'est la vie.
So I started looking closer at MJ12Bot and the requests it was generating, and … they were odd:
//%22http://www.thomasedison.com//%22
//%22https://github.com/spc476/NaNoGenMo-2018/blob/master/run.lua/%22
//%22/2018/08/24.1/%22
//%22https://kottke.org/19/04/life-sized-lego-electronics/%22
And so on. As they describe it:
Why do you keep crawling 404 or 301 pages?
We have a long memory and want to ensure that temporary errors, website down pages or other temporary changes to sites do not cause irreparable changes to your site profile when they shouldn't. Also if there are still links to these pages they will continue to be found and followed. Google have published a statement since they are also asked this question, their reason is of course the same as ours and their answer can be found here: Google 404 policy.
But those requests? They point to a real issue with their bot. Looking over the requests, I see that they're pages I've linked to, but for whatever reason, their bot is making requests for those remote pages on my server. Worse yet, they're quoted! The %22 parts are encoded double quotes. It's as if their bot saw “<A HREF="http://www.thomasedison.com">” and treated it not only as a link on my server, but escaped the quotes when making the request!

Pssst! MJ12Bot! Quotes are optional! Both “<A HREF="http://www.thomasedison.com">” and “<A HREF=http://www.thomasedison.com>” are equivalent!
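Percent-decoding one of those request paths makes the mangling plain; Python's standard urllib.parse.unquote will do it:

```python
from urllib.parse import unquote

# One of the actual paths MJ12Bot requested from my server.
path = '//%22http://www.thomasedison.com//%22'
print(unquote(path))  # //"http://www.thomasedison.com//"
```

The quotes from the HTML attribute ended up percent-encoded inside the request path itself.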
Sigh.
Annoyed, I sent them the following email:
From: Sean Conner <sean@conman.org>
To: bot@majestic12.co.uk
Subject: Your robot is making bogus requests to my webserver
Date: Tue, 9 Jul 2019 17:49:02 -0400
I've read your page on the mj12 bot, and I don't necessarily mind the 404s your bot generates, but I think there's a problem with your bot making totally bogus requests, such as:
//%22https://www.youtube.com/watch?v=LnxSTShwDdQ%5C%22
//%22https://www.zaxbys.com//%22
//%22/2003/11/%22
//%22gopher://auzymoto.net/0/glog/post0011/%22
//%22https://github.com/spc476/NaNoGenMo-2018/blob/master/valley.l/%22

I'm not a proxy server, so requesting a URL will not work, and even if I was a proxy server, the request itself is malformed so badly that I have to conclude your programmers are incompetent and don't care.
Could you at the very least fix your robot so it makes proper requests?
I then received a canned reply saying that they have, in fact, received my email and are looking into it.
Nice.
But I did a bit more investigation, and the results aren't pretty:
Status | result | number | percentage |
---|---|---|---|
Total | - | 2164 | 100.00 |
200 | OKAY | 505 | 23.34 |
301 | MOVE_PERM | 4 | 0.18 |
404 | NOT_FOUND | 1655 | 76.48 |
So not only are they responsible for 83% of the bad requests I've seen, but nearly 77% of the requests they make are bad!
Just amazing programmers they have!
Wednesday, July 10, 2019
Some more observations about the MJ12Bot
I received another reply from MJ12Bot about their badly written bot; it just said the person responsible for handling enquiries was out of the office for the day and that I should expect a response tomorrow. We shall see. In the meantime, I decided to check some of the other bots hitting my site and see how well they fare, request-wise. I'm using the logs from last month for this, so these results cover 30 days of traffic.
requests | percentage | user agent |
---|---|---|
167235 | 70 | Total (out of 239641) |
46334 | 19 | The Knowledge AI |
38097 | 16 | Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html) |
17130 | 7 | Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/) |
15928 | 7 | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/) |
12358 | 5 | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) |
8929 | 4 | Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler) |
8908 | 4 | Gigabot |
7872 | 3 | Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) |
6942 | 3 | Barkrowler/0.9 (+http://www.exensa.com/crawl) |
4737 | 2 | istellabot/t.1.13 |
So let's see some results:
Bot | 200 | % | 301 | % | 304 | % | 400 | % | 403 | % | 404 | % | 410 | % | 500 | % | Total | % |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The Knowledge AI | 42676 | 92.1 | 3352 | 7.2 | 0 | 0.0 | 127 | 0.3 | 4 | 0.0 | 170 | 0.4 | 5 | 0.0 | 0 | 0.0 | 46334 | 100.0 |
SemrushBot/3~bl | 36088 | 94.7 | 1873 | 4.9 | 0 | 0.0 | 110 | 0.3 | 0 | 0.0 | 21 | 0.1 | 5 | 0.0 | 0 | 0.0 | 38097 | 100.0 |
BLEXBot/1.0 | 16633 | 97.1 | 208 | 1.2 | 124 | 0.7 | 114 | 0.7 | 0 | 0.0 | 46 | 0.3 | 5 | 0.0 | 0 | 0.0 | 17130 | 100.0 |
AhrefsBot/6.1 | 15840 | 99.4 | 78 | 0.5 | 0 | 0.0 | 4 | 0.0 | 0 | 0.0 | 5 | 0.0 | 0 | 0.0 | 1 | 0.0 | 15928 | 99.9 |
bingbot/2.0 | 12304 | 99.6 | 35 | 0.3 | 0 | 0.0 | 6 | 0.0 | 0 | 0.0 | 3 | 0.0 | 5 | 0.0 | 0 | 0.0 | 12353 | 99.9 |
MegaIndex.ru/2.0 | 8412 | 94.2 | 456 | 5.1 | 0 | 0.0 | 24 | 0.3 | 0 | 0.0 | 36 | 0.4 | 1 | 0.0 | 0 | 0.0 | 8929 | 100.0 |
Gigabot | 8428 | 94.6 | 448 | 5.0 | 0 | 0.0 | 23 | 0.3 | 0 | 0.0 | 7 | 0.1 | 2 | 0.0 | 0 | 0.0 | 8908 | 100.0 |
MJ12bot/v1.4.8 | 2015 | 25.6 | 175 | 2.2 | 0 | 0.0 | 2 | 0.0 | 0 | 0.0 | 5680 | 72.2 | 0 | 0.0 | 0 | 0.0 | 7872 | 100.0 |
Barkrowler/0.9 | 6604 | 95.1 | 300 | 4.3 | 0 | 0.0 | 10 | 0.1 | 0 | 0.0 | 28 | 0.4 | 0 | 0.0 | 0 | 0.0 | 6942 | 99.9 |
istellabot/t.1.13 | 4705 | 99.3 | 28 | 0.6 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 4 | 0.1 | 4737 | 100.0 |
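A per-bot breakdown like the table above is just a two-level tally over the same log. A sketch of the idea (again assuming the combined log format, where the user agent is the last quoted field on the line):

```python
import re
from collections import Counter, defaultdict

# The status code follows the quoted request string; the user agent
# is the last quoted field in the combined log format.
LINE = re.compile(r'" (\d{3}) .*"([^"]*)"$')

def per_agent(lines):
    table = defaultdict(Counter)
    for line in lines:
        m = LINE.search(line)
        if m:
            status, agent = m.groups()
            table[agent][status] += 1
    return table

# Two made-up log lines for demonstration.
sample = [
    '1.2.3.4 - - [x] "GET / HTTP/1.1" 200 10 "-" "MJ12bot/v1.4.8"',
    '1.2.3.4 - - [x] "GET /%22x%22 HTTP/1.1" 404 0 "-" "MJ12bot/v1.4.8"',
]
table = per_agent(sample)
print(dict(table))
```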
Of the top 10 bots hitting my blog (and in fact, these are the top ten clients hitting my blog), MJ12Bot is simply bad, with 72% bad requests. It's hard to say which is the second worst, but I'll have to give it to the “The Knowledge AI” bot (and my search-foo is failing me in finding anything about this one). Percentage-wise, it's about on par with the others, but some of its requests are also rather odd:
/%22
/%22https:/www.brevardnc.org/business-directory/5474/rockys-soda-shop/%22
/%22http:/brevardnc.org/%22
/%22https:/www.greenvillesc.gov/%22
/%22https:/en.m.wikipedia.org/wiki/Caesars_Head_State_Park/%22
/%22https:/www.transylvaniacounty.org/town-of-rosman/%22
It appears to be a similar problem as MJ12Bot, but one that doesn't happen nearly as often.
Now, this isn't to say I don't have some legitimate “not found” (404) results. I did come across some valid 404s on my own blog:
/2004/08/18/mias@speedy.com.pe
/2012/08/10/HREF
/2013/01/02/menamena
/2013/02/01/HREF
/2014/05/04/HREF
/2015/02/10/B000FBJCJE
/2015/07/10/mailtp:admin@macropayday.com
Some are typos, some are placeholders for links I forgot to add. And those I can fix. I just wish someone would fix MJ12Bot. Not because it's bogging down my site with unwanted traffic, but because it's just bad at what it does.
Thursday, July 11, 2019
Yet more observations about the MJ12Bot
I received a reply about MJ12Bot! Let's see …
From: Majestic <XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX>
To: Sean Conner <sean@conman.org>
Subject: [Majestic] Re: Your robot is making bogus requests to my webserver
Date: Thu, 11 Jul 2019 08:34:13 +0000
##- Please type your reply above this line -##
Oh … really? Sigh.
Anyway, the only questionable bit in the email was this line:
The prefix “//” in a link of course refers to the same site as the current page, over the same protocol, so this is why these URLs are being requested back from your server.

which is … somewhat correct. It does mean “use the same protocol,” but the double slash denotes a “network-path reference” (RFC-3986, section 4.2), where, at a minimum, a hostname is required. If this is just a misunderstanding on the developers' part, it could explain the behavior I'm seeing.
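The difference is easy to demonstrate with an actual RFC-3986 resolver; Python's urllib.parse.urljoin implements the reference-resolution algorithm, and it treats a “//” prefix as a network-path reference with a new host, not as a path on the current site (the base URL here is hypothetical):

```python
from urllib.parse import urljoin

base = 'http://www.example.com/2019/07/11.1'  # hypothetical current page

# A network-path reference: the scheme is kept, but the host comes
# from the reference itself (RFC-3986, section 4.2).
print(urljoin(base, '//www.thomasedison.com/'))  # http://www.thomasedison.com/

# What MJ12Bot's reply describes is really an absolute-path
# reference, which starts with a single slash:
print(urljoin(base, '/somewhere'))  # http://www.example.com/somewhere
```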
And speaking of behavior, I decided to check the logs (again, using last month) one last time for two reports.
404 (not found) | 200 (okay) | Total requests | User agent |
---|---|---|---|
170 | 42676 | 46334 | The Knowledge AI |
21 | 36088 | 38097 | Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html) |
46 | 16633 | 17130 | Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/) |
5 | 15840 | 15928 | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/) |
3 | 12304 | 12353 | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) |
36 | 8412 | 8929 | Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler) |
7 | 8428 | 8908 | Gigabot |
5680 | 2015 | 7872 | Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) |
28 | 6604 | 6942 | Barkrowler/0.9 (+http://www.exensa.com/crawl) |
0 | 4705 | 4737 | istellabot/t.1.13 |
404 (not found) | 200 (okay) | Total requests | User agent |
---|---|---|---|
5680 | 2015 | 7872 | Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) |
656 | 109 | 768 | Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/) |
177 | 45 | 553 | Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2) |
170 | 42676 | 46334 | The Knowledge AI |
120 | 0 | 120 | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0) |
(Note: The number of 404s and 200s might not add up to the total—there might be other requests that returned a different status not reported here.)
MJ12Bot is the 8th most active client on my site, yet it holds the top two spots for bad requests, beating out #3 by over an order of magnitude (35 times the amount, in fact).
But I don't have to worry about it since the email also stated they removed my site from their crawl list. Okay … I guess?
Friday, July 12, 2019
Once more with the MJ12Bot
So I replied to MJ12Bot's reply outlining everything I've mentioned so far about the sheer number of bad links they're following and how their explanation of “//” wasn't correct. They then replied:
From: Majestic <XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX>
To: Sean Conner <sean@conman.org>
Subject: [Majestic] Re: Your robot is making bogus requests to my webserver
Date: Fri, 12 Jul 2019 07:27:48 +0000
##- Please type your reply above this line -##
Your ticket reference is XXXXX. To add any comments, just reply to this email.
I can tell from your responses that you are much better than us, so we can only continue to avoid visiting your site.
Kind Regards
XXXXX
I guess this is their way of politely telling me to XXXXXXXX. Fair enough.
Tuesday, July 16, 2019
Notes on blocking the MJ12Bot
The MJ12Bot is the first robot listed in Wikipedia's robots.txt file, which I find amusing for obvious reasons. In the Hacker News comments there's a thread specifically about the MJ12Bot, and I replied to a comment about blocking it. It's not that easy, because it's a distributed bot that used 136 unique IP addresses just last month. Because of that comment, I decided I should expand on some of those numbers here.
The first table counts the unique addresses seen from January through June, 2019, to show they're not all from a single netblock. The address format “A.B.C.D” represents a unique IP address, like 172.16.15.2; “A.B.C” represents the addresses 172.16.15.0 through 172.16.15.255; “A.B” represents the range 172.16.0.0 through 172.16.255.255; and finally, “A” represents the range 172.0.0.0 through 172.255.255.255.
Address format | number |
---|---|
A.B.C.D | 312 |
A.B.C | 256 |
A.B | 86 |
A | 53 |
Next are the unique addresses from all of 2018 used by MJ12Bot:
Address format | number |
---|---|
A.B.C.D | 474 |
A.B.C | 370 |
A.B | 125 |
A | 66 |
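Tallies like the two tables above come from truncating each address to its first one, two, three, or four octets and counting the distinct values. A sketch, using hypothetical addresses:

```python
def prefix_counts(addresses):
    """Count distinct dotted-quad prefixes at each octet depth."""
    counts = {}
    for depth in (4, 3, 2, 1):
        # Keep only the first `depth` octets, then count unique values.
        counts[depth] = len({'.'.join(a.split('.')[:depth]) for a in addresses})
    return counts

# Hypothetical addresses for demonstration.
ips = ['172.16.15.2', '172.16.15.9', '172.16.20.1', '10.0.0.1']
print(prefix_counts(ips))  # {4: 4, 3: 3, 2: 2, 1: 2}
```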
This wide distribution easily explains why Wikipedia found that it ignores any rate limits set. Each individual node of MJ12Bot probably followed the rate limit, but it's a hard problem to coordinate across … what? 500 machines around the world?
It seems the best bet is to ban MJ12Bot via robots.txt:

User-agent: MJ12bot
Disallow: /
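You can verify the rule does what's intended with Python's standard urllib.robotparser (the site URL here is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Parse the two-line robots.txt rule directly.
rp = RobotFileParser()
rp.parse([
    'User-agent: MJ12bot',
    'Disallow: /',
])

print(rp.can_fetch('MJ12bot', 'http://www.example.com/'))      # False
print(rp.can_fetch('SomeOtherBot', 'http://www.example.com/')) # True
```

Assuming the bot actually honors robots.txt, of course.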
While I haven't added MJ12Bot to my own robots.txt file, it hasn't hit my site since they removed me from their crawl list, so it appears it can be tamed.