Tuesday, July 16, 2019

Notes on blocking the MJ12Bot

The MJ12Bot is the first robot listed in Wikipedia's robots.txt file, which I find amusing for obvious reasons. In the Hacker News comments there's a thread specifically about the MJ12Bot, and I replied to a comment about blocking it. Blocking it by address is not that easy, because it's a distributed bot that used 136 unique IP addresses just last month. Because of that comment, I decided I should expand on some of those numbers here.

The first table is the number of addresses from January through June 2019, to show they're not all from a single netblock. The address format “A.B.C.D” represents a single unique IP address; “A.B.C” represents the 256 addresses A.B.C.0 to A.B.C.255; “A.B” represents the range A.B.0.0 to A.B.255.255; and finally “A” represents the range A.0.0.0 to A.255.255.255.

Number of distinct IP addresses used by MJ12Bot in 2019 when hitting my site
Address format number
A.B.C.D 312
A.B.C 256
A.B 86
A 53
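
For anyone who wants to produce the same kind of breakdown from their own logs, here is a minimal sketch in Python. It assumes you have already pulled the bot's IPv4 addresses out of your web server log into a list of strings. The function name `prefix_counts` and the sample addresses (taken from the reserved documentation ranges) are made up for illustration, and this is not the script I used for the tables here.

```python
def prefix_counts(addresses):
    """Count distinct IPv4 prefixes at each octet level.

    `addresses` is any iterable of dotted-quad strings, assumed to have
    been extracted from a log beforehand.
    """
    levels = [("A.B.C.D", 4), ("A.B.C", 3), ("A.B", 2), ("A", 1)]
    seen = {name: set() for name, _ in levels}
    for addr in addresses:
        octets = addr.strip().split(".")
        if len(octets) != 4:
            continue  # skip anything that is not a dotted-quad IPv4 address
        for name, n in levels:
            # Keep only the first n octets as the prefix for this level.
            seen[name].add(tuple(octets[:n]))
    return {name: len(prefixes) for name, prefixes in seen.items()}


# Hypothetical sample data, not real MJ12Bot addresses.
sample = ["192.0.2.1", "192.0.2.7", "192.0.3.1", "198.51.100.4", "203.0.113.9"]
counts = prefix_counts(sample)
for name, count in counts.items():
    print(name, count)
```

Using sets keeps duplicate hits from the same address from inflating the totals, which matters because a crawler will normally hit a site many times from each node.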

Next are the unique addresses from all of 2018 used by MJ12Bot:

Number of distinct IP addresses used by MJ12Bot in 2018 when hitting my site
Address format number
A.B.C.D 474
A.B.C 370
A.B 125
A 66

This wide distribution can easily explain why Wikipedia found that it ignores any rate limits set. Each individual node of MJ12Bot probably followed the rate limit, but it's a hard problem to coordinate across … what? 500 machines across the world?

It seems the best bet is to ban MJ12Bot via robots.txt:

User-agent: MJ12bot
Disallow: /
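
If you want to sanity-check a rule like that before deploying it, Python's standard `urllib.robotparser` module can parse the lines and report what a given user agent may fetch. This is only a sketch: the URL and the `SomeOtherBot` name are hypothetical, and it shows how Python's parser reads the rule, not how MJ12Bot itself interprets it.

```python
from urllib.robotparser import RobotFileParser

# The two lines from the rule above, as they would appear in robots.txt.
rules = [
    "User-agent: MJ12bot",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Hypothetical URL, used only to exercise the rule.
url = "https://example.com/any/page"

mj12_allowed = parser.can_fetch("MJ12bot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print("MJ12bot allowed:", mj12_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

The rule only names MJ12bot, so other crawlers are unaffected by it.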

While I haven't added MJ12Bot to my own robots.txt file, it hasn't hit my site since they removed me from their crawl list, so it appears it can be tamed.

