Tuesday, July 16, 2019
Notes on blocking the MJ12Bot
The MJ12Bot is the first robot listed in Wikipedia's robots.txt file,
which I find amusing for obvious reasons.
In the Hacker News comments there's a thread specifically about the MJ12Bot,
and I replied to a comment about blocking it.
It's not that easy, because it's a distributed bot that has used 136 unique IP addresses just last month.
Because of that comment,
I decided I should expand on some of those numbers here.
The first table is the number of addresses from January through June 2019, to show they're not all from a single netblock.
The address format “A.B.C.D” will represent a unique IP address, like
“A.B.C” will represent the IP addresses
“A.B” will represent the range
172.16.255.255 and finally “A” will represent the range
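The grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script that produced the numbers here: the function name `prefix_counts` is hypothetical, and the sample addresses are made up from the reserved documentation ranges rather than taken from any log.

```python
# Labels for the four levels of grouping used in the tables.
LEVELS = ["A", "A.B", "A.B.C", "A.B.C.D"]

def prefix_counts(addresses):
    """Count the distinct prefixes seen at each level of a dotted-quad address."""
    seen = {label: set() for label in LEVELS}
    for addr in addresses:
        octets = addr.strip().split(".")
        if len(octets) != 4:
            continue  # skip anything that is not a dotted-quad IPv4 address
        for depth, label in enumerate(LEVELS, start=1):
            # The first `depth` octets identify the block at this level.
            seen[label].add(".".join(octets[:depth]))
    return {label: len(prefixes) for label, prefixes in seen.items()}

# Made-up sample data (documentation ranges), including one duplicate.
sample = [
    "192.0.2.1",
    "192.0.2.7",
    "198.51.100.4",
    "198.51.100.200",
    "203.0.113.9",
    "192.0.2.1",
]

for label, count in prefix_counts(sample).items():
    print(f"{label:8} {count}")
```

If the “A.B.C.D” count is much larger than the “A” count, the addresses cluster in a few blocks; if the counts stay close together, they are scattered, which is the pattern that makes blocking by address impractical.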
Next are the unique addresses from all of 2018 used by MJ12Bot:
This wide distribution could easily explain why Wikipedia found that it ignored rate limits. Each individual node of MJ12Bot probably followed the rate limit, but it's a hard problem to coordinate across … what? 500 machines across the world?
It seems the best bet is to ban MJ12Bot via robots.txt:
User-agent: MJ12bot
Disallow: /
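A rule like this can be sanity-checked before deploying it. The sketch below uses `urllib.robotparser` from Python's standard library; the path and the name `SomeOtherBot` are placeholders invented for the example. It only confirms how a well-behaved parser reads the rule, not that any particular crawler will honor it.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule, as it would appear in robots.txt.
rules = """\
User-agent: MJ12bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# MJ12bot should be refused everywhere; agents not named stay unaffected.
mj12_allowed = parser.can_fetch("MJ12bot", "/some/page.html")
other_allowed = parser.can_fetch("SomeOtherBot", "/some/page.html")

print("MJ12bot allowed:", mj12_allowed)
print("SomeOtherBot allowed:", other_allowed)
```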
While I haven't added MJ12Bot to my own robots.txt,
it hasn't hit my site since they removed me from their crawl list,
so it appears it can be tamed.