Tuesday, July 16, 2019
Notes on blocking the MJ12Bot
The MJ12Bot is the first robot listed in Wikipedia's robots.txt file, which I find amusing for obvious reasons.
In the Hacker News comments there's a thread specifically about the MJ12Bot,
and I replied to a comment about blocking it.
It's not that easy, because it's a distributed bot that used 136 unique IP addresses in the last month alone.
Because of that comment, I decided I should expand on some of those numbers here.
The first table is the number of addresses used from January through June 2019, to show they're not all from a single netblock.
The address format “A.B.C.D” represents a unique IP address, like 172.16.15.2; “A.B.C” represents the IP addresses 172.16.15.0 to 172.16.15.255; “A.B” represents the range 172.16.0.0 to 172.16.255.255; and finally “A” represents the range 172.0.0.0 to 172.255.255.255.
Address format | number |
---|---|
A.B.C.D | 312 |
A.B.C | 256 |
A.B | 86 |
A | 53 |
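For anyone who wants to produce the same kind of table from their own logs, here is a minimal sketch of the counting, assuming you already have the bot's addresses as a list of dotted-quad strings. The function name `count_prefixes` and the sample addresses are made up for illustration; they are not the data behind the numbers above.

```python
def count_prefixes(addresses):
    """Count distinct prefixes at each octet depth (A.B.C.D, A.B.C, A.B, A)."""
    unique = set(addresses)  # the same address seen twice counts once
    counts = {}
    for depth, label in zip((4, 3, 2, 1), ("A.B.C.D", "A.B.C", "A.B", "A")):
        # Keep only the first `depth` octets of each address.
        prefixes = {".".join(addr.split(".")[:depth]) for addr in unique}
        counts[label] = len(prefixes)
    return counts


# Hypothetical sample, not real MJ12Bot addresses.
sample = [
    "172.16.15.2",
    "172.16.15.9",
    "172.16.20.1",
    "172.31.0.5",
    "10.0.0.1",
    "172.16.15.2",  # duplicate on purpose
]

for label, number in count_prefixes(sample).items():
    print(f"{label} | {number} |")
```

The closer the “A” count is to the “A.B.C.D” count, the more spread out the addresses are, which is why blocking by netblock doesn't get you very far.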
Next are the unique addresses from all of 2018 used by MJ12Bot:
Address format | number |
---|---|
A.B.C.D | 474 |
A.B.C | 370 |
A.B | 125 |
A | 66 |
This wide distribution can easily explain why Wikipedia found that it ignores any rate limits set. Each individual node of MJ12Bot probably followed the rate limit, but it's a hard problem to coordinate a limit across … what? 500 machines across the world?
It seems the best bet is to ban MJ12Bot via robots.txt:
User-agent: MJ12bot
Disallow: /
While I haven't added MJ12Bot to my own robots.txt file, it hasn't hit my site since they removed me from their crawl list, so it appears it can be tamed.
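If you do add the rule, you can check how a standards-following parser reads it before deploying. This is a small sketch using Python's standard `urllib.robotparser`; the helper name `is_allowed` and the `example.com` URL are placeholders of mine, and it only shows how Python's parser interprets the rule, not how MJ12Bot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule for banning MJ12Bot.
RULES = """\
User-agent: MJ12bot
Disallow: /
"""


def is_allowed(user_agent, url, rules=RULES):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL; substitute a page from your own site.
print(is_allowed("MJ12bot", "https://example.com/page"))
print(is_allowed("SomeOtherBot", "https://example.com/page"))
```

The first call should report that the bot is blocked and the second that other robots are unaffected, since the file names only the one user agent.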