The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, July 16, 2019

Notes on blocking the MJ12Bot

The MJ12Bot is the first robot listed in Wikipedia's robots.txt file, which I find amusing for obvious reasons. In the Hacker News comments there's a thread specifically about the MJ12Bot, and I replied to a comment about blocking it. It's not that easy, because it's a distributed bot that used 136 unique IP addresses just last month. Because of that comment, I decided I should expand on some of those numbers here.

The first table is the number of addresses from January through June, 2019, to show they're not all from a single netblock. The address format “A.B.C.D” will represent a unique IP address, like 172.16.15.2; “A.B.C” will represent the IP addresses 172.16.15.0 to 172.16.15.255; “A.B” will represent the range 172.16.0.0 to 172.16.255.255; and finally “A” will represent the range 172.0.0.0 to 172.255.255.255.

Number of distinct IP addresses used by MJ12Bot in 2019 when hitting my site
Address format number
A.B.C.D 312
A.B.C 256
A.B 86
A 53
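
For anyone who wants to run the same kind of tally against their own logs, here is a minimal sketch in Python of how the four rows could be computed. It is not the script used to produce the numbers above; the function name `count_prefixes` and the sample addresses are made up for illustration, and it assumes the bot's IPv4 addresses have already been pulled out of the log.

```python
def count_prefixes(addresses):
    """Count distinct prefixes at each of the four levels used in the table.

    `addresses` is any iterable of dotted-quad IPv4 strings; duplicates
    are fine, since only distinct values are counted.
    """
    unique = set(addresses)
    counts = {}
    for octets, label in ((4, "A.B.C.D"), (3, "A.B.C"), (2, "A.B"), (1, "A")):
        # Keep only the leading octets, then count the distinct results.
        prefixes = {".".join(addr.split(".")[:octets]) for addr in unique}
        counts[label] = len(prefixes)
    return counts


# Hypothetical sample data, not taken from any real log.
sample = [
    "172.16.15.2",
    "172.16.15.9",
    "172.16.20.1",
    "172.31.0.5",
    "10.0.0.1",
    "172.16.15.2",  # repeat visit from the same address
]

counts = count_prefixes(sample)
for label, number in counts.items():
    print(label, number)
```

The closer the “A.B.C” count sits to the “A.B.C.D” count, the less blocking a whole /24 at a time buys you.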

Next are the unique addresses from all of 2018 used by MJ12Bot:

Number of distinct IP addresses used by MJ12Bot in 2018 when hitting my site
Address format number
A.B.C.D 474
A.B.C 370
A.B 125
A 66

This wide distribution can easily explain why Wikipedia found that it ignored any rate limits set. Each individual node of MJ12Bot probably followed the rate limit, but it's a hard problem to coordinate across … what? 500 machines across the world?

It seems the best bet is to ban MJ12Bot via robots.txt:

User-agent: MJ12bot
Disallow: /
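
As a quick sanity check, the two lines above can be fed to the robots.txt parser in Python's standard library to see how a well-behaved crawler would read them. This is only a sketch: it shows what the rule says, not what MJ12Bot itself actually does, and the agent name “SomeOtherBot” is made up for the comparison.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as above.
rules = [
    "User-agent: MJ12bot",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

url = "https://boston.conman.org/2000/08/01"

# The rule applies to MJ12bot, so it may not fetch anything.
blocked = not parser.can_fetch("MJ12bot", url)

# No rule mentions other agents, so they are still allowed.
# "SomeOtherBot" is a hypothetical name used only for this check.
others_allowed = parser.can_fetch("SomeOtherBot", url)

print("MJ12bot blocked:", blocked)
print("Other bots allowed:", others_allowed)
```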

While I haven't added MJ12Bot to my own robots.txt file, it hasn't hit my site since they removed me from their crawl list, so it appears it can be tamed.

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.