My common Gemini crawler pitfalls

Saturday, April 16, 2022

Just like the common web, crawlers on Gemini can run into similar pitfalls. Though the impact is much lower. The gemtext format is much smaller than HTML. And since Gemini does not support reusing the TCP connection. It takes much longer to mass-crawl a single capsule. Likely I can catch some issues when I see the crawler is still running late night. Anyways, this is a list of issues I have seen.

Common Gemini crawler pitfalls

Martin Chang has some views of crawlers from the crawler's perspective, but I have some views of crawlers from the receiving end that Martin doesn't cover. I finally got so fed up with Gemini crawlers not bothering to limit how many redirects they follow that I removed not only that particular client test from my site, but the entire client test. Martin does mention a “capsule linter” to check for “infinite extending links,” but that's not an issue a site author should fix just to appease the crawler authors! It's an actual thing that can happen on the Internet. A crawler must deal with such situations.
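A minimal sketch of a redirect limit, in Python, might look like the following. The names `fetch_following_redirects`, `TooManyRedirects` and the injected `fetch` callable (which returns a status code and the meta field for a URL) are hypothetical, and the cap of 5 is an assumption rather than anything this site enforces. Redirect targets are also assumed to be absolute URLs to keep the example short.

```python
from typing import Callable, Tuple

MAX_REDIRECTS = 5  # assumed cap; pick whatever limit suits the crawler


class TooManyRedirects(Exception):
    """Raised when a redirect chain loops or exceeds the cap."""


def fetch_following_redirects(
    url: str,
    fetch: Callable[[str], Tuple[int, str]],  # hypothetical: url -> (status, meta)
    max_redirects: int = MAX_REDIRECTS,
) -> Tuple[str, int, str]:
    """Fetch `url`, following at most `max_redirects` 3x responses."""
    seen = {url}
    for _ in range(max_redirects + 1):
        status, meta = fetch(url)
        if not 30 <= status <= 39:
            return url, status, meta  # not a redirect, so we are done
        url = meta  # for 3x responses, meta carries the new location
        if url in seen:
            raise TooManyRedirects(f"redirect loop at {url}")
        seen.add(url)
    raise TooManyRedirects(f"gave up after {max_redirects} redirects")
```

The loop check alone is not enough, because an endlessly extending chain never revisits a URL. It is the hard cap that stops the crawler in that case.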

Another issue I'm seeing with crawlers is an inability to deal with relative links. I'm seeing requests like gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1 or gemini://gemini.conman.org//boston/2015/07/02.3. I can't wrap my brain around how the crawler got the first link (and every such request comes from the same IP address, 23.88.52.182), while the second seems like a simple bug to fix (it was generated by only three different clients: 202.61.246.155, 116.202.128.144, 198.50.210.248).
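For crawlers written in Python, one way to resolve links is to lean on `urllib.parse.urljoin` instead of gluing strings together. The sketch below assumes the common workaround of registering the `gemini` scheme in the module's `uses_relative` and `uses_netloc` lists (undocumented module-level lists), because `urljoin` otherwise returns the link unchanged for schemes it does not know. The helper name `resolve` is hypothetical, and the guess that plain concatenation is behind the double slash is only that, a guess.

```python
from urllib.parse import urljoin, uses_netloc, uses_relative

# Assumption: register "gemini" so urljoin treats it like http for
# reference resolution. Without this, urljoin returns the link as-is.
for registry in (uses_relative, uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")


def resolve(base: str, link: str) -> str:
    """Resolve `link` found on the page at `base` into an absolute URL."""
    return urljoin(base, link)


base = "gemini://gemini.conman.org/boston/2015/07/02.2"

# An absolute-path link replaces the whole path of the base URL.
print(resolve(base, "/boston/2015/07/02.3"))

# A relative link replaces only the last path segment of the base URL.
print(resolve(base, "02.3"))

# Plain concatenation is one plausible way to end up with "//" in the path.
print("gemini://gemini.conman.org/" + "/boston/2015/07/02.3")
```

The first two calls both yield gemini://gemini.conman.org/boston/2015/07/02.3, while the concatenation yields the double-slash form seen in the logs.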

The next issue is the ever-present empty requests. People, Gemini is not gopher; empty requests are not allowed. I'm even returning an invalid error status code for this, in the vain hope that the people running the crawlers (or clients) would notice the invalid status code. I wonder what might happen if I return a gopher error in this case? I mean, I modified my gopher server to return an HTTP error response when it received an HTTP request, so I think I can justify returning a gopher error from my Gemini server when it receives a gopher request. These types of requests have come from nine different crawlers, and this list includes the ones with the parsing issues.
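A Gemini request is a single absolute URL followed by CRLF, so a client can refuse to send an empty one before it ever opens a connection. The following is a rough sketch only. The helper name `build_request` is hypothetical, and restricting the crawler to `gemini://` URLs is an assumption made for this example.

```python
MAX_URL_BYTES = 1024  # maximum request URL length in the Gemini specification


def build_request(url: str) -> bytes:
    """Return the request line for `url`, or raise ValueError if it is invalid."""
    url = url.strip()
    encoded = url.encode("utf-8")
    if not encoded:
        # Unlike gopher, an empty line is not a request for the root.
        raise ValueError("empty request")
    if not url.lower().startswith("gemini://"):
        # Assumption: this crawler only ever requests gemini:// URLs.
        raise ValueError(f"not an absolute gemini URL: {url!r}")
    if len(encoded) > MAX_URL_BYTES:
        raise ValueError("URL longer than 1024 bytes")
    return encoded + b"\r\n"
```

To ask for the top of a capsule, the crawler sends the full URL, for example `build_request("gemini://gemini.conman.org/")`, never a bare CRLF.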

Continuing on, there have been requests for domains I'm not running a Gemini server on. I'm not running a Gemini server on conman.org, www.conman.org, or boston.conman.org. The domain is gemini.conman.org.

Another inexplicable thing I'm seeing is a bunch of requests of the form gemini://gemini.conman.org/bible/genesis.41:1-57, all coming from the same place, 202.61.246.155 (this one seems to be a particularly bad crawler). What's weird about it is that the request should be gemini://gemini.conman.org/bible/Genesis.41:1-57 (note the upper case “G” in “Genesis”). The links on the site are properly cased, so this shouldn't be an issue. Is the crawler attempting to canonicalize links to lower case? That's not right. And by doing this, this particular crawler is just generating spurious requests (the server will redirect to the proper location).
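If a crawler does want to canonicalize URLs (say, to avoid fetching the same page twice), only the scheme and host are case-insensitive, and the path has to be left exactly as the link gave it. A minimal Python sketch, where the helper name `normalize` is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lower-case the scheme and host only; keep path, query and fragment as-is."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # host (and port) are safe to lower-case
        parts.path,            # the path is case-sensitive: do not touch it
        parts.query,
        parts.fragment,
    ))


print(normalize("GEMINI://Gemini.Conman.Org/bible/Genesis.41:1-57"))
```

This prints gemini://gemini.conman.org/bible/Genesis.41:1-57, with the upper case “G” in “Genesis” intact.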

So yes, those are my common Gemini crawler pitfalls.

Update on Friday, April 22nd, 2022

I have managed to wrap my brain around how it got that link.

Update on Sunday, May 1st, 2022

And yes, the “double slash” bug was a simple, but …

The Boston Diaries
