Saturday, April 16, 2022
My common Gemini crawler pitfalls
Just like on the common web, crawlers on Gemini can run into similar pitfalls, though the impact is much lower: the gemtext format is much smaller than HTML, and since Gemini does not support reusing the TCP connection, it takes much longer to mass-crawl a single capsule. That means I can likely catch some issues when I see the crawler still running late at night. Anyway, this is a list of issues I have seen.
Common Gemini crawler pitfalls
Martin Chang has some views of crawlers from the crawler's perspective, but I still have some views of crawlers from the receiving end that Martin doesn't cover. I finally got so fed up with Gemini crawlers not bothering to limit the redirects they follow that I removed not just that particular test, but the entire client test, from my site. Martin does mention a “capsule linter” to check for “infinite extending links,” but that's not an issue a site author should have to fix just to appease the crawler authors! It's an actual thing that can happen on the Internet, and a crawler must deal with such situations.
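To put a number on “limit,” here is a minimal sketch in Python of the loop a crawler could use; fetch_gemini() is a hypothetical stand-in for whatever issues a single request and returns the parsed status, meta line, and body:

# A sketch of redirect-capped fetching for a Gemini crawler.
# fetch_gemini() is hypothetical: one request in, (status, meta, body) out.
MAX_REDIRECTS = 5  # the exact cap matters less than having one at all

def fetch_with_redirect_cap(url, fetch_gemini, max_redirects=MAX_REDIRECTS):
    seen = set()
    for _ in range(max_redirects + 1):
        if url in seen:                     # redirect loop: a -> b -> a
            raise RuntimeError("redirect loop at " + url)
        seen.add(url)
        status, meta, body = fetch_gemini(url)
        if status // 10 != 3:               # Gemini 3x codes are redirects
            return url, status, meta, body
        url = meta                          # on a 3x, meta is the new URL
    raise RuntimeError("too many redirects ending at " + url)

With a cap and a seen-set, “infinite extending links” degrade into a handful of wasted requests instead of an endless crawl.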
Another issue I'm seeing with crawlers is an inability to deal with relative links. I'm seeing requests like gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1 or gemini://gemini.conman.org//boston/2015/07/02.3. The former I can't wrap my brain around how it got that link (and every request comes from the same IP address—23.88.52.182), while the second one seems like a simple bug to fix (generated by only three different clients—202.61.246.155, 116.202.128.144, 198.50.210.248).
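Both of those malformed URLs smell like hand-rolled link resolution. Here is a sketch of doing it with Python's urllib.parse instead, assuming a crawler written in Python; the one catch is that the gemini scheme has to be registered before urljoin() will resolve relative references for it:

from urllib.parse import urljoin, uses_netloc, uses_relative

# Without this registration, urljoin() passes gemini links through untouched.
uses_relative.append("gemini")
uses_netloc.append("gemini")

base = "gemini://gemini.conman.org/boston/2015/07/01.1"
print(urljoin(base, "02.3"))
# gemini://gemini.conman.org/boston/2015/07/02.3

# The "double slash" bug is typical of naive string concatenation:
root = "gemini://gemini.conman.org/"
link = "/boston/2015/07/02.3"
print(root + link)          # gemini://gemini.conman.org//boston/2015/07/02.3 (wrong)
print(urljoin(root, link))  # gemini://gemini.conman.org/boston/2015/07/02.3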
The next issue is the ever-present empty request. People, Gemini is not gopher—empty requests are not allowed. I'm even returning an invalid error status code for this, in the vain hope that the people running the crawlers (or clients) will notice the invalid status code. I wonder what might happen if I returned a gopher error in this case? I mean, I modified my gopher server to return an HTTP error response when it received an HTTP request, so I think I can justify returning a gopher error from my Gemini server when it receives a gopher request. These types of requests have come from nine different crawlers, a list that includes the ones with the parsing issues.
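For reference, here is a sketch of the checks a Gemini server can apply to the raw request; the CRLF terminator and the 1024-byte URL limit come from the Gemini specification (what status to send back, valid or deliberately invalid, is a separate choice):

# A sketch of request validation for a non-proxying Gemini server.
def validate_request(raw: bytes) -> str:
    if len(raw) > 1026:                    # 1024-byte URL plus CRLF
        raise ValueError("request too long")
    if not raw.endswith(b"\r\n"):
        raise ValueError("request not terminated by CRLF")
    line = raw[:-2].decode("utf-8")
    if line == "":
        raise ValueError("empty request: Gemini is not gopher")
    if not line.startswith("gemini://"):   # other schemes imply proxying
        raise ValueError("not an absolute gemini URL")
    return line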
Continuing on, there have been requests for domains I'm not running a Gemini server on. I'm not running a Gemini server on conman.org, www.conman.org, nor boston.conman.org. The domain is gemini.conman.org.
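Rejecting those is a one-line check against the configured virtual host; status 53 (“proxy request refused”) is what the specification provides for a request naming a domain the server doesn't serve. A sketch, assuming a single-host server:

from urllib.parse import urlsplit

SERVED_HOST = "gemini.conman.org"

def check_host(request_url: str):
    # urlsplit() lowercases the hostname for us
    host = urlsplit(request_url).hostname or ""
    if host != SERVED_HOST:
        return b"53 proxy request refused\r\n"
    return None  # host matches; carry on handling the request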
Another inexplicable thing I'm seeing is a bunch of requests of the form gemini://gemini.conman.org/bible/genesis.41:1-57, which are all coming from the same place—202.61.246.155 (this one seems to be a particularly bad crawler). What's weird about it is that the request should be gemini://gemini.conman.org/bible/Genesis.41:1-57 (note the upper case “G” in “Genesis”). The links on the site are properly cased, so this shouldn't be an issue—is the crawler attempting to canonicalize links to lower case? That's not right, and by doing this, the crawler is just generating spurious requests (the server will redirect to the proper location).
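For the record, RFC 3986 makes only the scheme and the host case-insensitive; the path is case-sensitive and must be left alone. A canonicalizer that respects that is short:

from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    p = urlsplit(url)  # urlsplit() already lowercases the scheme
    return urlunsplit((p.scheme, p.netloc.lower(), p.path, p.query, p.fragment))

print(canonicalize("GEMINI://Gemini.Conman.Org/bible/Genesis.41:1-57"))
# gemini://gemini.conman.org/bible/Genesis.41:1-57 -- path case preserved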
So yes, those are my common Gemini crawler pitfalls.
Update on Friday, April 22nd, 2022
I have managed to wrap my brain around how it got that link.
Update on Sunday, May 1st, 2022
And yes, the “double slash” bug was a simple, but …