The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, April 16, 2022

My common Gemini crawler pitfalls

Just like the common web, crawlers on Gemini can run into similar pitfalls. Though the impact is much lower. The gemtext format is much smaller than HTML. And since Gemini does not support reusing the TCP connection. It takes much longer to mass-crawl a single capsule. Likely I can catch some issues when I see the crawler is still running late night. Anyways, this is a list of issues I have seen.

Common Gemini crawler pitfalls

Martin Chang has some views of crawlers from the crawler's perspective, but I still have some views of crawlers from the receiving end that Martin doesn't cover. I finally got so fed up with Gemini crawlers not bothering to limit their following of redirects that I removed not only that particular client test from my site, but the entire client test as well. Martin does mention a “capsule linter” to check for “infinite extending links,” but that's not an issue a site author should fix just to appease the crawler authors! It's an actual thing that can happen on the Internet. A crawler must deal with such situations.
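To make the redirect point concrete, here is a minimal sketch of a hop limit, in Python. Everything named here is an assumption for illustration: `fetch` is a hypothetical function supplied by the crawler that returns a `(status, meta)` pair, it is assumed to hand back an absolute URL in `meta` for redirects, and the default of 5 hops is an arbitrary choice. Note that remembering visited URLs is not enough on its own, since a chain of redirects can lead to a new URL every time.

```python
class TooManyRedirects(Exception):
    """Raised when a redirect chain exceeds the configured hop limit."""


def follow_redirects(url, fetch, max_redirects=5):
    """Follow Gemini 3x redirects, giving up after max_redirects hops.

    fetch(url) is a hypothetical callable returning (status, meta).
    For 3x statuses, meta is assumed to already be an absolute URL.
    """
    hops = 0
    while True:
        status, meta = fetch(url)
        if status // 10 != 3:
            # Not a redirect: hand the final response back to the caller.
            return url, status, meta
        if hops >= max_redirects:
            raise TooManyRedirects(f"gave up after {max_redirects} redirects")
        hops += 1
        url = meta
```

A crawler built this way stops after a fixed number of requests no matter how long the chain is, so an endless redirect costs it a handful of requests instead of the whole night.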

Another issue I'm seeing with crawlers is an inability to deal with relative links. I'm seeing requests like gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1 or gemini://gemini.conman.org//boston/2015/07/02.3. For the former, I can't wrap my brain around how it got that link (and every request comes from the same IP address, 23.88.52.182), while the second one seems like a simple bug to fix (generated by only three different clients: 202.61.246.155, 116.202.128.144, 198.50.210.248).
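For what it's worth, here is a sketch of resolving relative links in Python. One trap worth knowing: `urllib.parse.urljoin` only resolves relative references for schemes it recognizes, and `gemini` is not one of them, so it hands the link back unresolved. The workaround below borrows the `http` handling and swaps the scheme back afterwards. The function name `resolve_gemini` is made up for this example, and this is not a claim about how any of the crawlers above actually work.

```python
from urllib.parse import urljoin, urlsplit


def resolve_gemini(base, link):
    """Resolve a link found on the page at `base` into an absolute URL."""
    if urlsplit(link).scheme:
        # Already absolute (possibly a different scheme); leave it alone.
        return link
    base_parts = urlsplit(base)
    if base_parts.scheme != "gemini":
        raise ValueError("base must be an absolute gemini:// URL")
    # urljoin knows how to resolve against http, so pretend for a moment.
    http_base = base_parts._replace(scheme="http").geturl()
    joined = urljoin(http_base, link)
    return urlsplit(joined)._replace(scheme="gemini").geturl()


base = "gemini://gemini.conman.org/boston/2008/04/30.1"
print(resolve_gemini(base, "/boston/2015/07/02.3"))
print(resolve_gemini(base, "30.2"))
```

Gluing the link onto the base with plain string concatenation is one way to end up with a doubled slash or a repeated path, though that is a guess on my part and not something the logs confirm.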

The next issue is the ever-present empty requests. People, Gemini is not gopher; empty requests are not allowed. I'm even returning an invalid error status code for this, in the vain hope the people running the crawlers (or clients) would notice the invalid status code. I wonder what might happen if I return a gopher error in this case? I mean, I modified my gopher server to return an HTTP error response when it received an HTTP request, so I think I can justify returning a gopher error from my Gemini server when it receives a gopher request. These types of requests have come from nine different crawlers, and this list includes the ones with the parsing issues.
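For reference, a Gemini request is a single line: the absolute URL, scheme included, at most 1024 bytes, followed by CRLF. A bare CRLF is a gopher request for the root menu, not a Gemini request. Here is a sketch of a client-side check that refuses to build anything else; the function name `build_request` is made up for this example.

```python
from urllib.parse import urlsplit

MAX_URL_BYTES = 1024  # limit on the URL portion of a Gemini request


def build_request(url):
    """Return the bytes to send for a Gemini request, or raise ValueError."""
    if "\r" in url or "\n" in url:
        raise ValueError("URL must not contain CR or LF")
    parts = urlsplit(url)
    if parts.scheme != "gemini" or not parts.hostname:
        # Covers the empty string as well as scheme-less or host-less input.
        raise ValueError("request needs an absolute gemini:// URL")
    encoded = url.encode("utf-8")
    if len(encoded) > MAX_URL_BYTES:
        raise ValueError("URL is longer than 1024 bytes")
    return encoded + b"\r\n"
```

With a check like this in front of the socket, an empty string raises an error on the client side instead of turning into an empty request on the wire.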

Continuing on, there have been requests for domains I'm not running a Gemini server on. I'm not running a Gemini server on conman.org, www.conman.org, or boston.conman.org. The domain is gemini.conman.org.

Another inexplicable thing I'm seeing is a bunch of requests of the form gemini://gemini.conman.org/bible/genesis.41:1-57, which are all coming from the same place, 202.61.246.155 (this one seems to be a particularly bad crawler). What's weird about it is that the request should be gemini://gemini.conman.org/bible/Genesis.41:1-57 (note the upper case “G” in “Genesis”). The links on the site are properly cased, so this shouldn't be an issue. Is the crawler attempting to canonicalize links to lower case? That's not right. And by doing this, this particular crawler is just generating spurious requests (the server will redirect to the proper location).
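If a crawler does want to canonicalize URLs (say, to avoid fetching the same page twice), only the scheme and the host name are case-insensitive; the path has to be left exactly as written. A minimal sketch in Python follows, with the function name `normalize` made up for this example. It assumes the URL carries no userinfo, which Gemini URLs should not.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lower-case the scheme and host of a URL; leave everything else alone."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),   # case-insensitive
        parts.netloc.lower(),   # case-insensitive (assumes no userinfo)
        parts.path,             # case-sensitive: do not touch
        parts.query,
        parts.fragment,
    ))


print(normalize("GEMINI://Gemini.Conman.ORG/bible/Genesis.41:1-57"))
```

Calling `.lower()` on the whole URL is the shortcut to avoid, since it turns `Genesis` into `genesis` and produces a request for a different resource.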

So yes, those are my common Gemini crawler pitfalls.

Update on Friday, April 22nd, 2022

I have managed to wrap my brain around how it got that link.

Update on Sunday, May 1st, 2022

And yes, the “double slash” bug was a simple, but …

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.