The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, April 12, 2022

Github shenanigans

It all started with a simple pull request to fix a bug. I have never attempted to just “merge” a pull request on Github before, but I figured, with such a simple change, why not try? Why not indeed.

Well, it broke my local repository. The commit message wasn't what I would have liked, and I felt a revision of the version number was required, which also involved updating the makefile and the Luarocks specification file. I made the mistake (I think—I don't know) of amending the merge commit with a reformatted message and the extra files, and that was that: I was unable to push the changes back to Github.

I ended up having to reset both my local repositories and the Github repository. Hard. As in with the git reset --hard nuclear option. Then I hand-added the patch into the code and redid all the changes to the makefile and Luarocks specification file (multiple times). Ugly stuff. But I got it the way I like it.
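
For the record, the recovery boils down to something like this (the commit reference and branch name below are placeholders, not the real ones, and the forced push rewrites history on Github, so it's not a step to take casually):

git reset --hard <last-good-commit>   # rewind the local branch to the known-good state
git push --force origin <branch>      # make the Github repository match it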

And then I went to load the new version of the code into Luarocks and of course it failed. Of course. Github decided several months ago to deprecate support for git: URLs and guess what I'm using?

Sigh.

It took longer than I would have liked to find out that I needed to switch to using git+https: URLs, and several version bumps of several of my Lua modules to get it all straightened out. I just cannot update the Luarocks specification files properly. It always takes way too many tries for me to get it right. Aaaaaah!
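
For anyone hitting the same wall, the change amounts to editing the source table in the rockspec. The module name, repository and tag below are made up for illustration; the URL scheme is the only point:

package = "example"
version = "1.0-1"
source =
{
  -- was: url = "git://github.com/someuser/example.git"
  url = "git+https://github.com/someuser/example.git",
  tag = "1.0"
}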

I'm also unsure why the Github merge failed for me. Am I not using the “proper” workflow? Is it because Github considers itself the “primary repository” when in fact, for my stuff, it isn't? I don't know. Perhaps I'm slowly becoming computer illiterate.

Update on Wednesday, April 13th, 2022

It seems the latest version of Luarocks will auto-correct git: URLs. [See what you get when you don't update every 20 minutes? —Editor] [Shut up, you! —Sean] I'm not sure what to think of this.

Wednesday, April 13, 2022

It's been long gone, like fifteen years long gone. Why are you still asking?

About a month ago, I was checking my webserver logs when I noticed multiple requests for pages that have long been marked as gone. The webserver has been returning HTTP status code 410 Gone, in some cases, for over fifteen years! At first I was annoyed—why are these webbots still requesting pages I've marked as gone? But then I started thinking about it—if I were writing a webbot to scan web pages, what would I do if I got a “gone” status? Well, I'd delete any references to said page, for sure. But what if I later came across the link on another page? I don't have the link (because I deleted it earlier), so let's add it to the scan queue. Lather, rinse, repeat.
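
In other words, a crawler that forgets which URLs returned 410 will keep rediscovering them from stale links. A toy sketch of the missing bookkeeping (this isn't any real bot's code, just an illustration of the idea):

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class CrawlQueue
{
  private final Queue<String> queue = new ArrayDeque<>();
  private final Set<String>   gone  = new HashSet<>();

  // Called whenever a link is found on a page.
  public void enqueue(String url)
  {
    if (!gone.contains(url))  // without this check, a stale link re-queues the gone page forever
    {
      queue.add(url);
    }
  }

  // Called when a fetch returns HTTP 410 Gone.
  public void markGone(String url)
  {
    gone.add(url);
  }
}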

So there's a page or pages out there that are still linking to the pages that are long gone. And upon further investigation, I found the pages—my own site!

Sigh.

I've fixed some of the links—mostly the ones that have been causing some real issues with Gemini requests, but I still have scores of links to fix in the blog.

I also noticed a large number of permanent redirects, and again, the cause is pages on my own site linking to the non-canonical source. This isn't much of an issue for HTTP (because the HTTP connection is still open for further requests) but it is one for Gemini (because each request is a separate connection to the server). I started fixing them, but when I did a full scan of the site (and it's mostly links on my blog), I found a significant number of links to fix—around 500 or so, mostly in the first five years of entries.


Just an observation, nothing more

As I've been cleaning up blog entries (the first few years also have some pretty bad formatting issues) I've noticed something—I used to write a lot more back then. I think part of that is that the whole “blogging” thing was new, and after twenty-plus years, I've covered quite a bit of material. There have been multiple instances where I come across something, think I should blog about that, and when I check to see if I have, indeed, blogged about that, I already have. Also, it's a bit more of a pain these days as I manually add links to my blogs at Me­Linked­Insta­My­Face­In­Gram­Space­Book­We. This used to be automated, but Insta­My­Face­Me­Linked­We­In­Gram­Space­Book doesn't play well with others, and with constant API updates and walled-garden policies it's, sadly, easier to manually update links than it is to automate it (also this chart). I mean, I don't have to update links at My­Face­Me­Linked­Insta­In­Gram­Space­Book­We, but pretty much everybody just reads Me­Linked­Insta­My­Face­We­In­Gram­Space­Book (Web? What's that?), which is why I bother at all.


Can someone explain to me why this happens?

I don't understand.

It's not just the MJ12Bot that can't parse links. It seems many web robots can't parse links correctly. Last month there were requests for (once decoded) /\"/2022/02/23.1\".

I mean, the request /2022/02/23.1 does exist on my server, but /\"/2022/02/23.1\" does not.

What?

That's worse than what MJ12Bot was sending back in the day.

And it's not like it's a single web robot making these requests—no! It's three different web robots!

I just … what?

Friday, April 15, 2022

There are only two certainties in life; this post is about one of them, and it's not death

Last night I was busy cleaning up my past blog entries when I happened to notice the date—April 15th. And then I suddenly remembered—I need to do my taxes! Well, as I found out later, I have until the 18th to file, but I always feel better if I'm able to get them filed by … well, technically, today.

Fortunately, my taxes aren't that hard, and it only took about half an hour for me to fill out the 1040 form (I'm not a fan of electronic filing, but that's only because I know how the electronic sausage is made), take the form to the post office, and have it hand-cancelled by the postmaster (okay, that last bit was done by Bunny since she was out running errands at the time).

Saturday, April 16, 2022

My common Gemini crawler pitfalls

Just like the common web, crawlers on Gemini can run into similar pitfalls. Though the impact is much lower. The gemtext format is much smaller than HTML. And since Gemini does not support reusing the TCP connection. It takes much longer to mass-crawl a single capsule. Likely I can catch some issues when I see the crawler is still running late night. Anyways, this is a list of issues I have seen.

Common Gemini crawler pitfalls

Martin Chang has some views of crawlers from the crawler's perspective, but I still have some views of crawlers from the receiving end that Martin doesn't cover. I finally got so fed up with Gemini crawlers not bothering to limit their following of redirects that I removed not just that particular client test, but the entire client test, from my site. Martin does mention a “capsule linter” to check for “infinite extending links,” but that's not an issue a site author should fix just to appease the crawler authors! It's an actual thing that can happen on the Internet. A crawler must deal with such situations.

Another issue I'm seeing with crawlers is an inability to deal with relative links. I'm seeing requests like gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1 or gemini://gemini.conman.org//boston/2015/07/02.3. The former I can't wrap my brain around how it got that link (and every request comes from the same IP address—23.88.52.182), while the latter seems like a simple bug to fix (generated by only three different clients—202.61.246.155, 116.202.128.144, 198.50.210.248).

The next issue is the ever-present empty request. People, Gemini is not gopher—empty requests are not allowed. I'm even returning an invalid error status code for this, in the vain hope that the people running the crawlers (or clients) would notice the invalid status code. I wonder what might happen if I return a gopher error in this case? I mean, I modified my gopher server to return an HTTP error response when it received an HTTP request, so I think I can justify returning a gopher error from my Gemini server when it receives a gopher request. These types of requests have come from nine different crawlers, and that list includes the ones with the parsing issues.

Continuing on, there have been requests for domains I'm not running a Gemini server on. I'm not running a Gemini server on conman.org, www.conman.org, nor boston.conman.org. The domain is gemini.conman.org.

Another inexplicable thing I'm seeing is a bunch of requests of the form gemini://gemini.conman.org/bible/genesis.41:1-57, which are all coming from the same place—202.61.246.155 (this one seems to be a particularly bad crawler). What's weird about it is that the request should be gemini://gemini.conman.org/bible/Genesis.41:1-57 (note the upper case “G” in “Genesis”). The links on the site are properly cased, so this shouldn't be an issue—is the crawler attempting to canonicalize links to lower case? That's not right. And by doing this, this particular crawler is just generating spurious requests (the server will redirect to the proper location).

So yes, those are my common Gemini crawler pitfalls.

Update on Friday, April 22nd, 2022

I have managed to wrap my brain around how it got that link.

Update on Sunday, May 1st, 2022

And yes, the “double slash” bug was a simple one, but …

Wednesday, April 20, 2022

The day I met the creator of Garfield

When I saw the date today, I remembered that on this day back in 1981, I met Jim Davis, creator of Garfield. And of course I've already written about this. But thinking about it, I'm not sure what to make of this—that it's been 22 years since I last wrote about it, which is longer than the time between the actual event and writing about it the first time (19 years). It's also sobering to think it's been 41 years since I met him, and I still remember it like it happened yesterday.

Sigh.

Thursday, April 21, 2022

Notes on obtaining a process ID from Java on Mac OS Big Sur

There's a tool I use at work to manually test the code we work on. It's a graphical tool written in Java, and it allows us to configure the data for a test. It then runs another process that starts everything up, and then runs another tool to inject a SIP message into the code being tested. It then opens the output file from the SIP injection tool to display the results. This tool doesn't work quite right on Belial, the annoying Mac Laptop.

Problem one—the output file has, as part of its name, the process ID of the SIP injection tool, and it seems there is no easy way to get said process ID from within Java. The other issue, also related to process IDs, is that it attempts to stop the process that starts everything up. That fails because, again, there is no easy way to get a process ID from within Java.

There is code that attempts to get a process ID:

process = pb.start();

// This only works when the runtime's Process implementation happens to be
// java.lang.UNIXProcess; it then pokes at that class's private "pid" field
// via reflection (java.lang.reflect.Field).
if (process.getClass().getName().equals("java.lang.UNIXProcess"))
{
  try
  {
    Field f = process.getClass().getDeclaredField("pid");
    f.setAccessible(true);
    sippPid = f.getInt(process);
  }
  catch (Throwable e)
  {
    // Failures are silently swallowed, so sippPid keeps its default of 0.
  }
}

This horrible bit of code does work under Linux, and it works on older versions of Mac OS-X. But Belial runs a more modern version of Mac OS-X, on the new Apple M1 architecture. Here, sippPid is always 0.

Sigh.
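
My guess (and it is a guess) is that newer JDKs no longer have a class named java.lang.UNIXProcess (it was folded into java.lang.ProcessImpl around Java 9), so the class-name check never matches and the reflection never runs. Java 9 also added Process.pid(), which makes the whole hack unnecessary if the tool can require a newer runtime. A minimal, self-contained sketch (the sleep command here is just a stand-in for the real SIP injection tool):

public class PidDemo
{
  public static void main(String[] args) throws Exception
  {
    // Start a child process; Java 9 and later expose its PID directly.
    Process process = new ProcessBuilder("sleep", "30").start();
    long sippPid = process.pid();   // no reflection, no private fields
    System.out.println("child pid = " + sippPid);
    process.destroy();
  }
}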


Notes on fixing a Java issue on Mac OS Big Sur

When last we met, I was left with a broken test tool on the newer Mac laptops. The issue at hand is that it's problematic to obtain process IDs in Java, which the testing tool needs for two things. The first is an output file. It turns out one can specify the output file the SIP injection tool generates instead of the default one which uses a process ID. This also makes it easier to check the output since you don't have to grovel through the directory for an ever-changing file name. That issue fixed.

The second one—how to stop the program that runs all the programs that are being tested. The code used the process ID to terminate that program by shelling out to run kill -SIGINT pid. It turns out the Java Process object does have a destroy() method (it sends a SIGTERM to a process, which is fine). It was just a simple matter to update the code to use the destroy() method to terminate the program rather than trying to obtain the process ID in a dodgy way. That issue fixed.
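
A rough sketch of what that boils down to (the helper name and the ten-second grace period are my own additions, not from the actual tool):

import java.util.concurrent.TimeUnit;

public class StopChild
{
  // Stop a child process without ever needing to know its PID.
  static void stop(Process process) throws InterruptedException
  {
    process.destroy();                            // sends SIGTERM on POSIX systems
    if (!process.waitFor(10, TimeUnit.SECONDS))   // give it a chance to exit cleanly
    {
      process.destroyForcibly();                  // escalate to SIGKILL if it ignores us
    }
  }
}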

Now all I have to do is spend a few weeks trying to get the code committed to the repository (yeah, I'm still trying to get used to the process—sigh).

Friday, April 22, 2022

Notes on some extreme lawn ornaments, Brevard edition

Eight years ago (wow! Has it been that long? [Yes. —Editor] [Who asked you? —Sean]) while in Brevard, I took a picture of some extreme lawn ornaments—life sized plastic cows. I wrote the “eat moar chikin” image caption (if you hold your mouse over the image, it should pop up) because the cows reminded me of the cows used by Chick-fil-a.

I'm reading the Transylvania Times when I come across the article “Transylvanian of the Week: John Taylor.” He owns O.P. Taylor's, a well known toy store in the area, and he's the one with the life sized plastic cows in his front yard. Not only that, but he purchased them from the person who made them for Chick-fil-a. Little did I know that my caption was more correct than I thought.


Play stupid games, win stupid prizes

It's not only Gemini bots having issues with redirects. I was poking around the logs from my webserver and scanned all of them to see the breakdown of response codes my server is sending (for this month). And well … it's rather surprising:

Breakdown of HTTP response codes from all the sites I host
Status Meaning Count
302 Found (moved temporarily) 253773
200 OK 178414
304 Not Modified 25552
404 Not Found 8214
301 Moved Permanently 6358
405 Method Not Allowed 1453
410 Gone 685
400 Bad Request 255
206 Partial Content 151
401 Unauthorized 48
500 Internal Server Error 24
403 Forbidden 4

I was not expecting that many temporary redirects. Was it some massive issue across all the sites? Or just a few? Well, it turned out all of the temporary redirects were from one site: http://www.flummux.org/ (and no, I'm not linking to it, as the reason will become clear). I registered the domain way back in 2000 just as a place to play around with web stuff or to temporarily make files available without cluttering up my main websites. The site isn't meant to be at all serious.

Scanning the log file manually, I was seeing endless log entries like:

XXXXX­XXXXX­XXXXX - - [10/Apr/2022:20:55:05 -0400] "GET / HTTP/1.0" 302 284 "http://flummux.org/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; MRA 4.6 (build 01425); .NET CLR 1.0.3705; .NET CLR 2.0.50727)" -/- (-%)

That log entry indicates a “browser” from IP address XXXXX­XXXXX­XXXXX, identifying itself as “Mozilla (yada yada),” on the 10th of April attempted to get the main page, as referred by http://flummux.org/. And here's how many times this happened, broken down by browser:

Top five user agents making the troublesome requests
Count User agent
127100 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; MRA 4.6 (build 01425); .NET CLR 1.0.3705; .NET CLR 2.0.50727)
126495 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E)
42 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
36 CATExplorador/1.0beta (sistemes at domini dot cat; https://domini.cat/catexplorador/)
15 Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0

Ah, two “browsers” that don't limit the number of redirects they follow. And amusingly enough, both agents came from the same IP address. Or maybe it's the same agent, just lying about what it is. Who knows? Well, aside from the author(s) of said “browser.”

But what was horribly confusing to me was why the server was issuing a temporary redirect. Yes, if you try to go to http://flummux.org/ the server will respond with a permanent redirect (status 301) to http://www.flummux.org/ (the reason for that is to canonicalize the URLs and avoid the “duplicate content penalty” from Google—I set this all up years ago). But the site shouldn't redirect again. I can bring the site up in my browser without issue (which is a visual … pun? Commentary? Joke? on the line “The sky above the port was the color of television, tuned to a dead channel.”).

And then I remembered—back in 2016, I set things up such that if the browser sent in a referring link, the page would temporarily redirect back to the referring link (which is why I'm not linking to it—you would just be redirected right back to this page). I set that up on a lark for some reason that now escapes me. So the above “browsers” kept bouncing back and forth between flummux.org and www.flummux.org. For a quarter of a million requests.

Sigh.

In other news, bugs are nothing more than an inattention to detail.


I have now wrapped my brain around how it got that link

Martin Chang replied to my post about Gemini crawlers, saying that it was his crawler that had sent links like gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1, and he decided to look into the issue. Well, he did, and he found it wasn't his issue, but mine.

Oh my.

Okay, so how did I end up generating links like gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1?

This is, first and foremost, a blog on the web. Each entry is stored as HTML, and when a request is made via gopher or Gemini, the entries making up the request are retrieved and converted to the appropriate format. As part of that conversion, links to the blog itself have to be translated appropriately, and that's where the error happened.

So, for example, the links for the above entry are collected:

  1. http://www.cisco.com/
  2. http://it.slashdot.org/article.pl?sid=08/04/29/2254242
  3. http://www.arin.net/
  4. 2008/04/30.1#fn-2008-04-30-1-1
  5. http://www.barracudanetworks.com/
  6. http://answers.yahoo.com/question/index?qid=20080219010714AAnF91Q

Those links with a URL scheme are passed through as is, but #4 is special: not only is it a relative link to my blog, it also contains a URL fragment, and that's where things went pear-shaped. The code to do the URL translations parsed each link as a URL, but for relative links, I used the string, not the parsed URL structure. As such, the code didn't work so well with URL fragments, and thus, I ended up with links like gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1 (for the record, the same bug was in the gopher translation code as well).
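
The underlying idea of the fix is language-agnostic: resolve the parsed relative reference against a base URL instead of gluing strings together, so the fragment never ends up in the path. A sketch in Java (not my actual code, and the base URL below is assumed purely for illustration):

import java.net.URI;

public class ResolveDemo
{
  public static void main(String[] args)
  {
    // Assume entry links are relative to the blog's root on the Gemini side.
    URI base = URI.create("gemini://gemini.conman.org/boston/");
    URI link = base.resolve("2008/04/30.1#fn-2008-04-30-1-1");

    System.out.println(link);               // gemini://gemini.conman.org/boston/2008/04/30.1#fn-2008-04-30-1-1
    System.out.println(link.getPath());     // /boston/2008/04/30.1 (the fragment stays out of the path)
    System.out.println(link.getFragment()); // fn-2008-04-30-1-1
  }
}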

The fix, as for most bugs, was easy once the core issue was identified. The other issues I talked about are, as far as I can tell, not stuff I can fix.

Saturday, April 23, 2022

Does that mean I now have to unit test my text-only websites?

I fixed the infinite redirections from Hell bug. And again, like most bugs, it was an easy fix—just don't redirect if you come from http://flummux.org/. It feels weird to think of having to test a text-only website, but there is a form of programming involved, so it shouldn't be as much of a surprise as it is.

Sigh.


“We're a local newspaper run by a non-local company, we don't care about European readers”

I was reading Conman's latest article, and he linked to a page called «Transilvania Times». I wanted to see it, but for the first time since the vote of the GPDR my visit was denied because I'm European.

gemini://station.martinrue.com/adou/f3868913db6e409eae9fa67845f70324

The “GPDR” is a typo—the author actually meant the GDPR. And it pains me to see something like this happen. Here's someone from Europe who was interested in reading a story about a person in a small US town and yet, they couldn't, because the owners of the news website (which isn't owned locally, but instead by a larger company in another state) probably don't care about European readers. The company does have a policy for California readers, so I don't see why it can't be extended for the GDPR. This is just so short-sighted.

Saturday, April 30, 2022

Musings on processing malformed Gemini (and web) requests

I'm still bothered by Gemini requests like gemini://gemini.conman.org//boston/2015/10/17.2. I thought it might be a simple bug, but now I'm not so sure. There's a client out there that has made 1,070 such requests, and if that was all, or even most, of the requests, then yes, that's probably a simple bug. But it's not. It turns out that only 4% of the requests from said client are malformed in that way, which to me indicates that something out there might be generating such links (and for this case, I checked, and I don't think I'm the cause this time).

I decided to see what happens on the web. I poked a few web sites with similar “double slash” requests and I got mixed results. Most of the sites just accepted them as is and served up a page. The only site that seemed to have issues with it was Hacker News, and I'm not sure what status it returned since it's difficult to obtain the status codes from browsers.

So, I have a few options.

  1. I can keep the current code and always reject such requests. In my mind, such requests have no meaning and are malformed, so why shouldn't I just reject them?
  2. I can send a permanent redirection to the “proper” location. This has the upside of maintaining a canonical link to each page, but with the downside of forcing clients through an additional request, and me having to live with the redundant requests in the log files. But it's obvious what resource is being requested, and sending a permanent redirect informs the client of the proper location.
  3. I can just silently clean up the request and carry on (a sketch of this follows the list). The upside—clean logs with only one request. The downside—two (or more) valid locations for content. On the one hand, this just feels wrong to me, as technically speaking, /foo and //foo should be different resources (as per Uniform Resource Identifier: Generic Syntax, /foo and /foo/ are technically different resources, so why not this case?). On the other hand, this issue is generally ignored by most web servers out there anyway, so there's that precedent. On the gripping hand, doing this just seems like a cop out and blindly following what the web does.
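
For what it's worth, option 3 is close to a one-liner in most languages. A sketch (in Java, not my server's actual code):

public class SlashDemo
{
  // Collapse any run of slashes in a request path down to a single slash.
  static String collapseSlashes(String path)
  {
    return path.replaceAll("/{2,}", "/");
  }

  public static void main(String[] args)
  {
    // prints /boston/2015/10/17.2
    System.out.println(collapseSlashes("//boston/2015/10/17.2"));
  }
}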

Well, how do current Gemini servers deal with it? Pretty much like existing web servers—most just treat multiple slashes as a single slash. I think my server is the outlier here. Now the question is—how pedantic do I want to be? Is “good enough” better than “perfect?”

Perhaps a better question is—why am I worrying about this anyway?

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.