The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Wednesday, April 13, 2022

It's been long gone, like fifteen years long gone. Why are you still asking?

About a month ago, I was checking my webserver logs when I noticed mutiple requests to pages that have long been marked as gone. The webserver has been returning HTTP status code 410 Gone, in some cases, for over fifteen years! At first I was annoyed—why are these webbots still requesting pages I've marked as gone? But then I started thinking about it—if I were writing a webbot to scan web pages, what would I do if I got a “gone” status? Well, I'd delete any references to said page, for sure. But when what if I came across the link on another page? I don't have the link (because I deleted it earlier) so let's add it to scan. Lather, rinse, repeat.

So there's a page or pages out there that are still linking to the pages that are long gone. And upon further investigation, I found the pages—my own site!

Sigh.

I've fixed some of the links—mostly the ones that have been causing some real issues with Gemini requests, but I still have scores of links to fix in the blog.

I also noticed a large number of permanent redirects, and again, the cause are pages on my own site linking to the non-canonical source. This isn't that much of an issue for HTTP (because the HTTP connection is still open for further requests) but it is one for Gemini (because each request is a separate connection to the server). I started fixing them, but when I did a full scan of the site (and it's mostly links on my blog) there are a significant number of links to fix—around 500 or so. And mostly in the first five years of entries.


Just an observation, nothing more

As I've been cleaning up blog entries (the first few years also have some pretty bad formatting issues) I've noticed something—I used to write a lot more back then. I think part of that is that the whole “blogging” thing was new, and after twenty-plus years, I've covered quite a bit of material. There have been multiple instances where I come across somthing, think I should blog about that, and when I check to see if I have, indeed blogged about that, I already have. Also, it's a bit more of a pain these days as I manually add links to my blogs at Me­Linked­Insta­My­Face­In­Gram­Space­Book­We. This used to be automated in the past, but Insta­My­Face­Me­Linked­We­In­Gram­Space­Book doesn't play well with others and with constant API updates and walled garden policies, it's sadly, easier to manually update links than it is to automate it (also this chart). I mean, I don't have to update links at My­Face­Me­Linked­Insta­In­Gram­Space­Book­We, but pretty much everybody just reads Me­Linked­Insta­My­Face­We­In­Gram­Space­Book (Web? What's that?) which is why I bother at all.


Can someone explain to me why this happens?

I don't understand.

It's not just the MJ12Bot that can't parse links. It seems many web robots can't parse links correctly. Last month there were requests like:

I mean, the request /2022/02/23.1 does exist on my server, but not (decoded) /\"/2022/02/23.1\".

What?

That's worse than what MJ12Bot was sending back in the day.

And it's not like it's a single web robot making these requests—no! It's three different web robots!

I just … what?

Obligatory Picture

[The future's so bright, I gotta wear shades]

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.