Wednesday, April 13, 2022
It's been long gone, like fifteen years long gone. Why are you still asking?
About a month ago, I was checking my webserver logs when I noticed multiple requests to pages that have long been marked as gone. The webserver has been returning HTTP status code 410 Gone, in some cases, for over fifteen years! At first I was annoyed—why are these webbots still requesting pages I've marked as gone? But then I started thinking about it—if I were writing a webbot to scan web pages, what would I do if I got a “gone” status? Well, I'd delete any references to said page, for sure. But what if I then came across the link on another page? I don't have the link (because I deleted it earlier), so let's add it to the scan queue. Lather, rinse, repeat.
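If I had to sketch that logic in code (and this is purely hypothetical, in Python, with fake pages and a fake fetch(); I have no idea what these robots actually run), it would look something like this:

PAGES = {
    "/":      (200, ["/gone", "/about"]),   # the home page links to a dead page
    "/about": (200, ["/gone"]),             # ... and so does another page
    "/gone":  (410, []),                    # marked gone fifteen years ago
}

def fetch(url):                     # stand-in for a real HTTP request
    return PAGES.get(url, (404, []))

index = {}                          # url -> links found on that page
queue = ["/"]

for _ in range(12):                 # a few crawl passes
    if not queue:
        queue = list(index)         # periodic re-crawl of known pages
    url = queue.pop()
    status, links = fetch(url)
    if status == 410:
        for found in index.values():
            while url in found:     # delete any references to said page ...
                found.remove(url)
        print("410 Gone, forgetting", url)
        continue                    # ... but nothing remembers it was gone
    index[url] = list(links)
    queue += [l for l in links if l not in index]   # "new" link? scan it!

Run that and it cheerfully requests /gone over and over, because nothing remembers the page was gone. Keeping a separate set of URLs that returned 410 Gone would break the cycle, but apparently that's too much to ask.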
So there's a page or pages out there that are still linking to the pages that are long gone. And upon further investigation, I found the pages—my own site!
Sigh.
I've fixed some of the links—mostly the ones that have been causing some real issues with Gemini requests, but I still have scores of links to fix in the blog.
I also noticed a large number of permanent redirects, and again, the cause is pages on my own site linking to the non-canonical location. This isn't much of an issue for HTTP (because the HTTP connection stays open for further requests) but it is one for Gemini (because each request is a separate connection to the server). I started fixing them, but when I did a full scan of the site (and it's mostly links on my blog) I found a significant number of links to fix—around 500 or so, and mostly in the first five years of entries.
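For what it's worth, the scan itself doesn't have to be fancy. Here's roughly the idea as a sketch in Python (with a crude regex instead of a proper HTML parser, and a hypothetical htdocs/ directory standing in for my actual site layout). It makes a HEAD request for each absolute link and reports the ones that come back as a permanent redirect:

import glob
import re
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, *args, **kwargs):
        return None             # report the redirect instead of following it

opener = urllib.request.build_opener(NoRedirect)

for path in glob.glob("htdocs/**/*.html", recursive=True):
    with open(path, encoding="utf-8") as f:
        html = f.read()
    for url in re.findall(r'href="(https?://[^"]+)"', html):
        try:
            opener.open(urllib.request.Request(url, method="HEAD"), timeout=10)
        except urllib.error.HTTPError as e:
            if e.code in (301, 308):
                print(f"{path}: {url} -> {e.headers.get('Location')}")
        except urllib.error.URLError:
            pass                # dead hosts are a different cleanup job

Each link it flags can then be rewritten to point directly at the canonical location, which saves a Gemini client a pointless extra connection.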
Just an observation, nothing more
As I've been cleaning up blog entries (the first few years also have some pretty bad formatting issues) I've noticed something—I used to write a lot more back then. I think part of that is that the whole “blogging” thing was new, and after twenty-plus years, I've covered quite a bit of material. There have been multiple instances where I come across something, think I should blog about that, and when I check to see if I have, indeed, blogged about it, I find I already have. Also, it's a bit more of a pain these days as I manually add links to my blog entries at MeLinkedInstaMyFaceInGramSpaceBookWe. This used to be automated, but InstaMyFaceMeLinkedWeInGramSpaceBook doesn't play well with others, and with constant API updates and walled garden policies, it's, sadly, easier to update links manually than it is to automate it (also this chart). I mean, I don't have to update links at MyFaceMeLinkedInstaInGramSpaceBookWe, but pretty much everybody just reads MeLinkedInstaMyFaceWeInGramSpaceBook (Web? What's that?), which is why I bother at all.
Can someone explain to me why this happens?
I don't understand.
It's not just the MJ12Bot that can't parse links correctly. It seems many web robots have the same problem. Last month there were requests like:
/%5C%22gemini://gemini.ctrl-c.club/~stack/gemlog/2022-02-16.tls.gmi%5C%22
/%5C%22/2022/02/23.1%5C%22
/%5C%22https://news.ycombinator.com/item?id=30091336%5C%22
/%5C%22http://thecodelesscode.com/case/219%5C%22
I mean, the request /2022/02/23.1 does exist on my server, but not (decoded) /\"/2022/02/23.1\".
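If you don't feel like decoding percent-escapes in your head: %5C is a backslash and %22 is a double quote, which is trivial to confirm with Python:

from urllib.parse import unquote

print(unquote("/%5C%22/2022/02/23.1%5C%22"))   # prints /\"/2022/02/23.1\"

My guess (and it's only a guess) is that these robots scraped their links out of some backslash-escaped copy of the markup, where href="..." shows up as href=\"...\", and dutifully kept the \" as part of the URL.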
What?
That's worse than what MJ12Bot was sending back in the day.
And it's not like it's a single web robot making these requests—no! It's three different web robots!
I just … what?