Friday, March 21, 2025
Now a bit about feed readers
There are a few bots acting less than optimally that aren't some LLM-based company scraping my site. I think. Anyway, the first one I mentioned:
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1667 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1419 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 938 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 811 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 94 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 17 |
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1579 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1481 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 905 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 741 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 90 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 11 |
This is feedly, a company that offers a news reader (and I'd like to thank the 67 subscribers I have—thank you). The first issue I have with this client is the apparently redundant requests from six different clients. An issue because I only have three different feeds: the Atom feed, the RSS feed, and the JSON feed. The poller seems to be acting correctly—16 subscribers to my Atom feed and 6 to the RSS feed. The other four? The fetchers? I'm not sure what's going on there. There's one for the RSS feed, and three for the Atom feed. And one of them is a typo—it's requesting “//index.atom” instead of the proper “/index.atom” (but apparently Apache allows it). How do I have 16 subscribers to “/index.atom” and another 37 for “//index.atom”? What, exactly, is the difference between the two? And can't you fix the “//index.atom” reference? To me, that's an obvious typo, one that could be verified by retrieving both “/index.atom” and “//index.atom” and seeing they're the same.
Anyway, the second issue I have with feedly is their apparent lack of caching on their end. They don't make conditional requests, and while they aren't exactly slamming my server, they are making multiple requests per hour for a resource that doesn't change all that often (excluding today, that is).
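For reference, a conditional request costs a client exactly two extra header lines—it just echoes back the validators from the previous response. A sketch of the exchange (validator values made up):

```http
GET /index.atom HTTP/1.1
Host: boston.conman.org
Accept-Encoding: gzip
If-None-Match: "68d1-631f44a22f280"
If-Modified-Since: Fri, 21 Mar 2025 12:00:00 GMT

HTTP/1.1 304 Not Modified
ETag: "68d1-631f44a22f280"
```

If the feed hasn't changed, the server answers 304 Not Modified with no body—a few dozen bytes instead of the entire feed.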
Then there's the bot at IP address 4.231.104.62. It made 43,236 requests to get “/index.atom”, 5 invalid requests in the form of “/gopher://gopher.conman.org/0Phlog:2025/02/…” and one other valid request for this page. It's not the 5 invalid requests or the 1 valid request that has me weirded out—it's the 43,236 to my Atom feed. That's one request every 55 seconds! And even worse—it's not a conditional request! Of all the bots, this is the one I feel most like blocking at the firewall level—just have it drop the packets entirely.
At least it supports compressed results.
Sheesh.
As for the rest—of the 109 bots that fetched the Atom feed at least once per day (I put the cutoff at 28 or more requests during February), only 31 did so conditionally. That's a horrible rate. And of the 31 that did fetch conditionally, most don't support compression. So on the one hand, the majority of bots that fetch the Atom feed accept compressed results; on the other hand, most of the bots that fetch conditionally do not.
Sigh.
Still no information on who “The Knowledge AI” is or was
Back in July 2019 I was investigating some bad bots on my website when I came across a bot that identified itself simply as “The Knowledge AI”—the number one robot hitting my site. Most bots that identify themselves will give a URL to a page that describes their usage, like Barkrowler (to pick one that recently crawled my site). But not so “The Knowledge AI”. That was all it said: “The Knowledge AI”. It was very hard to Google, but I wouldn’t be surprised if it was OpenAI.
The earliest I can find “The Knowledge AI” crawling my site was April of 2018, and despite starting on April 16th, it was the second most active robot that month. In May it was the number one bot, and it stayed there through October of 2022—about 4½ years—after which it pretty much dropped, from 32,000+ requests in October of 2022 to 85 in November of 2022. It was sporadic after that, showing up in single-digit hits until January of 2024. It may still be crawling my site, but if it is, it is no longer identifying itself.
I don’t know if “The Knowledge AI” was an LLM company crawling, but if it was, not giving a link to explain the bot is suspicious. It’s the rare crawler that doesn’t identify itself with at least a URL to describe it. The fact that it took the number one crawling spot on my site for 4 ½ years is suspicious. As robots go, it didn’t affect the web server all that much (I’ve come across worse ones), and well over 90% of its requests were valid (unlike MJ12, which had a 75% failure rate). And my /robots.txt file doesn’t exclude any robot from scanning, so I can’t really complain about it.
My comment on “Mitigating SourceHut's partial outage caused by aggressive crawlers | Lobsters”
Even though the log data is a few years old, I don't think that IPs change from ASN to ASN all that much (but I could be wrong on that). I checked the IPs used by “The Knowledge AI” in May 2018, and in October 2022, and they didn't change that much. They were still the same /24 networks across that time.
Looking up the information today is very disappointing—Hurricane Electric LLC., a backbone provider.
So no real information about who “The Knowledge AI” might have been.
Sigh.
A deeper dive into mapping web requests via ASN, not by IP address
I went ahead and replaced IP addresses with ASNs in the log file to find the network that sent the most requests to my blog for the month of February.
AS | Requests |
---|---|
MICROSOFT-CORP-MSN-AS-BLOCK, US | 78889 |
OVH, FR | 31837 |
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN | 25019 |
HETZNER-AS, DE | 23840 |
GOOGLE-CLOUD-PLATFORM, US | 21431 |
CSTL, US | 17225 |
HURRICANE, US | 15495 |
AMAZON-AES, US | 14430 |
FACEBOOK, US | 13736 |
AKAMAI-LINODE-AP Akamai Connected Cloud, SG | 12673 |
Even though Alibaba US has the most unique IPs hitting my blog, Microsoft is still the network making the most requests. So let's see how Microsoft presents itself to my web server. Here are the user agents it sends:
agent | requests |
---|---|
Go-http-client/2.0 | 43236 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 23978 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36 | 7953 |
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 | 2955 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot | 210 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | 161 |
DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html) | 123 |
'DuckDuckBot-Https/1.1; (+https://duckduckgo.com/duckduckbot)' | 122 |
Python/3.9 aiohttp/3.10.6 | 28 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.36 Safari/537.36 | 14 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.114 Safari/537.36 | 14 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.68 | 10 |
DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html) | 10 |
DuckAssistBot/1.1; (+http://duckduckgo.com/duckassistbot.html) | 10 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 | 6 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.143 Safari/537.36 | 6 |
python-requests/2.32.3 | 5 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.142 Safari/537.36 | 5 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 | 4 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0 | 4 |
DuckDuckBot-Https/1.1; (+https://duckduckgo.com/duckduckbot) | 4 |
Twingly Recon | 3 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot) | 3 |
Mozilla/5.0 (compatible; Twingly Recon; twingly.com) | 3 |
python-requests/2.28.2 | 2 |
newspaper/0.9.1 | 2 |
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 | 2 |
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b | 2 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36 | 2 |
http.rb/5.1.1 (Mastodon/4.2.10; +https://trystero.social/) Bot | 1 |
http.rb/5.1.1 (Mastodon/4.2.10; +https://trystero.social/) | 1 |
Mozilla/5.0 (Windows NT 6.1; WOW64) SkypeUriPreview Preview/0.5 skype-url-preview@microsoft.com | 1 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 1 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36 | 1 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48 | 1 |
Mastodon/4.4.0-alpha.2 (http.rb/5.2.0; +https://sns.mszpro.com/) Bot | 1 |
Mastodon/4.4.0-alpha.2 (http.rb/5.2.0; +https://sns.mszpro.com/) | 1 |
Mastodon/4.3.3 (http.rb/5.2.0; +https://the.voiceover.bar/) Bot | 1 |
Mastodon/4.3.3 (http.rb/5.2.0; +https://the.voiceover.bar/) | 1 |
Mastodon/4.3.3 (http.rb/5.2.0; +https://discuss.systems/) Bot | 1 |
Mastodon/4.3.3 (http.rb/5.2.0; +https://discuss.systems/) | 1 |
The top result comes from a single IP address and probably requires a separate post about it, since it's weird and annoying. But the rest—you got Bing, you got OpenAI, you got several Mastodon instances—it seems like most of these are from Microsoft's cloud offering. A mixture of things.
What about Facebook?
agent | requests |
---|---|
meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) | 13497 |
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) | 207 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 | 12 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 | 4 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 | 4 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 | 4 |
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/59.0 | 4 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 Edg/132.0.0.0 | 2 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 | 2 |
Hmm … looks like I have a few readers at Facebook, but other than that, nothing terribly interesting.
Alibaba, on the other hand, is frightening. Out of 25,019 requests, it presented 581 different user agents. From looking at what was requested, I don't think it's 500 Chinese people reading my blog—it's definitely bots crawling my site (and amusingly, there are requests for the /robots.txt file, but without a proper user agent to go by, it's hard to block them via that file).
I can think of one conclusion here—filtering by ASN can help tremendously, but it comes with the risk of blocking legitimate traffic.
A different approach to blocking bad webbots by IP address
Web crawlers for LLM-based companies, as well as some specific solutions to blocking them, have been making the rounds in the past few days. I was curious to see just how many were hitting my web site, so I ran a few queries over the log files. To ensure consistent results, I decided to query the log file for last month:
total requests | 468439 |
unique IPs | 24654 |
IP | Requests |
---|---|
4.231.104.62 | 43242 |
198.100.155.33 | 26650 |
66.55.200.246 | 9057 |
74.80.208.170 | 8631 |
74.80.208.59 | 8407 |
216.244.66.239 | 5998 |
4.227.36.126 | 5832 |
20.171.207.130 | 5817 |
8.29.198.26 | 4946 |
8.29.198.25 | 4807 |
(Note: I'm not concerned about protecting any privacy here—given the number of results, there is no way these are any individual. These are all companies hitting my site, and if companies are mining their data for my information, I'm going to do the same to them. So there.)
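For reference, numbers like those above fall out of a stock Apache access log with shell one-liners—a sketch (not my actual queries), assuming the common/combined log format where the client IP is the first field:

```sh
# total requests
wc -l < access.log

# unique IPs
awk '{ print $1 }' access.log | sort -u | wc -l

# top 10 IPs by request count
awk '{ print $1 }' access.log | sort | uniq -c | sort -rn | head -10
```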
But it became apparent that it's hard to determine which requests are coming from a single entity—it's clear that a company can employ a large pool of IP addresses to crawl the web, and it's hard to figure out what IPs are under control of which company.
Or is it?
An idea suddenly hit me—a stray thought from the days when I wore a network admin hat. I recalled that BGP routing basically knows the network boundaries, as it's based on policy routing via ASNs. I wondered: could I map IP addresses to ASNs? A quick search and I found my answer—yes! Within a few minutes, I had converted the list of 24,654 unique IP addresses into 1,490 unique networks. I was then able to rework my initial query to include the ASN (or rather, the human-readable name instead of just the number):
IP | Requests | AS |
---|---|---|
4.231.104.62 | 43242 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
198.100.155.33 | 26650 | OVH, FR |
66.55.200.246 | 9057 | BIDDEFORD1, US |
74.80.208.170 | 8631 | CSTL, US |
74.80.208.59 | 8407 | CSTL, US |
216.244.66.239 | 5998 | WOW, US |
4.227.36.126 | 5832 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
20.171.207.130 | 5817 | MICROSOFT-CORP-MSN-AS-BLOCK, US |
8.29.198.26 | 4946 | FEEDLY-DEVHD, US |
8.29.198.25 | 4807 | FEEDLY-DEVHD, US |
Now, I was curious as to how they identified themselves, so I reran the query to include the user agent string. The top eight identified themselves consistently:
Agent | Requests |
---|---|
Go-http-client/2.0 | 43236 |
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/132.0.0.0 Safari/537.36 | 26650 |
WF search/Nutch-1.12 | 9057 |
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | 8631 |
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | 8407 |
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) | 5998 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 5832 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 5817 |
The last two, however, had a changing user agent string:
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1667 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1419 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 938 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 811 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 94 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 17 |
Agent | Requests |
---|---|
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) | 1579 |
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) | 1481 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) | 905 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) | 741 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) | 90 |
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) | 11 |
I'm not sure what the difference is between polling and fetching (checking the URLs shows two identical pages, differing only in “Poller” and “Fetcher”). But looking deeper into that is for another post.
The next query I ran was to see how many IPs (that hit my site in February) map to a particular ASN. The top 10:
AS | Count |
---|---|
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN | 4034 |
AMAZON-02, US | 1733 |
HWCLOUDS-AS-AP HUAWEI CLOUDS, HK | 1527 |
GOOGLE-CLOUD-PLATFORM, US | 996 |
COMCAST-7922, US | 895 |
AMAZON-AES, US | 719 |
TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN | 635 |
MICROSOFT-CORP-MSN-AS-BLOCK, US | 615 |
AS-VULTR, US | 599 |
ATT-INTERNET4, US | 472 |
So Alibaba US crawled my site from 4,034 different IP addresses. I haven't done the query to figure out how many requests each ASN made, but it should be straightforward to replace each IP address with its ASN to get a better count of which company is crawling my site the hardest.
And now I'm thinking: instead of ad-hoc banning of single IP addresses, or blocking huge swaths of IP addresses (like 47.0.0.0/8), might it not be better to block per ASN? The IP-to-ASN mapping service I found makes it quite easy to get the ASN of an IP address (and to map the ASN to a human-readable name). Instead of, for example, blocking 101.32.0.0/16, 119.28.0.0/16, 43.128.0.0/14, 43.153.0.0/16 and 49.51.0.0/16 (which isn't an exhaustive list by any means), just block IPs belonging to ASN 132203, otherwise known as “TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN.”
I don't know how effective that idea is, but the IP-to-ASN site I found does offer the information via DNS, so it shouldn't be that hard to do.
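For the record, one well-known service that offers this over DNS is Team Cymru's IP-to-ASN mapping (whether it's the one I found or not, the mechanics are the same). A sketch of the lookups, with responses abbreviated and fields I can't vouch for elided:

```sh
# IP to origin ASN: reverse the octets and query the origin zone
dig +short TXT 62.104.231.4.origin.asn.cymru.com
# "8075 | … | US | arin | …"

# ASN to a human-readable name
dig +short TXT AS8075.asn.cymru.com
# "8075 | US | arin | … | MICROSOFT-CORP-MSN-AS-BLOCK, US"
```

From there, blocking by ASN is a matter of collecting the prefixes a given ASN announces and feeding them to the firewall.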
Wednesday, March 19, 2025
How I vibe code
There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
Via Flutterby, Andrej Karpathy on X
Good Lord! If you thought software today was bloated and slow, this sounds like it would produce software that is gigantically glacial in comparison (and by “embrace exponentials” I think he means “accept code with O(n²), O(2ⁿ) or even O(n!) behavior”).
That's not how I would “vibe code.” No, to me, “vibe coding” is:
- Don't necessarily worry about the behavior of the code—make it work, but at least try to avoid O(2ⁿ) or worse algorithms, then make it right, then fast.
- Don't use version control! If you make a mistake and need to revert, revert by hand, or carry on through the bad code. And avoid using directories like “src.1/”, “src.2/” or “src-no-really-this-works/”—that's still a form of version control (albeit a poor man's version control). Power through your mistakes.
- Don't bother with “unit tests,” “integration tests,” TDD or even BDD. I'm not saying don't test, just don't write tests. Want to refactor? Go ahead—bull through the changes, or don't. It's your code. Yes, this does mean mostly manual testing, and having a file of test data is fine—just don't write test code.
- Format the code however you want! Form your own opinions on formatting. Have some soul in your code for once.
- This isn't a team sport, so no pair programming! This is vibe coding, not vibe partying.
- Remember the words of Bob Ross: “we don't make mistakes, just happy little accidents.”
- Go with the flow. Just Do It™!
Now that I think about it, this is pretty much how programmers wrote code on home computers in the late 70s/early 80s. Funny that. But just blindly accepting LLM-written code? Good luck in getting anything to run correctly.
Sheesh.
Tuesday, March 18, 2025
Who serves whom?
The narrative around these bots is that [AIs] are there to help humans. In this story, the hospital buys a radiology bot that offers a second opinion to the human radiologist. If they disagree, the human radiologist takes another look. In this tale, AI is a way for hospitals to make fewer mistakes by spending more money. An AI assisted radiologist is less productive (because they re-run some x-rays to resolve disagreements with the bot) but more accurate.
In automation theory jargon, this radiologist is a "centaur" – a human head grafted onto the tireless, ever-vigilant body of a robot
Of course, no one who invests in an AI company expects this to happen. Instead, they want reverse-centaurs: a human who acts as an assistant to a robot. The real pitch to hospital is, "Fire all but one of your radiologists and then put that poor bastard to work reviewing the judgments our robot makes at machine scale."
Pluralistic: AI can't do your job (18 Mar 2025) – Pluralistic: Daily links from Cory Doctorow
This has always been my fear with the recent push of LLM-backed AI—not that it would help me do my job better, but that I would exist to help it do its job better (if I'm even there).
A network of bloggers, a reel of YouTubers and other collective nouns
While I just made up the “network of bloggers” and “reel of YouTubers,” other collective nouns for groups—like a gaggle of geese, a murder of crows, or a pod of whales—are not quite as old as they may seem: they were largely made up just a few hundred years ago, and there were a lot more of them than we use today, according to this video. Neat.
Measuring the cosmos, part II
Last month, I mentioned part one of how we measured the night sky, and now, part two of “Terence Tao on how we measure the cosmos”.
Monday, March 03, 2025
A quirk of the Motorola 6809 assemblers
I just learned an interesting bit of trivia about 6809 assembly language on a Discord server today.
When Motorola designed the 6809 assembler, they made a distinction between the use of n,PC and n,PCR in the indexing mode. Both of those make a reference based off the PC register, but in the assembly language they defined, using n,PC means use the literal value of n as the distance, whereas n,PCR means generate the distance between n and the current value of the PC register.

I never knew that.
I just looked, and all the materials I had on the 6809 use the n,PCR method everywhere, yet when I wrote my assembler, I only support n,PC and it always calculates the distance. I think I forgot that it should have been n,PCR because on the 68000 (which I also programmed, and which was also made by Motorola) it always used n,PC.
And I don't think I'll change my assembler, as there does exist a method to use an arbitrary value of n as a distance: LDA (*+3)+n,PC. The asterisk evaluates to the address of the current instruction, and by adding 3 you get the address of the next instruction, which in the PC-relative addressing mode is a distance of 0. Then n will be the actual offset used in the instruction. Yes, it's a bit convoluted, but it's a way to get how Motorola originally defined n,PC.
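To make the semantics concrete, here's a sketch (hypothetical label and offset, illustrating the behavior described above rather than any particular assembler's output):

```asm
        LDA   msg,PCR     ; assembler computes the offset (msg - PC);
                          ; A is loaded from "msg" wherever the code runs
        LDA   5,PC        ; per Motorola: the literal 5 IS the offset, so
                          ; A is loaded from 5 bytes past this instruction
        LDA   (*+3)+5,PC  ; the workaround on my assembler: this 3-byte
                          ; instruction nets out to a literal offset of 5
msg     FCB   $55         ; a byte of data to reference
```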
And apparently, Motorola defined it that way to make up for less intelligent assemblers back in the day due to memory constraints. We are long past those days.
Yelling at clouds
I will admit—these are kneejerk reactions, but they're honestly my reactions to reading the following statements. I know, I know, hanging onions off our belt is long out of style.
And get off my lawn!
Anyway … statement the first:
Think jq, but without having to ask an LLM to write the query for you.
Via Lobsters, A float walks into a gradual type system
So … using jq is so hard you need to use a tool that will confabulate ¼ of the time in order to construct a simple query? Is that what you are saying? That you can't be bothered to use your brain? Just accept the garbage spewed forth by a probabilistic text slinger? Really?
And did you use an LLM to help write the code? If not, why not?
Sigh.
And statement the second:
… and most importantly, coding can be social and fun again.
Via Lobsters, introducing tangled
If I had known that programming would become a team sport, I, an introvert, would have chosen a different career. Does XXXXXXX everything have to be social? Why can't it just be fun? Do I need to be micromanaged as well?
Saturday, March 01, 2025
Fixing a 27 year old bug that only now just got triggered
I will, from time to time, look at various logs for errors. And when I looked at the error log for my web server, intermixed with errors I have no control over like this:
```
[Tue Feb 25 10:41:19.504140 2025] [ssl:error] [pid 16571:tid 3833293744] [client 206.168.34.92:47678] AH02032: Hostname literature.conman.org provided via SNI and hostname 71.19.142.20 provided via HTTP have no compatible SSL setup
[Tue Feb 25 12:39:33.768053 2025] [ssl:error] [pid 16408:tid 3892042672] [client 167.94.146.59:50798] AH02032: Hostname hhgproject.org provided via SNI and hostname 71.19.142.20 provided via HTTP have no compatible SSL setup
[Sat Mar 01 05:34:44.029898 2025] [core:error] [pid 21954:tid 3841686448] [client 121.36.96.194:53710] AH10244: invalid URI path (/cgi-bin/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/bin/sh)
[Sat Mar 01 05:34:45.077056 2025] [core:error] [pid 23369:tid 3875257264] [client 121.36.96.194:53722] AH10244: invalid URI path (/cgi-bin/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/bin/sh)
```
I found a bunch of errors that I found concerning:
```
[Sun Feb 23 10:14:54.644036 2025] [cgid:error] [pid 16408:tid 3715795888] [client 185.42.12.144:51022] End of script output before headers: contact.cgi, referer: https://www.hhgproject.org/contact.cgi
contact.cgi: src/Cgi/UrlDecodeChar.c:41: UrlDecodeChar: Assertion `((*__ctype_b_loc ())[(int) ((*src))] & (unsigned short int) _ISxdigit)' failed.
```
It's obvious that a call to assert() failed in the function UrlDecodeChar(), due to some robot failing to encode a web request properly. Let's see what the code is actually doing:
```c
char UrlDecodeChar(char **psrc)
{
  char *src;
  char  c;

  assert(psrc  != NULL);
  assert(*psrc != NULL);

  src = *psrc;
  c   = *src++;
  if (c == '+')
    c = ' ';
  else if (c == '%')
  {
    assert(isxdigit(*src));
    assert(isxdigit(*(src+1)));
    c    = ctohex(*src) * 16 + ctohex(*(src+1));
    src += 2;
  }

  *psrc = src;
  return(c);
}
```
The problem was using assert() to check the results of some I/O—that's not what assert() is for. I think I was being lazy when I used those assertions and didn't bother with the proper coding practice of returning an error. Curious as to when I added this code, I checked the history, and from December 3rd, 2004:
```c
char UrlDecodeChar(char **psrc)
{
  char *src;
  int   c;

  ddt(psrc  != NULL);
  ddt(*psrc != NULL);

  src = *psrc;
  c   = *src++;
  if (c == '+')
    c = ' ';
  else if (c == '%')
  {
    ddt(isxdigit(*src));
    ddt(isxdigit(*(src+1)));
    c    = ctohex(*src) * 16 + ctohex(*(src+1));
    src += 2;
  }

  *psrc = src;
  return(c);
}
```
The history in the current repository goes no further back (due to losing my CVS repositories), and it's interesting to see that this function is the same as it was back then (the difference being the use of my own version of assert(), called ddt(), back in the day). Some further sleuthing convinced me that I wrote this code back in 1997. This function is old enough to not only vote, be drafted, get drunk, and sign contracts, but to be removed from its parents' health insurance! Good lord!

It's not how I would write that function today. It's even more remarkable that I haven't seen this assert() trigger in all those years.
The fix was easy:
```c
char UrlDecodeChar(char **psrc)
{
  char *src;
  char  c;

  assert(psrc  != NULL);
  assert(*psrc != NULL);

  src = *psrc;
  c   = *src++;
  if (c == '+')
    c = ' ';
  else if (c == '%')
  {
    if (!isxdigit(*src))     return '\0';
    if (!isxdigit(*(src+1))) return '\0';
    c    = ctohex(*src) * 16 + ctohex(*(src+1));
    src += 2;
  }

  *psrc = src;
  return(c);
}
```
And propagating the error back up the call chain. This does result in a new major version for CGILib—I do follow semantic versioning, and this is, technically speaking, a change in the public API—even though the change is less than 10 lines of code (out of 8,000+).
Tuesday, February 11, 2025
I never got the memo on “copyover servers”
There’s only so much you can do with builder rights on someone else’s MUD. To really change the game, you needed to be able to code, and most MUDs were written in “real languages” like C. We’d managed to get a copy of Visual C++ 6 and the CircleMUD source code, and started messing about. But the development cycle was pretty frustrating — for every change, you had to recompile the server, shut it down (dropping everyone’s connections), bring it back up, and wait for everyone to log back in.
Some MUDs used a very cool trick to avoid this, called “copyover” or “hotboot”. It’s an idiom that lets a stateful server replace itself while retaining its PID and open connections. It seemed like magic back then: you recompiled the server, sent the right command, everything froze for a few seconds, and (if you were lucky) it came back to life running the latest code. The trick is simple but I can’t find a detailed write-up, so I wanted to write it out while I thought of it.
Via Lobsters, How Copyover MUD Servers Worked | Blog | jackkelly.name
Somehow, in all my years of programming (and the few years I was looking into the source code of various MUDs back in the early 90s) I never came across this method of starting an updated version of a server without losing any network connections. In hindsight, it's an obvious solution—it just never occurred to me to do this.
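The mechanics, as I understand them, boil down to something like this (a sketch of the idiom, not the article's actual code—descriptors without FD_CLOEXEC survive exec(), and the process keeps its PID):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void copyover(int listenfd,char const *binary)
{
  char fdarg[16];

  /* player state and per-connection fds would be saved to a file here */
  snprintf(fdarg,sizeof(fdarg),"%d",listenfd);
  execl(binary,binary,"--copyover",fdarg,(char *)NULL);
  perror("execl"); /* only reached if the exec failed */
}

int main(int argc,char *argv[])
{
  int listenfd = -1;

  if ((argc == 3) && (strcmp(argv[1],"--copyover") == 0))
    listenfd = atoi(argv[2]); /* hot boot: reuse the inherited socket */
  /* else: normal boot---socket(), bind(), listen() */

  /* ... main loop; upon the hotboot command: copyover(listenfd,argv[0]); */
  (void)listenfd;
  return 0;
}
```

The new process rereads the saved state, reclaims the inherited descriptors, and the players never notice the reboot beyond a brief pause.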
Discussions about this entry
Two videos on how we figured out our solar system just based on observations alone, long before we left the surly bonds of Earth
The video “Terence Tao on how we measure the cosmos” was very interesting to watch, as Terence goes into depth on how people in the past—and by past, I mean the distant past—figured out the earth was a sphere, how big that sphere was, and even reasoned that the earth went around the sun, long before the Christian Church even existed! He also covers the method Kepler used to figure out the orbits of Earth and the planets, when at the time we didn't quite know the distances to them, and all we had were positions in the sky to go by.
Incredible.
Also, a second video on how the moons of Jupiter (yes, it's not at all about Pluto despite the title) revealed much about how our solar system works. It even revealed that light had a finite speed.
I think if these methods were more widely known—how we figured out the shape of the Earth, the size of the moon and sun, and how orbits work—then people wouldn't have the mistaken belief of a flat earth holding up the firmaments.
Update on Tuesday, March 18th, 2025
Part Two of “Terence Tao on how we measure the cosmos” has been released.
Tuesday, February 04, 2025
Concurrency is tricky
As I was writing the previous entry I got the nagging feeling that something wasn't quite right with the code. I got distracted yesterday helping a friend bounce programming issues off me, but after that, I was able to take a good look at the code and figured out what I did wrong.
Well, not “wrong” per se—the code worked—it's just that it could fail catastrophically under the right conditions (or maybe the wrong conditions, depending upon your view).

But first, a bit about how my network server framework works. The core bit of code is this:
```lua
local function eventloop(done_f)
  if done_f() then return end

  -- calculate and handle timeouts
  -- each coroutine that timed out is
  -- scheduled to run on the RUNQUEUE,
  -- with nil and ETIMEDOUT.

  SOCKETS:wait(timeout)
  for event in SOCKETS:events() do
    event.obj(event)
  end

  while #RUNQUEUE > 0 do
    -- run each coroutine in the run queue until
    -- it either yields or returns (meaning
    -- it's finished running).
  end

  return eventloop(done_f)
end
```
Details are elided (the full gory details are here) but in general, the event loop calls a passed-in function to check if we need to shut down, then calculates a timeout value while checking for coroutines that registered a timeout. If any did, we add the coroutine to a run queue with nil and ETIMEDOUT to inform the resuming coroutine that it timed out. Then we scan a set of network sockets for activity with SOCKETS:wait() (on Linux, this ends up calling epoll_wait(); on BSDs, kqueue(); and on most other Unix systems, poll()). We then call the handling function for each event. These can end up creating new coroutines and scheduling coroutines to run (these will be added to the run queue). And then for each coroutine in the run queue, we run it. Lather, rinse, repeat.
Simple enough.
Now, on to the code I presented. This code registers a function to run when the given UDP socket receives a packet of data, and schedules a number of coroutines waiting for data to run. This happens in the eventloop() function.
```lua
nfl.SOCKETS:insert(lsock,'r',function()
  local _,data,err = lsock:recv()
  if data then
    for co in pairs(clients) do
      nfl.schedule(co,data) -- problem
    end
  else
    syslog('error',"recv()=%s",errno[err])
  end
end)
```
I've noted a problematic line of code here.
And now the core of the routine to handle a TLS connection. This code yields to receive data, then writes the data to the TLS socket.
```lua
while true do
  local data = coroutine.yield()
  if not data then break end
  local okay,errmsg = ios:write(data,'\n') -- <<< HERE
  if not okay then
    syslog('error',"tls:read() = %s",errmsg)
    break
  end
end
```
I've marked where the root cause lies, and it's pretty subtle, I think. The core issue is that ios:write() here could block, because the kernel output buffer is full and we need to wait for the kernel to send it. But the code that handles the UDP socket just assumes that the TLS coroutine is ready for more data. If ios:write() blocks and more UDP data comes in, the coroutine is prematurely resumed with the data, but that's just taken by the TLS coroutine as the write being successful; it then yields, and then things get … weird, as the UDP side and the TLS side are now out of sync with each other.

This, fortunately, hasn't triggered on me. Yet. It could, if too much was being logged to syslog.
I wrote the following code to test it out:
```c
#include <syslog.h>

#define MSG " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXUZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"

int main(void)
{
  int i;

  for (i = 0 ; i < 500 ; i++)
    syslog(LOG_DEBUG,"%3d " MSG MSG MSG,i);
  return 0;
}
```
And sure enough, the spice data stopped flowing.
What I needed to do was queue up the log messages to a given client, and only schedule it to run when it's waiting for more data. A few failed attempts followed—they were all based on scheduling the TLS thread when X number of messages were queued up (I tried one, then zero; neither worked). It worked much better by using a flag to indicate when the TLS coroutine wanted to be scheduled or not.
The UDP socket code is now:
```lua
nfl.SOCKETS:insert(lsock,'r',function()
  local _,data,err = lsock:recv()
  if data then
    for co,queue in pairs(clients) do
      table.insert(queue,data)
      if queue.ready then
        nfl.schedule(co,true)
      end
    end
  else
    syslog('error',"recv()=%s",errno[err])
  end
end)
```
The client list now contains a list of logs to send, along with a flag that the TLS coroutine sets indicating if it needs running or not. This takes advantage of Lua's tables which can have a hash part (named indices) and an array part, so we can include a flag in the queue.
And now the updated TLS coroutine:
```lua
local function client_main(ios)
  local function main()
    while #clients[ios.__co] > 0 do
      local data = table.remove(clients[ios.__co],1)
      local okay,err = ios:write(data,'\n')
      if not okay then
        syslog('error',"tls:write()=%s",err)
        return
      end
    end
    clients[ios.__co].ready = true
    if not coroutine.yield() then return end
    clients[ios.__co].ready = false
    return main()
  end

  ios:_handshake()

  if ios.__ctx:peer_cert_issuer() ~= ISSUER then
    ios:close()
    return
  end

  syslog('info',"remote=%s",ios.__remote.addr)
  clients[ios.__co] = { ready = false }
  main()
  clients[ios.__co] = nil
  syslog('info',"remote=%s disconnecting",ios.__remote.addr)
  ios:close()
end
```
The core of the routine, the nested function main(), does the real work here. When main() starts, the flag for queue readiness is false. It then runs through its input queue, sending data to the client. Once that is done, it sets the queue readiness flag to true and then yields. Once it resumes, it sets the queue readiness flag back to false and (through a tail call) starts over again.
This ensures that logs are queued properly for delivery, and running the C test program again showed it works.