The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Thursday, June 05, 2025

Avoiding Roko's basilisk, part II

The other day I came across this comment on Lobsters:

On a personal level I have helped various people get value out of AI tools where they initially did not understand how to use it properly. But that setting is more of a 1:1 for a specific situation. For generic how to use agentic tools, there are so many articles already. Peter Steinberger has a multi hour talk online of him using an army of agents to write on his project.

If someone has a specific situation where they failed using an agent, ideally with some open source code, I would be happy to have a look at it. It’s just hard to engage on abstract “does not work for me” posts.

Comment on “AI Changes Everything”

I failed using an agent a few months ago. It was on an open source project of mine. Perhaps mitsuhiko would be happy to have a look at it. So I replied.

And mitsuhiko was happy to look at it.

Or rather, spend a few minutes telling his “coding agent” to look at the code and let it do its thing. So I took a look.

Development was done on a Mac, which doesn't have the vm86() system call, so his agent, “Claude,” started writing an 8086 emulator. Or I should say, an 80386 emulator since that's the most common architecture these days. It also came up with a few tests and once those tests were working, it stopped.

When I tried the code, attempting to run RACTER.EXE, it just sat there, turning my computer into a space heater. Looking a bit further, I saw there was an option for debug output (but the option appears at the end of the command line, not after the command itself, like every other command on Unix). Then I saw line after line of

...
Execute: 2010:0020: 8B
Unhandled opcode at 2010:0020: 8B
Execute: 2010:0021: EC
Unhandled opcode at 2010:0021: EC
Execute: 2010:0022: 81
Unhandled opcode at 2010:0022: 81
Execute: 2010:0023: EC
Unhandled opcode at 2010:0023: EC
Execute: 2010:0024: 02
Unhandled opcode at 2010:0024: 02
Execute: 2010:0025: 00
Unhandled opcode at 2010:0025: 00
Execute: 2010:0026: 9A
Unhandled opcode at 2010:0026: 9A
Execute: 2010:0027: C2
Unhandled opcode at 2010:0027: C2
Execute: 2010:0028: 10
Unhandled opcode at 2010:0028: 10
Execute: 2010:0029: 52
Unhandled opcode at 2010:0029: 52
Execute: 2010:002A: 24
Unhandled opcode at 2010:002A: 24
Execute: 2010:002B: 9A
Unhandled opcode at 2010:002B: 9A
Execute: 2010:002C: A2
Unhandled opcode at 2010:002C: A2
Execute: 2010:002D: 19
Unhandled opcode at 2010:002D: 19
Execute: 2010:002E: 52
Unhandled opcode at 2010:002E: 52
...

To say I was underwhelmed is an understatement.

The thread somewhat petered out.

I noticed today that mitsuhiko gave it another attempt. He put the whole thing into Docker so he could run under a Linux VM, and the code now could run enough of RACTER.EXE to display the banner:

[spc]lucy:/tmp/racter>/tmp/NaNoGenMo-2015/C/msdos RACTER.EXE



          .-----------------------------------------------------,
          |                                                     |
          |            A CONVERSATION WITH RACTER               |
          |                                                     |
          |       COPYRIGHTED BY INRAC CORPORATION, 1984        |
          | PORTIONS COPYRIGHTED BY MICROSOFT CORPORATION, 1982 |
          |                   ...........                       |
          `-----------------------------------------------------'




Hello, I'm Racter.  You are?  
>Sean
Sean

But that's it. It's still chugging along, turning my computer into a space heater. I'm still unimpressed.

This isn't to fault mitsuhiko. I'm sure he finds value in AI agents coding for him, but I think this was way out of his bailiwick, which is why he didn't bother to understand what I was trying to attempt. “Claude” got to the point of printing the banner from RACTER.EXE and stopped, because I think that's all it was instructed to do, besides attempting to buffer the input.

I'll close this out with the last few comments in the thread:

Sean
What type of programming do you do? Or rather, what type of programming do you have Claude do for you? Because I am still unconvinced it will be any benefit to the programming I do.
mitsuhiko
Right now I’m building a backend for a prototype of the next project I’m working on. That is a rather complex web application using both Python and Rust. Over the last year or so I used it quite a bit to extend minijinja (but that wasn’t agentic yet).
Sean
Ah, stuff that is definitely over-represented in the training sets. Gotcha.
mitsuhiko
Considering that I’m doing a very fringe thing I’m not so sure that this is a very accurate assessment :)
Sean
Python, Rust and web applications are over-represented in the training sets. The 6809, RACTER.EXE and ANS Forth aren’t. What you are doing might be novel, but the tech being used isn’t. The stuff I described isn’t novel (well, maybe having RACTER and Eliza chat, but I was riffing on an article written in the 80s about doing that) but using tech that (in my opinion) is novel (that is, not mainstream). There’s a difference.

I do appreciate the attempt though.

Update on Friday, June 6th, 2025 at 3:06 AM

One last comment from mitsuhiko in the thread: “I had excellent results with completely niche technology too. For as long as you have a way for the machine to validate it’s [sic] outputs it can even program in languages that you just invented.”

I think I'll have to keep this in mind for next time.

Wednesday, June 04, 2025

The basics of an indirect threaded code ANS Forth implementation

I need to get into the habit of writing prose more often.

Before I go into the implementation of my ANS Forth system, I need to define some things. First, a bit about Forth terminology—a Forth “word” can be thought of as a function or subroutine of other languages, but it's a bit more than that—it can also refer to a variable, a constant, or the equivalent of a keyword. It's a fluid concept, but thinking of a “word” as a function won't be too far off. Also, Forth words are collected into a wordset (or if you are reading older documentation about Forth, a “dictionary”). You can think of a “dictionary” as a library, or maybe a module, of code.

Second, Forth is a stack-based language. There are rarely any explicit parameters, just data placed on a stack, known as the “data stack.” Because of this, expressions are written in Reverse Polish Notation (RPN)—data is specified first, then the operator.

Third, there's a second stack, the “return stack” that is pretty much what it sounds like—it records the return address when calling into a Forth word.

Forth, the language, is ridiculously easy to parse—a token is just a collection of characters separated by space. So not only is my_variable_name a valid name for a Forth word, so are my-variable-name, #array, 23-skidoo and even 3.14. Yes, that means it's easy to do stupid things like define the token “1” to be 2 (you know, for when 1 equals two for large values of 1). But this also makes Forth trivial to parse. In fact, a Forth interpreter is nothing more than:

begin
	name = parse_token()
	word = lookup(name)
	if (word)
		execute(word)
	else
	{
		if (valid_number(name,current_base))
			push_number(convert(name,current_base))
		else
			error()
	}
repeat

That's it. That's a Forth interpreter more or less. Compiling Forth isn't much harder:

begin
	name = parse_token()
	word = lookup(name)
	if (word)
	{
		if (immediate(word))
			execute(word)
		else
			compile_into_definition(word)
	}
	else
	{
		if (valid_number(name,current_base))
			add_code_to_push(convert(name,current_base))
		else
			error()
	}
repeat

A Forth word can be marked as “immediate,” which just means the word is executed during compilation mode rather than being compiled, and this is how the Forth compiler can be extended. A Forth word like IF or BEGIN is just another Forth word, albeit marked as “immediate” so it can do its job when compiling.
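The two loops above can be folded into one small, runnable sketch. This is a toy in Python, not my 6809 code: the class names, the method names and the three arithmetic words are all made up for illustration, and it only handles integers. It does show the two points that matter: a name is looked up before it is tried as a number, and an immediate word (here, ;) runs even while compiling.

```python
class Word:
    def __init__(self, action, immediate=False):
        self.action = action          # callable taking the interpreter
        self.immediate = immediate    # run during compilation if True


class Forth:
    def __init__(self):
        self.stack = []               # data stack
        self.words = {}               # name -> Word
        self.compiling = False
        self.current = None           # (name, compiled actions) while compiling
        self.base = 10
        self.define("+", lambda f: f.stack.append(f.stack.pop() + f.stack.pop()))
        self.define("*", lambda f: f.stack.append(f.stack.pop() * f.stack.pop()))
        self.define("DUP", lambda f: f.stack.append(f.stack[-1]))
        self.define(":", Forth.start_definition)
        self.define(";", Forth.end_definition, immediate=True)

    def define(self, name, action, immediate=False):
        self.words[name] = Word(action, immediate)

    def start_definition(self):
        self.current = (next(self.tokens), [])   # ":" parses the new name
        self.compiling = True

    def end_definition(self):
        name, body = self.current

        def run_body(f, body=body):
            for action in body:
                action(f)

        self.define(name, run_body)
        self.compiling = False

    def to_number(self, token):
        try:
            return int(token, self.base)
        except ValueError:
            return None

    def run(self, text):
        self.tokens = iter(text.split())          # tokens are split on whitespace
        for name in self.tokens:
            word = self.words.get(name)
            if word is not None:
                if self.compiling and not word.immediate:
                    self.current[1].append(word.action)
                else:
                    word.action(self)
                continue
            number = self.to_number(name)
            if number is None:
                raise ValueError(f"undefined word: {name}")
            if self.compiling:
                self.current[1].append(lambda f, n=number: f.stack.append(n))
            else:
                self.stack.append(number)


forth = Forth()
forth.run(": SQUARE DUP * ; 7 SQUARE 2 +")
print(forth.stack)

silly = Forth()
silly.run(": 1 2 ; 1 1 +")    # "1" is now a word that pushes 2
print(silly.stack)
```

Because the lookup happens before the number conversion, the second example really does redefine “1”.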

The details about switching between interpreting and compiling will be covered in a later post, but it's not as difficult as it may seem.

And one bit about ANS Forth in particular—the standard defines a collection of “wordsets,” most of which are optional in an implementation. The minimum required is the Core Word set. The rest, including the Core Extension words, are optional. I did not implement all the wordsets, but again, more on that later.

Anyway, my ANS Forth system is a classic indirect threaded code (ITC) implementation. As I mentioned, it's easy to implement but perhaps not the fastest of implementation styles. As this is on the MC6809, I used a typical-for-the-6809 register use for my Forth system:

Register  Type                                Forth usage
D         16-bit accumulator                  Free for use (top of stack is kept in memory)
X         16-bit index register               execution token of word being run, free for use
Y         16-bit index register               Forth IP register
U         16-bit index register/user stack    data stack pointer
S         16-bit index register/system stack  return stack pointer

I elected to keep the top of stack in memory and not in the D register. I'm not sure if this affects the speed any, but it was easier, implementation wise, to keep the top of stack always in memory.

The Forth words in my implementation are either primitives, that is, written in assembly language, or secondaries, that is, written in Forth. Here's the format of a primitive word, +:

forth_core_plus                 ; ( n1|u1 n2|u2 -- n3|u3 )
                fdb     forth_core_star_slash_mod	; previous word in dictionary
                fdb     .xt - .name			; length of name
.name           fcc     "+"				; name
.xt             fdb     .body				; execution token of word
.body           ldd     ,u++				; body of word
                addd    ,u
                std     ,u
                ldx     ,y++
                jmp     [,x]

The first thing to notice is the label. Given how difficult naming is in Computer Science, I decided to use the canonical name of each Forth Word as defined by the standard. All the labels start with “forth_,” and are followed by the wordset (in this case, “core_”) followed by the actual Forth word. Yes, this makes for some long labels, but I don't have to think about naming, which is nice. The .xt label (which is a “local label”—this defines the full label of forth_core_plus.xt) defines the “execution token” (xt) which is defined as “a value that identifies the execution semantics of a definition.” Here, this is a pointer to a function that provides the execution semantics of the word and in this case, just points to the “body” of the Forth word, which adds the top two elements of the data stack.

Of particular note are the last two instructions:

	ldx	,y++
	jmp	[,x]

Get used to seeing this fragment, it's used a lot—this is used to execute the next word in the program. As stated above, the Y register is the Forth IP, and this loads the X register with the xt of the next word, then jumps to the code that handles this word. The [,X] bit informs the CPU that we are jumping through a function pointer and not directly to the code (if we did that, JMP ,X, that would turn this from “indirect threaded code” to “direct threaded code”—and yes, that is the only difference between an ITC and DTC implementation).

The overall format of a word is a link to the previous word in the dictionary (here to */MOD), followed by a 16-bit length, the text that makes up the word, followed by the xt and then the body of the definition. You might be wondering why I would use a full 16-bits for the length on an 8-bit CPU—wouldn't that waste space? Yes, it would, especially given that in this implementation, a Forth word is restricted to just 31 characters, but I need a way to mark some information about each word, like if it's an immediate word or not. And while an 8-bit length, where the largest value would be 31, would give me three bits for flag values, I ended up needing a few more than just three. So 16 bits for the length gives me a potential of 11 bits to use as flags.
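To make the bit counting concrete, here is a small Python sketch of a 16-bit length field that keeps the name length in the low five bits and leaves the upper eleven for flags. The flag names and bit positions are invented for illustration; they are not the values my assembler source uses.

```python
LENGTH_MASK = 0x001F          # low 5 bits: name length, 0..31
FLAG_MASK = 0xFFFF & ~LENGTH_MASK

# Hypothetical flag bits, chosen only for this example.
FLAG_IMMEDIATE = 0x8000
FLAG_NOINTERP = 0x4000


def pack_header(name, flags=0):
    """Combine flag bits and the name length into one 16-bit value."""
    if not 1 <= len(name) <= 31:
        raise ValueError("name must be 1 to 31 characters")
    if flags & ~FLAG_MASK:
        raise ValueError("flags must stay out of the length bits")
    return flags | len(name)


def name_length(header):
    return header & LENGTH_MASK


def has_flag(header, flag):
    return bool(header & flag)


header = pack_header("EXIT", FLAG_NOINTERP)
print(hex(header), name_length(header), has_flag(header, FLAG_IMMEDIATE))
print(bin(FLAG_MASK).count("1"), "bits left over for flags")
```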

A word written in Forth will have a different xt and body, for example, the very next word in the dictionary:

forth_core_plus_store           ; ( n|u a-addr -- )
                fdb     forth_core_plus
                fdb     .xt - .name
.name           fcc     "+!"
.xt             fdb     forth_core_colon.runtime
        ;===============================
        ; : +!  DUP @ ROT + SWAP ! ;
        ;===============================
                fdb     forth_core_dupe.xt
                fdb     forth_core_fetch.xt
                fdb     forth_core_rote.xt
                fdb     forth_core_plus.xt
                fdb     forth_core_swap.xt
                fdb     forth_core_store.xt
                fdb     forth_core_exit.xt

The xt here points to the label forth_core_colon.runtime. This is usually called DOCOLON but I'm being explicit here—Forth words defined in Forth are created by the Forth word :, and this label, forth_core_colon.runtime, implements the runtime portion of said word. The body of this word is then an array of execution tokens of various Forth words comprising the definition. Every Forth word defined this way ends with forth_core_exit.xt.

The : runtime function looks like:

forth_core_colon		; ( C: "<spaces>name" -- colon-sys ) E ( i*x -- j*x )
		fdb	forth_core_two_swap
		fdb	.xt - .name
.name		fcc	":"
.xt		fdb	.body
.body		...		; I'll get to this bit of the code in a later post

.runtime	pshs	y
		leay	2,x
		ldx	,y++
		jmp	[,x]

Here, the runtime will push the Forth IP (which is the Y register) onto the return stack, set the Y register to the body of the word being executed, and then run that two-instruction sequence to go to the next word to execute.

And the function EXIT looks like:

forth_core_exit			; E ( -- ) ( R: nest-sys -- )
		fdb	forth_core_execute
		fdb	_NOINTERP :: .xt - .name
.name		fcc	"EXIT"
.xt		fdb	.body
.body		puls	y	; restore the Forth IP
		ldx	,y++	; and execute next word
		jmp	[,x]

The first thing to notice here is the _NOINTERP flag. EXIT is defined as having no interpretation semantics, so typing EXIT outside of a word being defined is meaningless. Yes, this flag is used, and it does generate an error, but that, again, is for a later post. I should also mention that the :: here is a special operator in my assembler. The left hand side defines the most significant byte of a 16-bit quantity, and the right hand side defines the least significant byte. It's shorthand for _NOINTERP * 256 + (.xt - .name).

The second thing to notice is that the Forth IP (again, the Y register) is pulled from the return stack, and we yet again, run that two instruction sequence to run the next word.

And this is pretty much the entire execution engine of Forth.

In fact, I wrote this bit first, even before writing code to compile Forth (and yes, I hand-compiled all Forth code, so I had something to compare against when I eventually got to compiling).
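For those who don't read 6809 assembly, the same engine can be modeled in a few lines of Python. This is only a sketch under some loud assumptions: memory is a list of cells instead of bytes (so the body starts at xt + 1, not xt + 2), Python callables stand in for machine code addresses, and a driver loop calls next() repeatedly where the real primitives each end with the two-instruction sequence. All the names are mine for this example.

```python
class ITC:
    """Toy indirect threaded code machine; memory is a list of cells."""

    def __init__(self):
        self.mem = []          # cells hold xts (ints), data, or callables
        self.data = []         # data stack (the U register's job)
        self.ret = []          # return stack (the S register's job)
        self.ip = 0            # Forth IP (the Y register's job)
        self.x = 0             # xt of the word being run (the X register's job)
        self.running = False
        self.halt_xt = self.primitive(ITC.halt)

    def primitive(self, code):
        """Lay down a primitive: its code field holds the handler itself."""
        self.mem.append(code)
        return len(self.mem) - 1

    def colon(self, *xts):
        """Lay down a secondary: code field is docolon, body is a list of xts."""
        self.mem.append(ITC.docolon)
        self.mem.extend(xts)
        return len(self.mem) - 1 - len(xts)

    def docolon(self):
        self.ret.append(self.ip)    # pshs y
        self.ip = self.x + 1        # leay 2,x (one cell here, two bytes there)

    def halt(self):
        self.running = False

    def next(self):
        self.x = self.mem[self.ip]  # ldx ,y++
        self.ip += 1
        self.mem[self.x](self)      # jmp [,x]: go through the code field

    def execute(self, xt):
        start = len(self.mem)
        self.mem.extend([xt, self.halt_xt])   # tiny bootstrap thread
        self.ip = start
        self.running = True
        while self.running:
            self.next()


def dup(vm):   vm.data.append(vm.data[-1])
def fetch(vm): vm.data.append(vm.mem[vm.data.pop()])
def rot(vm):   vm.data.append(vm.data.pop(-3))
def plus(vm):  vm.data.append(vm.data.pop() + vm.data.pop())
def swap(vm):  vm.data[-1], vm.data[-2] = vm.data[-2], vm.data[-1]
def exit_(vm): vm.ip = vm.ret.pop()            # puls y


def store(vm):
    addr = vm.data.pop()
    vm.mem[addr] = vm.data.pop()


vm = ITC()
DUP, FETCH, ROT = vm.primitive(dup), vm.primitive(fetch), vm.primitive(rot)
PLUS, SWAP, STORE = vm.primitive(plus), vm.primitive(swap), vm.primitive(store)
EXIT = vm.primitive(exit_)

# : +!  DUP @ ROT + SWAP ! ;
PLUS_STORE = vm.colon(DUP, FETCH, ROT, PLUS, SWAP, STORE, EXIT)

vm.mem.append(40)                 # a one-cell variable
counter = len(vm.mem) - 1
vm.data.extend([2, counter])      # 2 counter +!
vm.execute(PLUS_STORE)
print(vm.mem[counter])
```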

So on the one hand, Forth is an easy language to implement and can be quite small. On the other hand, ANS Forth has some subtle semantics that make for some … interesting implementation details and isn't that easy to implement, as we shall see over the coming posts.



Monday, June 02, 2025

I've just implemented a Forth system

It started out innocently enough—I just wanted to know how to implement the Forth word DOES>. I ended up implementing ANS Forth for the 6809, as you do.

I've had a fascination with Forth since college. I have a copy of both the first edition and the second edition of Starting Forth. I have a copy of Thinking Forth. I also have a copy of Threaded Interpretive Languages with the Robert Tinney cover art. I even wrote my own Forth-like language in college, which I used for a class project and a few work-related programs.

But the one concept I couldn't figure out how to implement was the Forth word DOES>. As a user of Forth, using DOES> is pretty easy and one of those things you never think about, much like closures to a JavaScript programmer. But implementing it? That was a problem.

So in April I set out to do that, using the 6809 (because why not?) with the intent of figuring it out. I made a stab at it a few years ago, writing just enough of a Forth implementation in C and got it working. Barely. I wanted to do better.

And boy, did I.

I didn't intend on implementing ANS Forth, but once I “got” DOES> done (and I know the grammar on that doesn't quite work, but that's Forth for you) I had enough of a system up that sure—why not finish it?

It's a classical indirect threaded code Forth interpreter. They tend to be the easiest to write and the most compact. They aren't exactly the fastest though. And the 6809 is unique among the 8-bit CPUs in that Forth is a very good match for it. Forth requires two stacks, the 6809 has two stack pointers. There are two index registers, so one can be used as the Forth instruction pointer, and the other one is free for general use. It even has some limited 16-bit arithmetic operations. So it's a good match for Forth.

It just took a bit longer than expected. I ended up implementing 254 Forth words (in Forth, a “word” is like a function) out of the possible 435—I wanted a Forth independent of any existing operating system, so I skipped implementing a few wordsets (Forth jargon for “module” or “library”). The only routines that need to be provided are a character input routine, a character output routine, and a routine to return back to the operating system. I also didn't implement floating point, as that would take up a considerable amount of space (the IEEE-754 routines for the 6809 Motorola placed into the public domain clock in at 8K and all that gives you is addition, subtraction, multiplication, division and square roots—and the ANS Forth standard for floating point requires a lot more).

I also tried writing as much of it using ANS Forth standard words as possible, only it turned out to be less than I thought using the restrictions I placed upon myself (mainly—avoid using non-standard Forth words but more on that in a later post).

Then it took a while to get it to pass the test suite—the semantics of some Forth words are a bit trickier than I expected, and some words worked completely differently than how I expected them to work, but again, I'll be going into detail about that later. I also had to debug the test suite as it made some unwarranted assumptions about the Forth environment, mainly case-insensitivity (mine is case-sensitive, which is allowed by the Forth standard) and line length (mine is limited to 80 characters, again, the minimum allowed by the Forth standard). And the occasional outright bug (mostly typos).

Fun times.

Anyway, expect a flurry of posts about implementing an ANS Forth system.

Wednesday, April 16, 2025

Extreme automobiles, Ft. Lauderdale edition

Bunny and I stopped by Josef & Joseph to have them fix Bunny's watch (it turned out to just need a new battery). Inside is a large collection of clocks, watches and jewelry. Outside, however, was this!

[A silver car, longer than a limousine with an assortment of ornaments adorning the outside] “That's not a limousine, this is a limousine!”
[The frontend of a silver car with nine, count them, nine, headlights spanning the front grill] I don't think I'd want this barrelling towards me from behind at night!
[The other side of a three-times longer than normal silver car with three balloon-like things leaned up against it] This is so unusual of a car that three aliens have come down to examine it closely.
[Closeup detail of the car, sporting emblems from VW, BMW, and Cadillac, with a model air plane strapped to the top, all silver] I don't think those wings are capable of supporting flight. Probably for the best.

When I asked about the car inside, I was informed that indeed, it could drive, so it's a working car, made of a VW, BMW and Cadillac welded together and painted silver. And it was delivered as part of an Elvis Presley exhibit at Josef & Joseph. What Elvis has to do with a Mad-Max inspired limousine is beyond me. But it's there, and it's awesome.

It's also been there so long you can see it on Google's street view!

Thursday, April 03, 2025

God, I feel like I'm an old man wearing a tin foil hat yelling at the world

A few months ago our ice maker broke and the upshot—we got charged for a replacement unit that was never installed and we're using ice trays.

Last week, our dryer stopped working. It turned on, you can set the controls, but it fails to spin up and do any actual drying. We thought about maybe getting it repaired, but in all likelihood, by the time we pay for the repair man and the replacement widget that isn't working, we would probably be looking at the price of a new dryer anyway. And if it can't be fixed, then we're definitely out the price of a new dryer and the repair man.

So early this week we went out to one of the two remaining home improvement stores (and it wouldn't surprise me to find out they're both owned by the same shadow company—can you tell I'm getting cynical here?) and bought a replacement. Of course, we couldn't just buy a new dryer—no. We had to buy a new electrical cord (what? It doesn't come as part of the dryer?) and a new dryer hose (what? What's wrong with the one we already have? This stuff is pretty much standardized by now). Then there was the delivery fee, the installation fee, and the removal-of-the-old-unit fee. All that tacked an additional 25% onto the shelf price, not including taxes.

Maybe it would have been cheaper to get it repaired.

Anyway, today we received the unit. Two men unloaded the new dryer from the truck, brought it inside the garage, took the old dryer off the “pedestal” (which we purchased several years ago when we got a new washer and dryer—each are on a pedestal that acts as storage space), and then said they would NOT install the new dryer on the old pedestal. In fact, they insisted they COULD NOT install on the older pedestal, and then they left.

Well, thank you very much.

We called the home improvement store and they said it was company policy not to install a dryer on a pre-existing pedestal. They didn't say which company mandated this policy—the manufacturer of the dryer (which, at this point, I wouldn't be surprised if there was only one shadow company that owned all the appliance manufacturers) or the home improvement store. And of course, the removal of the pedestal would require an additional removal-of-the-old-unit fee, because it was considered a separate unit. It didn't matter that we had a perfectly good pedestal whose size matches the new dryer. And it didn't matter that the new dryer hose matched the size of the existing hose. Oh, and if we installed the dryer ourselves, we would immediately void the one-year warranty on the dryer.

It seems like our only real choices were to spend even more money on a new pedestal, more money for delivery, more money to remove the old pedestal, more money to install the new pedestal, and probably an additional fee to install the new dryer on said pedestal (even though we had already paid for installation), or we could elect to void the warranty and install the dryer on our own.

Well, XXXX me!

Bunny thinks I'm too cynical, but at this point, how do you not be cynical?

We ended up calling a friend over to help install the dryer on the old pedestal. It fit fine. So did the old hose. I'm not sure if it's even worth attempting to get the installation fee back—they might just void our warranty on the spot (I can hear Bunny calling out “Stop being so cynical!”). I'm glad we didn't spring for the three or five year warranties—the unit will probably last that long (and no longer, because then the same shadow company that owns all the appliance manufacturers gets another sale, and thus the line goes up, because XXXX you, think of the poor billionaires).

I hope it lasts longer. I don't want to be this cynical.

Wednesday, March 26, 2025

Notes on blocking spam by filtering on ASN

So now that I can classify IP addresses by ASN, I thought I might see how it could help with spam email. I'm already using an anti-spam agent to cut down on spam, so maybe filtering by ASN could cut down even more. The last time I looked into additional means of spam avoidance, the use of SPF wasn't worth the effort.

And I'm afraid blocking via ASN won't be worth the effort either. Looking over email attempts over the past month, here are the top 10 networks that sent email to my server, out of 5,181 individual emails:

Top 10 emailers to my server
AS                           Count
IOMART-AS                      375
IDNIC-IDCLOUDHOST-AS-ID        369
PAIR-NETWORKS                  263
MICROSOFT-CORP-MSN-AS-BLOCK    246
AS-COLOCROSSING                152
EMERALD-ONION                  124
GOOGLE                         122
SPARKPOST                      120
TZULO                          112
AMAZON-02                      106

Unlike the web (or even Gemini or gopher) there isn't one dominant network here—it's all spread out. I don't think it's really worth the effort to block via ASN for spam. At least for my email server.
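A quick bit of arithmetic backs up the “all spread out” claim. This Python sketch just sums the counts from the table above and compares them with the 5,181 total; nothing in it comes from anywhere else.

```python
# Counts copied from the "Top 10 emailers" table.
counts = {
    "IOMART-AS": 375,
    "IDNIC-IDCLOUDHOST-AS-ID": 369,
    "PAIR-NETWORKS": 263,
    "MICROSOFT-CORP-MSN-AS-BLOCK": 246,
    "AS-COLOCROSSING": 152,
    "EMERALD-ONION": 124,
    "GOOGLE": 122,
    "SPARKPOST": 120,
    "TZULO": 112,
    "AMAZON-02": 106,
}
total_emails = 5181

top10 = sum(counts.values())
top10_share = 100 * top10 / total_emails
largest_share = 100 * max(counts.values()) / total_emails

print(f"top 10 networks: {top10} emails, {top10_share:.1f}% of the total")
print(f"largest single network: {largest_share:.1f}% of the total")
```

So even blocking all ten networks outright would leave well over half of the email untouched, and the single largest network accounts for less than a tenth.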



Friday, March 21, 2025

Now a bit about feed readers

There are a few bots acting less than optimally that aren't some LLM-based company scraping my site. I think. Anyway, the first one I mentioned:

Identifiers for 8.29.198.26
Agent                                                           Requests
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; )      1667
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; )       1419
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; )    938
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; )   811
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; )     94
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; )    17

Identifiers for 8.29.198.25
Agent                                                           Requests
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; )      1579
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; )       1481
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; )    905
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; )   741
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; )     90
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; )    11

This is feedly, a company that offers a news reader (and I'd like to thank the 67 subscribers I have—thank you). The first issue I have about this client is the apparent redundant requests from six different clients. This is an issue because I only have three different feeds: the Atom feed, the RSS feed and the JSON feed. The poller seems to be acting correctly—16 subscribers to my Atom feed and 6 to the RSS feed. The other four? The fetchers? I'm not sure what's going on there. There's one for the RSS feed, and three for the Atom feed. And one of them is a typo—it's requesting “//index.atom” instead of the proper “/index.atom” (but apparently Apache allows it). How do I have 16 subscribers to “/index.atom” and another 37 for “/index.atom”? What exactly, is the difference between the two? And can't you fix the “//index.atom” reference? To me, that's an obvious typo, one that could be verified by retrieving both “/index.atom” and “//index.atom” and seeing they're the same.

Anyway, the second issue I have with feedly is their apparent lack of caching on their end. They do not do a conditional request and while they aren't exactly slamming my server, they are making multiple requests per hour, and for a resource that doesn't change all that often (excluding today that is).
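For anyone writing a feed fetcher, a conditional request is not much work. Here is a minimal Python sketch using only the standard library. The function names are mine, and it assumes the server sends an ETag or Last-Modified header the client can save between polls; I'm not claiming this is how any particular reader works.

```python
import gzip
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved on the previous fetch."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if nothing changed."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            if response.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            return (
                body,
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:          # Not Modified: keep the cached copy
            return None, etag, last_modified
        raise


print(conditional_headers(etag='"abc123"'))
```

A poller would keep the returned validators next to its cached copy of the feed and pass them back in on the next poll, so an unchanged feed costs a 304 and a handful of bytes.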

Then there's the bot at IP address 4.231.104.62. It made 43,236 requests to get “/index.atom”, 5 invalid requests in the form of “/gopher://gopher.conman.org/0Phlog:2025/02/…” and one other valid request for this page. It's not the 5 invalid requests or the 1 valid request that has me weirded out—it's the 43,236 to my Atom feed. That's one request every 55 seconds! And even worse—it's not a conditional request! Of all the bots, this is the one I feel most like blocking at the firewall level—just have it drop the packets entirely.
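The per-request interval is easy to check. This sketch assumes the 43,236 requests were spread evenly over the 28 days of February 2025, which is the month the rest of these numbers cover.

```python
requests = 43_236
seconds_in_february = 28 * 24 * 60 * 60   # 2025 is not a leap year

interval = seconds_in_february / requests
print(f"one request every {interval:.2f} seconds")
print(f"about {requests / 28:.0f} requests per day")
```

That works out to just under 56 seconds between requests, for a feed that changes a few times a month at best.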

At least it supports compressed results.

Sheesh.

As for the rest—of the 109 bots that fetched the Atom feed at least once per day (I put the cut off at 28 requests or more during February), only 31 did so conditionally. That's a horrible rate. And of the 31 that did so conditionally, most don't support compression. So on the one hand, the majority of bots that fetch the Atom feed do so compressed. On the other hand, it appears that most of the bots that do fetch conditionally don't support compression.

Sigh.


Still no information on who “The Knowledge AI” is or was

Back in July 2019 I was investigating some bad bots on my website when I came across the bot that identified itself simply as “The Knowledge AI” that was the number one robot hitting my site. Most bots that identify themselves will give a URL to a page that describes their usage like Barkrowler (to pick one that recently crawled my site). But not so “The Knowledge AI”. That was all it said, “The Knowledge AI”. It was very hard to Google, but I wouldn’t be surprised if it was OpenAI.

The earliest I can find “The Knowledge AI” crawling my site was April of 2018, and despite starting on April 16th, it was the second most active robot that month. In May it was the number one bot, and it stayed there through October of 2022, after which it pretty much dropped—from 32,000+ in October of 2022 to 85 in November of 2022 (about 4½ years). It was sporadic, showing up in single digit hits until January of 2024. It may be still crawling my site, but if it is, it is no longer identifying itself.

I don’t know if “The Knowledge AI” was an LLM company crawling, but if it was, not giving a link to explain the bot is suspicious. It’s the rare crawler that doesn’t identify itself with at least a URL to describe it. The fact that it took the number one crawling spot on my site for 4 ½ years is suspicious. As robots go, it didn’t affect the web server all that much (I’ve come across worse ones), and well over 90% of its requests were valid (unlike MJ12, which had a 75% failure rate). And my /robots.txt file doesn’t exclude any robot from scanning, so I can’t really complain about it.

My comment on “Mitigating SourceHut's partial outage caused by aggressive crawlers | Lobsters”

Even though the log data is a few years old, I don't think that IPs change from ASN to ASN all that much (but I could be wrong on that). I checked the IPs used by “The Knowledge AI” in May 2018, and in October 2022, and they didn't change that much. They were still the same /24 networks across that time.

Looking up the information today is very disappointing—Hurricane Electric LLC., a backbone provider.

So no real information about who “The Knowledge AI” might have been.

Sigh.


A deeper dive into mapping web requests via ASN, not by IP address

I went ahead and replaced IP addresses with ASNs in the log file to find the network that sent the most requests to my blog for the month of February.

Top 10 networks requesting a page from blog
Network                                             Requests
MICROSOFT-CORP-MSN-AS-BLOCK, US                        78889
OVH, FR                                                31837
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN     25019
HETZNER-AS, DE                                         23840
GOOGLE-CLOUD-PLATFORM, US                              21431
CSTL, US                                               17225
HURRICANE, US                                          15495
AMAZON-AES, US                                         14430
FACEBOOK, US                                           13736
AKAMAI-LINODE-AP Akamai Connected Cloud, SG            12673
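The general shape of that replacement step looks something like the following Python sketch. It is not the tool I actually used: the prefix table here is three made-up entries over documentation address ranges with placeholder network names, and it assumes each log line starts with the client IP address. A real table would come from routing data and hold far more prefixes.

```python
import ipaddress
from collections import Counter

# Hypothetical prefix-to-network table; names and ranges are placeholders.
PREFIXES = [
    (ipaddress.ip_network("192.0.2.0/24"), "EXAMPLE-NET-A"),
    (ipaddress.ip_network("198.51.100.0/24"), "EXAMPLE-NET-B"),
    (ipaddress.ip_network("198.51.100.128/25"), "EXAMPLE-NET-C"),
]


def asn_for(ip):
    """Return the network name for the longest matching prefix."""
    addr = ipaddress.ip_address(ip)
    matches = [(net.prefixlen, name) for net, name in PREFIXES if addr in net]
    return max(matches)[1] if matches else "UNKNOWN"


def requests_per_network(log_lines):
    """Count requests per network, most active first."""
    counts = Counter()
    for line in log_lines:
        ip = line.split(maxsplit=1)[0]    # first field is the client address
        counts[asn_for(ip)] += 1
    return counts.most_common()


log = [
    '192.0.2.7 - - "GET /index.atom HTTP/1.1" 200',
    '198.51.100.200 - - "GET / HTTP/1.1" 200',
    '198.51.100.5 - - "GET /index.rss HTTP/1.1" 200',
    '192.0.2.9 - - "GET /index.atom HTTP/1.1" 304',
    '203.0.113.1 - - "GET /robots.txt HTTP/1.1" 200',
]
for name, count in requests_per_network(log):
    print(f"{name:15} {count}")
```

The linear scan over the prefix table is fine for a sketch; anything working against a full routing table would want a trie or a sorted lookup instead.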

Even though Alibaba US has the most unique IPs hitting my blog, Microsoft is still the network making the most requests. So let's see how Microsoft presents itself to my web server. Here are the user agents it sends:

Web agents from the Microsoft Network
agent requests
Go-http-client/2.0 43236
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) 23978
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36 7953
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 2955
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot 210
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot 161
DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html) 123
'DuckDuckBot-Https/1.1; (+https://duckduckgo.com/duckduckbot)' 122
Python/3.9 aiohttp/3.10.6 28
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.36 Safari/537.36 14
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.114 Safari/537.36 14
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.68 10
DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html) 10
DuckAssistBot/1.1; (+http://duckduckgo.com/duckassistbot.html) 10
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 6
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.143 Safari/537.36 6
python-requests/2.32.3 5
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.142 Safari/537.36 5
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 4
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0 4
DuckDuckBot-Https/1.1; (+https://duckduckgo.com/duckduckbot) 4
Twingly Recon 3
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot) 3
Mozilla/5.0 (compatible; Twingly Recon; twingly.com) 3
python-requests/2.28.2 2
newspaper/0.9.1 2
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 2
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b 2
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36 2
http.rb/5.1.1 (Mastodon/4.2.10; +https://trystero.social/) Bot 1
http.rb/5.1.1 (Mastodon/4.2.10; +https://trystero.social/) 1
Mozilla/5.0 (Windows NT 6.1; WOW64) SkypeUriPreview Preview/0.5 skype-url-preview@microsoft.com 1
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 1
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36 1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48 1
Mastodon/4.4.0-alpha.2 (http.rb/5.2.0; +https://sns.mszpro.com/) Bot 1
Mastodon/4.4.0-alpha.2 (http.rb/5.2.0; +https://sns.mszpro.com/) 1
Mastodon/4.3.3 (http.rb/5.2.0; +https://the.voiceover.bar/) Bot 1
Mastodon/4.3.3 (http.rb/5.2.0; +https://the.voiceover.bar/) 1
Mastodon/4.3.3 (http.rb/5.2.0; +https://discuss.systems/) Bot 1
Mastodon/4.3.3 (http.rb/5.2.0; +https://discuss.systems/) 1

The top result comes from a single IP address and probably requires a separate post about it, since it's weird and annoying. But the rest—you got Bing, you got OpenAI, you got several Mastodon instances—it seems like most of these are from Microsoft's cloud offering. A mixture of things.

What about Facebook?

Web agents from Facebook
agent requests
meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) 13497
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) 207
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 12
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 4
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 4
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 4
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/59.0 4
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 Edg/132.0.0.0 2
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 2

Hmm … looks like I have a few readers at Facebook, but other than that, nothing terribly interesting.

Alibaba, on the other hand, is frightening. Out of 25,019 requests, it presented 581 different user agents. From looking at what was requested, I don't think it's 500 Chinese people reading my blog; it's definitely bots crawling my site (and amusingly, there are requests for the /robots.txt file, but without a proper user agent to go by, it's hard to block them via that file).
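
A count like "581 different user agents" can be produced by collecting the distinct agent strings seen per network. What follows is only a minimal sketch, assuming the log has already been reduced to (AS name, user agent) pairs; the function name `agents_per_as` and the sample records are illustrative, not the query actually used for this post.

```python
from collections import defaultdict

def agents_per_as(records):
    """Return the number of distinct user agent strings seen per AS name.

    `records` is an iterable of (as_name, user_agent) pairs.
    """
    agents = defaultdict(set)
    for as_name, agent in records:
        agents[as_name].add(agent)
    return {name: len(found) for name, found in agents.items()}

# Illustrative records; the agent strings are taken from the tables in this post.
SAMPLE_RECORDS = [
    ("FACEBOOK, US", "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"),
    ("FACEBOOK, US", "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"),
    ("FACEBOOK, US", "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"),
    ("CSTL, US", "Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com)"),
]

if __name__ == "__main__":
    for name, count in agents_per_as(SAMPLE_RECORDS).items():
        print(name, count)
```

A network with hundreds of distinct agents across a modest number of requests stands out immediately in this kind of summary.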

I can think of one conclusion here: filtering by ASN can help tremendously, but it also risks blocking legitimate traffic.



A different approach to blocking bad webbots by IP address

Web crawlers for LLM-based companies, as well as some specific solutions to blocking them, have been making the rounds in the past few days. I was curious to see just how many were hitting my web site, so I ran a few queries over the log files. To ensure consistent results, I decided to query the log file for last month:

Quick summary of results for February 2025
total requests 468439
unique IPs 24654

Top 10 requests per IP
IP Requests
4.231.104.62 43242
198.100.155.33 26650
66.55.200.246 9057
74.80.208.170 8631
74.80.208.59 8407
216.244.66.239 5998
4.227.36.126 5832
20.171.207.130 5817
8.29.198.26 4946
8.29.198.25 4807
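
For anyone wanting to run a similar query, here is a minimal sketch of counting requests per IP. It assumes the log is in the Apache "combined" format, which is an assumption; the actual log layout and query tool used here are not shown. The sample lines use documentation addresses and a made-up agent name.

```python
import re
from collections import Counter

# Assumed Apache "combined" log format; the real layout may differ.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def requests_per_ip(lines):
    """Count requests per client IP, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group("ip")] += 1
    return counts

# Hypothetical sample lines (documentation IP ranges, invented agent).
SAMPLE_LINES = [
    '192.0.2.10 - - [01/Feb/2025:00:00:01 -0500] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [01/Feb/2025:00:00:02 -0500] "GET /robots.txt HTTP/1.1" 200 64 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [01/Feb/2025:00:00:03 -0500] "GET /2000/08/01 HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
    'this line is not a log entry',
]

if __name__ == "__main__":
    counts = requests_per_ip(SAMPLE_LINES)
    print("total requests", sum(counts.values()))
    print("unique IPs", len(counts))
    for ip, n in counts.most_common(10):
        print(ip, n)
```

`Counter.most_common(10)` gives the top-10 list directly, and the length of the counter is the number of unique IPs.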

(Note: I'm not concerned about protecting any privacy here. Given the number of results, there is no way any of these is an individual. These are all companies hitting my site, and if companies are mining their data for my information, I'm going to do the same to them. So there.)

But it became apparent that it's hard to determine which requests are coming from a single entity—it's clear that a company can employ a large pool of IP addresses to crawl the web, and it's hard to figure out what IPs are under control of which company.

Or is it?

An idea suddenly hit me, a stray thought from the days when I was wearing a network admin hat: I recalled that BGP routing basically knows the boundaries between networks, as it's based on policy routing via ASNs (Autonomous System Numbers). I wondered if I could map IP addresses to ASNs. A quick search and I found my answer: yes! Within a few minutes, I had converted a list of 24,654 unique IP addresses to 1,490 unique networks. I was then able to rework my initial query to include the ASN (or rather, the human-readable version instead of just the number):

Requests per IP/ASN
IP Requests AS
4.231.104.62 43242 MICROSOFT-CORP-MSN-AS-BLOCK, US
198.100.155.33 26650 OVH, FR
66.55.200.246 9057 BIDDEFORD1, US
74.80.208.170 8631 CSTL, US
74.80.208.59 8407 CSTL, US
216.244.66.239 5998 WOW, US
4.227.36.126 5832 MICROSOFT-CORP-MSN-AS-BLOCK, US
20.171.207.130 5817 MICROSOFT-CORP-MSN-AS-BLOCK, US
8.29.198.26 4946 FEEDLY-DEVHD, US
8.29.198.25 4807 FEEDLY-DEVHD, US
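
The mapping itself boils down to finding the most specific announced prefix that contains an address. The sketch below illustrates that idea with Python's `ipaddress` module. The prefix table is entirely hypothetical (documentation address ranges and private-use AS numbers); a real table would come from BGP data or a lookup service, and this is not necessarily how the service used here works internally.

```python
import ipaddress

# Hypothetical prefix-to-ASN table: documentation prefixes, private-use ASNs.
PREFIX_TO_ASN = {
    "198.51.100.0/24": 64512,
    "198.51.100.128/25": 64513,  # more specific prefix inside the /24 above
    "203.0.113.0/24": 64514,
}

def build_table(prefix_to_asn):
    """Parse the prefixes once and sort them longest prefix first."""
    table = [(ipaddress.ip_network(p), asn) for p, asn in prefix_to_asn.items()]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table

def asn_for(ip, table):
    """Return the ASN of the most specific prefix containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for network, asn in table:
        if addr.version == network.version and addr in network:
            return asn
    return None

if __name__ == "__main__":
    table = build_table(PREFIX_TO_ASN)
    for ip in ("198.51.100.5", "198.51.100.200", "192.0.2.1"):
        print(ip, asn_for(ip, table))
```

A linear scan like this is fine for a handful of prefixes; against a full routing table, a radix tree or a lookup service would be the more sensible choice.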

Now, I was curious as to how they identified themselves, so I reran the query to include the user agent string. The top eight identified themselves consistently:

Requests per Agent
Agent Requests
Go-http-client/2.0 43236
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/132.0.0.0 Safari/537.36 26650
WF search/Nutch-1.12 9057
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) 8631
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) 8407
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) 5998
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) 5832
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) 5817

The last two, however, had a changing user agent string:

Identifiers for 8.29.198.26
Agent Requests
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) 1667
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) 1419
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) 938
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) 811
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) 94
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) 17

Identifiers for 8.29.198.25
Agent Requests
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; ) 1579
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; ) 1481
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; ) 905
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; ) 741
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; ) 90
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; ) 11

I'm not sure what the difference is between polling and fetching (checking the URLs shows two identical pages, only differing in “Poller” and “Fetcher”). But looking deeper into that is for another post.

The next request I did was to see how many IPs (that hit my site in February) map to a particular ASN, and the top 10 are:

IPs per AS
AS Count
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN 4034
AMAZON-02, US 1733
HWCLOUDS-AS-AP HUAWEI CLOUDS, HK 1527
GOOGLE-CLOUD-PLATFORM, US 996
COMCAST-7922, US 895
AMAZON-AES, US 719
TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN 635
MICROSOFT-CORP-MSN-AS-BLOCK, US 615
AS-VULTR, US 599
ATT-INTERNET4, US 472

So Alibaba US crawled my site from 4,034 different IP addresses. I haven't done the query to figure out how many requests each ASN made, but it should be straightforward to replace the IP address with the ASN to get a better count of which company is crawling my site the hardest.
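
Replacing the IP address with the AS name before counting might look like the sketch below. It assumes an IP-to-AS-name dictionary has already been built; the function name `summarize_by_as` is illustrative. The sample addresses and AS names come from the tables above, but the request list is shortened for illustration and does not reflect the real counts.

```python
from collections import Counter, defaultdict

def summarize_by_as(request_ips, ip_to_as):
    """Return (requests per AS, unique IPs per AS) as two Counters.

    `request_ips` has one entry per request; `ip_to_as` maps IP to AS name.
    """
    requests = Counter()
    unique_ips = defaultdict(set)
    for ip in request_ips:
        name = ip_to_as.get(ip, "UNKNOWN")
        requests[name] += 1
        unique_ips[name].add(ip)
    ip_counts = Counter({name: len(ips) for name, ips in unique_ips.items()})
    return requests, ip_counts

# AS names as listed in the tables above.
IP_TO_AS = {
    "4.231.104.62": "MICROSOFT-CORP-MSN-AS-BLOCK, US",
    "74.80.208.170": "CSTL, US",
    "74.80.208.59": "CSTL, US",
}

# Illustrative request list, one IP per request (not the real counts).
REQUEST_IPS = ["74.80.208.170", "74.80.208.59", "74.80.208.170", "4.231.104.62"]

if __name__ == "__main__":
    requests, ip_counts = summarize_by_as(REQUEST_IPS, IP_TO_AS)
    for name, n in requests.most_common():
        print(name, n, "requests from", ip_counts[name], "IPs")
```

Keeping both counters matters because the two rankings can differ: a network can have the most addresses without making the most requests.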

And now I'm wondering whether, instead of ad-hoc banning of single IP addresses or blocking huge swaths of IP addresses (like 47.0.0.0/8), it might be better to block per ASN. The IP-to-ASN mapping service I found makes it quite easy to get the ASN of an IP address (and to map the ASN to a human-readable name). Instead of, for example, blocking 101.32.0.0/16, 119.28.0.0/16, 43.128.0.0/14, 43.153.0.0/16 and 49.51.0.0/16 (which isn't an exhaustive list by any means), just block IPs belonging to ASN 132203, otherwise known as “TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN.”

I don't know how effective that idea is, but the IP-to-ASN site I found does offer the information via DNS, so it shouldn't be that hard to do.
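
As a rough sketch of what a DNS-based check could look like: the code below assumes a Team Cymru style service, where the IPv4 octets are reversed under the `origin.asn.cymru.com` zone and the TXT answer has the form `ASN | prefix | country | registry | date`. Whether that is the service used here is an assumption, as the post does not name it. The actual TXT lookup is left out (the Python standard library has no TXT resolver), so the sketch only builds the query name and interprets an answer string. The sample answer uses the ASN and a prefix from the text above with placeholder values for the remaining fields.

```python
import ipaddress

ORIGIN_ZONE = "origin.asn.cymru.com"  # assumed service, not named in the post
BLOCKED_ASNS = {132203}               # the Tencent ASN mentioned above

def origin_query_name(ip):
    """Build the TXT query name for an IPv4 address (octets reversed)."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{ORIGIN_ZONE}"

def parse_origin_txt(txt):
    """Parse 'ASN | prefix | country | registry | date' (assumed layout)."""
    fields = [field.strip() for field in txt.strip('"').split("|")]
    # The first field may list several ASNs separated by spaces.
    asns = [int(asn) for asn in fields[0].split()]
    return {"asns": asns, "prefix": fields[1]}

def should_block(txt, blocked=BLOCKED_ASNS):
    """True if any origin ASN in the TXT answer is on the block list."""
    return any(asn in blocked for asn in parse_origin_txt(txt)["asns"])

# Placeholder answer: only the ASN and prefix come from the post.
SAMPLE_ANSWER = "132203 | 43.153.0.0/16 | ZZ | example | 1970-01-01"

if __name__ == "__main__":
    print(origin_query_name("43.153.1.2"))
    print(should_block(SAMPLE_ANSWER))
```

Doing a DNS lookup per request would likely be too slow for a busy server, so caching the answer per IP (or per returned prefix) would be a natural next step.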


Obligatory Picture

One is never too old for sparkly Bunster glasses! Never!

Obligatory AI Disclaimer

No AI was used in the making of this site, unless otherwise noted.

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2025 by Sean Conner. All Rights Reserved.