Saturday, June 01, 2024
Falettinme Be Mice Elf Ag
Bunny and I were in my car when this song from Sly and the Family Stone came on. On the screen was the title, “Thank You (Falettinme Be Mice Elf Ag)”. Bunny attempted to read it and was having a hard time with it; I suggested that perhaps XM Radio had used some form of audio transcription that went wrong somewhere, because Sly and the Family were clearly singing “thank you for letting me be myself.”
I was about to make a satirical post about it, but when I was looking up the song … well … that's the actual name! Okay, it's actually “Thank You (Falettinme Be Mice Elf Agin)” (the screen wasn't wide enough for the entire title). But still … that's the actual name of the song! Sheesh!
Monday, June 03, 2024
Just a simple matter of replacing a slow Lua function with a faster C function
I spent the past few days rewriting some Lua code into C. While I find LPEG to be convenient, it is not necessarily fast. Normally this isn't an issue but in this case, I was calling LPEG for each character in a blog post.
Fortunately,
it was fairly straightforward to port the code to C.
The code goes through the text one codepoint at a time.
If it's a whitespace character or a hyphen,
I mark the current position as a possible breakpoint for the text;
otherwise I ignore combining characters
(they don't count towards the line length).
Then,
when I reach past the number of characters I want for a line,
I copy out the string from the beginning of the “line” to the marked breakpoint
(and if there isn't a breakpoint,
there is no good place to break the line so I will break the line at the line length—not much else to do),
then mark the beginning of the next line and continue until the end of the text.
The hardest part was figuring out how to classify each character. In the end, I pull each Unicode codepoint out of the UTF-8 text and look through an array to classify the codepoint as whitespace, a hyphen or a combining character; if it isn't in the table, it's just a normal character.
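In case it helps to see the shape of that loop, here's a sketch in C. This is hypothetical code with a toy classification table, not the actual org.conman.string.wrapt implementation, and it assumes the input is valid UTF-8.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CH_NORMAL , CH_SPACE , CH_HYPHEN , CH_COMBINE } chclass__t;

/* tiny stand-in for the real classification table */
static chclass__t classify(uint32_t cp)
{
  if ((cp == ' ') || (cp == '\t'))      return CH_SPACE;
  if ((cp == '-') || (cp == 0x2010))    return CH_HYPHEN;  /* '-' and U+2010 HYPHEN     */
  if ((cp >= 0x0300) && (cp <= 0x036F)) return CH_COMBINE; /* combining diacritic block */
  return CH_NORMAL;
}

/* decode one codepoint of (assumed valid) UTF-8; return bytes consumed */
static size_t utf8_decode(const char *s,uint32_t *cp)
{
  const unsigned char *p = (const unsigned char *)s;
  
  if (p[0] < 0x80) { *cp = p[0]; return 1; }
  if (p[0] < 0xE0) { *cp = (uint32_t)(p[0] & 0x1F) <<  6 | (p[1] & 0x3F); return 2; }
  if (p[0] < 0xF0) { *cp = (uint32_t)(p[0] & 0x0F) << 12 | (uint32_t)(p[1] & 0x3F) << 6 | (p[2] & 0x3F); return 3; }
  *cp = (uint32_t)(p[0] & 0x07) << 18 | (uint32_t)(p[1] & 0x3F) << 12 | (uint32_t)(p[2] & 0x3F) << 6 | (p[3] & 0x3F);
  return 4;
}

static void wrap(const char *text,size_t width)
{
  size_t front = 0; /* byte offset where the current line starts       */
  size_t brk   = 0; /* byte offset just past the last breakpoint       */
  int    haveb = 0; /* did we see a breakpoint on this line?           */
  size_t len   = 0; /* characters on this line (combining don't count) */
  size_t i     = 0; /* current byte offset                             */
  
  while (text[i] != '\0')
  {
    uint32_t cp;
    size_t   n = utf8_decode(&text[i],&cp);
    
    switch(classify(cp))
    {
      case CH_SPACE:
      case CH_HYPHEN:  brk = i + n; haveb = 1; len++; break; /* possible breakpoint */
      case CH_COMBINE:                                break; /* zero width          */
      case CH_NORMAL:                          len++; break;
    }
    
    i += n;
    
    if (len > width)
    {
      /* no breakpoint?  nothing better to do than break at the limit */
      size_t end = haveb ? brk : i;
      printf("%.*s\n",(int)(end - front),&text[front]);
      front = end;
      i     = end; /* rescan anything past the breakpoint */
      haveb = 0;
      len   = 0;
    }
  }
  
  if (i > front)
    printf("%.*s\n",(int)(i - front),&text[front]);
}

int main(void)
{
  wrap("A line of text that is long enough to need wrapping somewhere.",20);
  return 0;
}
```

The actual function hands the broken lines back to Lua rather than printing them, and its classification table covers far more of Unicode than this stand-in does.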
As a sanity check, I reran the original profiling test:
service | lines of Lua code executed |
---|---|
gopher (original) | 457035 |
gopher (new) | 18246 |
gemini (just because) | 22661 |
Much better.
And most of the 457,035 lines of code being executed are now hidden behind C.
Now to make sure the code is actually faster,
I profiled the new wrapt() function:
```lua
local wraptx = wrapt
local function wrapt(...)
  local start = rdtsc()
  local res   = wraptx(...)
  local stop  = rdtsc()
  syslog('notice',"wrapt()=%d",stop-start)
  return res
end
```
with the decently sized request I used before (each line is a call to wrapt()):
Lua code | C code |
---|---|
43330 | 11810 |
43440 | 12000 |
45300 | 12220 |
48100 | 12020 |
48680 | 13690 |
49260 | 12650 |
54140 | 12270 |
54650 | 12460 |
58530 | 12130 |
59760 | 14180 |
61100 | 15480 |
65440 | 14970 |
67920 | 15810 |
68750 | 15310 |
69920 | 17170 |
69960 | 17780 |
70740 | 16510 |
75640 | 16750 |
78870 | 19170 |
83200 | 18190 |
87090 | 17290 |
89070 | 23360 |
91440 | 19560 |
101800 | 21520 |
102460 | 21060 |
103790 | 22180 |
106000 | 22400 |
106010 | 21870 |
112960 | 21160 |
115300 | 21870 |
115980 | 23130 |
118690 | 24980 |
122550 | 23960 |
122710 | 24550 |
127610 | 23830 |
129580 | 24670 |
130120 | 24930 |
140580 | 26570 |
141930 | 25210 |
157640 | 27050 |
168000 | 32250 |
Excellent! The new code is three to five times faster. Now to just sit back and see how the new code fares over the next few days.
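As an aside, the rdtsc() in that profiling wrapper is a small C binding to the CPU's time-stamp counter. A minimal sketch of such a binding, assuming x86 and GCC's __rdtsc() intrinsic (this is not the module I actually use), might look like:

```c
/* Minimal Lua binding for the x86 time-stamp counter.  A sketch, not
 * the actual module; assumes GCC or Clang on x86/x86-64. */
#include <x86intrin.h>
#include <lua.h>

static int Lrdtsc(lua_State *L)
{
  lua_pushinteger(L,(lua_Integer)__rdtsc()); /* raw cycle count */
  return 1;
}

int luaopen_rdtsc(lua_State *L)
{
  lua_pushcfunction(L,Lrdtsc);
  return 1;
}
```

Loaded with something like local rdtsc = require "rdtsc", it's cheap enough to call on every wrapt() invocation.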
Wednesday, June 05, 2024
I wonder who isn't getting the point
Yes, there are a few groups I follow on Discord. One of the groups has a link to another site I want to view and I click on it. I get:
Leaving Discord
This link is taking you to the following website
https://github.com/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Trust github.com links from now on
God, I just love XXXXXXX walled gardens! I just love how it's there to protect me. You mean … a link … goes to a non-Discord site? Oh my God! Thank you for warning me! My computer might have been pwned by Microsoft!
Sheesh.
Thursday, June 06, 2024
Stats on some optimized small internet servers
I restarted my Gopher and Gemini servers about 25 hours ago to ensure they're both running the new optimized code, so now let's check some results:
service | CPU utilization | requests | bytes |
---|---|---|---|
gopher | 2:42 | 935 | 1186591 |
gemini | 8:58 | 36394 | 249020812 |
That's more like it. The gopher server, running over plain TCP and getting about 1/40th the requests, is finally accumulating less CPU time than the Gemini server (and the only reason it's not even lower is that my gopher server has to deal with wrapping Unicode text).
I also find it amusing that Gemini, a protocol that has only been around for five years, is way more popular than gopher, a protocol that's been around for thirty-three years. I guess Gemini being the new shiny is the reason.
Wednesday, June 12, 2024
Just when I thought it was safe to run a gopher server
It appeared I had traded the problem of high CPU usage for a memory leak. Last night I saw:
```
USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START  TIME COMMAND
gopher   31875  0.2 13.3 2174076 2171960 ttyp4 S+   Jun09  7:08 lua port70.lua /home/spc/gopher/config.lua
```
(For the record—the VSZ and RSS values are the number of 4,096 byte blocks of memory, for various reasons beyond the scope of this post)
and just before lunch today:
```
USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START  TIME COMMAND
gopher    1572  0.2 11.1 1809672 1807644 ttyp4 S+   04:30  1:07 lua port70.lua /home/spc/gopher/config.lua
```
Not good. Especially when it's normally like:
```
USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START  TIME COMMAND
gopher   20711  0.2  0.0   10588    9000 ttyp4 S+   17:52  0:06 lua port70.lua /home/spc/gopher/config.lua
```
And then there was this in the logs:
```
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: coroutine thread: 0x8400404 dead: not enough memory
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: stack traceback:
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: [C]: in function 'org.conman.string.wrapt'
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: ...e/lua/5.4/org/conman/app/port70/handlers/blog/format.lua:34: in upvalue 'wrap_text'
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: ...e/lua/5.4/org/conman/app/port70/handlers/blog/format.lua:119: in upvalue 'run_flow'
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: ...e/lua/5.4/org/conman/app/port70/handlers/blog/format.lua:598: in function 'org.conman.app.port70.handlers.blog.format'
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: ...al/share/lua/5.4/org/conman/app/port70/handlers/blog.lua:227: in function <...al/share/lua/5.4/org/conman/app/port70/handlers/blog.lua:210>
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: (...tail calls...)
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: port70.lua:238: in function <port70.lua:205>
```
Yeah, “not enough memory” while running the new C-based text wrapping function.
I added a few checks to see if wrapt() was updating the character indices correctly
(and to log if it wasn't),
then went for lunch.
Afterwards,
I checked and lo,
memory was still growing,
and none of the checks I added had triggered.
Hmmm. Time to take down the gopher server and debug some code.
Sure enough, I found the one entry that went off the rails. What followed was a few hours of troubleshooting to find out why that one entry (and only that entry) blew up. And it came down to a difference in semantics between Lua and C (no, it had nothing to do with C's 0-based indexing and Lua's 1-based indexing). Basically, the Lua code looked like:
```lua
local breakhere -- this is nil
...
if ctype == 'space' then
  breakhere = i -- this is now an integer
  ...
end

if breakhere then
  table.insert(result,s:sub(front,breakhere - 1))
  ...
end
breakhere = nil
```
In Lua,
only the values nil and false are considered “false”—a 0 value is “true” in Lua.
In C,
any 0 value is considered “false.”
I knew this when translating the code,
but it never occurred to me that a break point of a line could be at the start of a line.
Until it was at the start of a line in that one entry—the code went into an infinite loop trying to wrap text, and the memory usage grew until it consumed the machine.
Sigh.
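Here's the pitfall in miniature (hypothetical code, not the actual port70 source):

```c
/* In Lua, "if breakhere then" is false only for nil and false, so the
 * index itself can double as the "do we have a breakpoint?" flag.  In
 * C, an index of 0 (a perfectly legal breakpoint at the start of the
 * line) tests as false. */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
  size_t breakhere = 0;   /* breakpoint found at offset 0 */
  
  if (breakhere)          /* C: 0 is false, so this branch is skipped */
    puts("break the line at the breakpoint");
  else
    puts("no breakpoint?  keep scanning ... forever");
  
  /* one fix: carry an explicit flag instead of overloading the index */
  bool have_break = true; /* set wherever breakhere is set */
  
  if (have_break)
    puts("break the line at offset 0 after all");
  
  return 0;
}
```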
It was easy enough to fix once the problem was diagnosed. And maybe now things will get back to normal.
Monday, June 17, 2024
When a file doesn't exist, yet it's right there
For reasons, I'm downloading a file from several hundred different domains. The results start trickling in and I decide to take a look at one of them:
```
[spc]lucy:/tmp/cache>ls 04d.co
04d.co
[spc]lucy:/tmp/cache>more 04d.co
Error: File or directory not found!
[spc]lucy:/tmp/cache>
```
Um … what?
Okay. There could be an invisible character or two in the filename.
```
[spc]lucy:/tmp/cache>echo * | hex | more
00000000: 30 34 64 2E 63 6F 20 31 34 33 36 2E 6E 69 6E 6A 04d.co 1436.ninj
...
[spc]lucy:/tmp/cache>
```
Nope. But what could cause the shell not to find the file when the file is right there! The name shouldn't matter. But I can't view it.
```
[spc]lucy:/tmp/cache>cat 04d.co
Error: File or directory not found!
[spc]lucy:/tmp/cache>
```
Okay, what if I try to move it somewhere?
```
[spc]lucy:~/tmp>cp /tmp/cache/04d.co .
[spc]lucy:~/tmp>more 04d.co
Error: File or directory not found!
[spc]lucy:~/tmp>
```
I can list the file:
```
[spc]lucy:~/tmp>ls -l 04d.co
-rw-r--r-- 1 spc spc 37 Jun 17 22:25 04d.co
[spc]lucy:~/tmp>
```
Let me try one last thing …
```
[spc]lucy:~/tmp>hex 04d.co
00000000: 45 72 72 6F 72 3A 20 46 69 6C 65 20 6F 72 20 64 Error: File or d
00000010: 69 72 65 63 74 6F 72 79 20 6E 6F 74 20 66 6F 75 irectory not fou
00000020: 6E 64 21 0D 0A                                  nd!..
```
I was downloading these files from the Internet. This particular site didn't have the file. The contents of the file are the error message.
Head. Meet desk.
This reminds me of college.
Back then,
the command finger was popular,
and when using it,
the program would attempt to locate a file called .plan in the home directory and display it.
I made a joke .plan file that read:
```
Bus error - core dumped
You have new mail.
```
I can't count the number of times someone tried to finger me,
only to attempt to remove a non-existent core file and impulsively check their email.
Sunday, June 23, 2024
How does TLS use less CPU than plain TCP, part II
I noticed over the past couple of days that the CPU utilization was similar between the two services, but looking at the logs it seemed my gopher server got hit by several bots. So last night, just as the logs reset (a weekly job), I decided to restart the two services so as to match CPU utilization and requests. Twelve hours later and …
service | CPU | requests |
---|---|---|
gopher | 0:28 | 175 |
gemini | 0:26 | 1744 |
A week and a half after calling “Mission Complete,” and apparently it isn't. Even worse, my gopher server is using a bit more CPU than the Gemini server while getting one-tenth the number of requests. Something is off here.
I checked the requests, and both servers receive over 90% of their requests for my blog, so I'm confident it's something to do with that. Both use the same request framework, both use the same code to read blog entries, and both use the same code to parse the HTML, so it's not that. The only difference left is the formatting of each entry. The gopher server does wrap the text to 80 columns (not needed for the Gemini server), and there's quite a bit more code to deal with fixing the links … are those the causes? It's hard to determine what's causing the excessive CPU utilization.
I'm going to have to think about how to determine the issue.
Heck, I'm going to have to think about whether this is even worth determining. I mean, the CPU utilization is better, but it's still puzzling.
The case of the weird profiling results
Okay, I'm flummoxed.
After a bit of thought, I had an idea on how to determine where the CPU time is going on my gopher server—I have at least two working hypotheses that I can test, wrapping text and translating links. So, to test these, I will get a list of requests since the logs rotated earlier today, and then run them through the gopher server (a second one set up on my public server just for these tests) to get a baseline; then remove the link translations and run the requests again; then remove the text wrapping and run the requests a third time and see what I get.
For the first test, the expected result is that the test gopher server ends up with the same CPU utilization as the one that normally runs. It makes sense, right? The normal one has x CPU utilization with y requests, so if I feed the test gopher server the same requests (even though all the requests come from the same location), it should have the same CPU utilization. Or at least close enough to show that I'm on the right track.
When I pulled the requests out of the log files and checked the current gopher server, it had received 332 requests and racked up 0:47 in CPU time. I set up the test gopher server (the only changes to the configuration being a different log file and a different port, which meant I didn't have to run it as root). I made the 332 requests to my test gopher server and got a CPU time of 0:22.
What?
I reran this test a few times and got the same results each time—0:22.
That's less than half the time of the normal gopher server with the same number of requests. If it was one or two seconds off, hey, close enough. But half?
It's not making sense.
But I decided to continue and run the other tests. First, I removed the code that does the link translation, reran the requests and got a CPU time of 0:20.
Okay, that tells me two things—one, the link translations do take time; two, they don't take enough time to explain the CPU utilization of the gopher server. Maybe. On a whim, I decided to change the third test to not even bother with processing blog entries—I modified the blog entry handler to just return. Given that ¾ of the requests are to the blog, this should make it run much faster and use less CPU time.
I got a CPU time of 0:18.
I don't know what to think. I would have expected this to be 0:05 or 0:06, given that 75% of the requests would now generate no output. Something weird is going on.
It was time to take a step back. I went back to the original benchmark, but instead of counting lines executed, I decided to count Lua VM instructions for the decently sized request.
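(One way to get such a count, and I'm not claiming this is exactly how I instrumented it, is a count hook via the Lua debug API:)

```c
/* Count Lua VM instructions with a count hook.  A sketch of one way to
 * do it, not necessarily how these numbers were gathered. */
#include <lua.h>

static unsigned long vm_count;

static void count_hook(lua_State *L,lua_Debug *ar)
{
  (void)L;
  (void)ar;
  vm_count++;                    /* fires once per VM instruction */
}

void count_start(lua_State *L)
{
  vm_count = 0;
  lua_sethook(L,count_hook,LUA_MASKCOUNT,1);
}

unsigned long count_stop(lua_State *L)
{
  lua_sethook(L,NULL,0,0);       /* remove the hook */
  return vm_count;
}
```

The results: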
service | Lua VM instructions |
---|---|
gopher | 61622 |
gemini | 67401 |
Okay, the gopher server is clearly executing fewer Lua VM instructions than the Gemini server. Could I gain any insight from profiling at the C level? I had already done most of the work to profile both the gopher and Gemini servers. To make sure I got enough data, I ran the decently sized request three times for each server.
First, the gopher server:

% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
13.79 | 0.04 | 0.04 | 185781 | 0.00 | 0.00 | luaV_execute |
10.34 | 0.07 | 0.03 | 734588 | 0.00 | 0.00 | index2value |
6.90 | 0.09 | 0.02 | 398225 | 0.00 | 0.00 | luaS_new |
6.90 | 0.11 | 0.02 | 45028 | 0.00 | 0.00 | luaH_newkey |
3.45 | 0.12 | 0.01 | 1041939 | 0.00 | 0.00 | yymatchChar |
3.45 | 0.13 | 0.01 | 924394 | 0.00 | 0.00 | luaG_traceexec |
3.45 | 0.14 | 0.01 | 503889 | 0.00 | 0.00 | yyText |
3.45 | 0.15 | 0.01 | 260252 | 0.00 | 0.00 | luaD_precall |
3.45 | 0.16 | 0.01 | 245893 | 0.00 | 0.00 | mainpositionTV |
3.45 | 0.17 | 0.01 | 201753 | 0.00 | 0.00 | auxgetstr |
3.45 | 0.18 | 0.01 | 191931 | 0.00 | 0.00 | yy_S |
3.45 | 0.19 | 0.01 | 185373 | 0.00 | 0.00 | equalkey |
3.45 | 0.20 | 0.01 | 134088 | 0.00 | 0.00 | yyDo |
3.45 | 0.21 | 0.01 | 129432 | 0.00 | 0.00 | yy_CHAR |
3.45 | 0.22 | 0.01 | 101937 | 0.00 | 0.00 | reverse |
3.45 | 0.23 | 0.01 | 34759 | 0.00 | 0.00 | luaH_getn |
3.45 | 0.24 | 0.01 | 4473 | 0.00 | 0.00 | getfirst |
3.45 | 0.25 | 0.01 | 1625 | 0.01 | 0.01 | traverseproto |
3.45 | 0.26 | 0.01 | 834 | 0.01 | 0.01 | strcore_wrapt |
3.45 | 0.27 | 0.01 | 61 | 0.16 | 0.16 | checkloops |
3.45 | 0.28 | 0.01 | 2 | 5.00 | 5.00 | deletelist |
3.45 | 0.29 | 0.01 | | | | cclasscmp |
0.00 | 0.29 | 0.00 | 924377 | 0.00 | 0.00 | luaD_hook |
0.00 | 0.29 | 0.00 | 391575 | 0.00 | 0.00 | yymatchString |
0.00 | 0.29 | 0.00 | 358374 | 0.00 | 0.00 | luaH_getshortstr |
0.00 | 0.29 | 0.00 | 261889 | 0.00 | 0.00 | prepCallInfo |
0.00 | 0.29 | 0.00 | 261296 | 0.00 | 0.00 | luaD_poscall |
And the Gemini server:

% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
9.38 | 0.03 | 0.03 | 1011065 | 0.00 | 0.00 | luaG_traceexec |
9.38 | 0.06 | 0.03 | 1011056 | 0.00 | 0.00 | luaD_hook |
7.81 | 0.09 | 0.03 | 204707 | 0.00 | 0.00 | luaV_execute |
6.25 | 0.10 | 0.02 | 443861 | 0.00 | 0.00 | luaS_new |
6.25 | 0.12 | 0.02 | 396470 | 0.00 | 0.00 | luaH_getshortstr |
3.12 | 0.14 | 0.01 | 668980 | 0.00 | 0.00 | index2value |
3.12 | 0.14 | 0.01 | 391575 | 0.00 | 0.00 | yymatchString |
3.12 | 0.15 | 0.01 | 271008 | 0.00 | 0.00 | mainpositionTV |
3.12 | 0.17 | 0.01 | 243043 | 0.00 | 0.00 | luaD_precall |
3.12 | 0.17 | 0.01 | 242834 | 0.00 | 0.00 | moveresults |
3.12 | 0.18 | 0.01 | 217137 | 0.00 | 0.00 | ccall |
3.12 | 0.20 | 0.01 | 202203 | 0.00 | 0.00 | hookf |
3.12 | 0.20 | 0.01 | 129432 | 0.00 | 0.00 | yy_CHAR |
3.12 | 0.21 | 0.01 | 56698 | 0.00 | 0.00 | llex |
3.12 | 0.23 | 0.01 | 39543 | 0.00 | 0.00 | internshrstr |
3.12 | 0.23 | 0.01 | 30301 | 0.00 | 0.00 | luaM_malloc_ |
3.12 | 0.24 | 0.01 | 23821 | 0.00 | 0.00 | luaH_realasize |
3.12 | 0.26 | 0.01 | 5906 | 0.00 | 0.00 | luaV_concat |
3.12 | 0.27 | 0.01 | 4149 | 0.00 | 0.00 | GCTM |
3.12 | 0.28 | 0.01 | 1872 | 0.01 | 0.02 | yy_PCDATA |
3.12 | 0.28 | 0.01 | 928 | 0.01 | 0.01 | match |
3.12 | 0.29 | 0.01 | 787 | 0.01 | 0.01 | luaF_newLclosure |
3.12 | 0.30 | 0.01 | 595 | 0.02 | 0.02 | luaK_int |
3.12 | 0.32 | 0.01 | 59 | 0.17 | 0.17 | l_strcmp |
1.56 | 0.32 | 0.01 | 12 | 0.42 | 0.42 | luaV_finishOp |
0.00 | 0.32 | 0.00 | 1041939 | 0.00 | 0.00 | yymatchChar |
This is not easy to interpret.
As expected,
the Lua VM shows up in the top spots for both,
but nothing really stands out.
It is nice to see that yymatchChar and yymatchString
(both in the HTML parsing module)
are called the same number of times
(expected),
but the times are different.
The Lua VM function index2value is called a different number of times,
but that might be due to code differences.
I think the higher percentage of time in the gopher server might be due to taking less time overall?
For instance,
in the gopher server,
deletelist has a self ms/call of 5,
but on the Gemini server it has a self ms/call of 0.52—is the garbage collector being called more often in the gopher server?
Or is it that, with less time overall,
it shows up with a higher per-call time?
It's only called twice in both codebases.
I'm not sure how to interpret this.
Somehow, the gopher server is doing less work than the Gemini server, yet accumulating more CPU time than the Gemini server, despite getting only about 10% as many requests. I don't understand how that can be.
I think I'll leave things alone for now. I went several years with a not-optimized gopher server, so I think I can live with this new version for now.
Thursday, June 27, 2024
IT HAS BEGUN! (“It” being Christmas-themed advertising)
Last night at the Cracker Barrel, Bunny and I saw store displays nominally for Halloween. But nestled in the middle of the display was an evergreen tree with ornaments hanging off it. Try as you might, Cracker Barrel, there is no such thing as a Halloween Tree (no disrespect, Mr. Bradbury)—it's still a Christmas Tree.
And then today I saw what is, to me, the first official advertisement for Christmas—a trailer for a movie called “Red One,” a muddy looking CGI slugfest where Santa Claus has been kidnapped and needs to be saved.
The movie isn't even slated to be released until November! Couldn't they have at least waited until October to start advertising it? No?
Sigh.
Sunday, June 30, 2024
Four lines of code … it was four lines of code
I checked last night, and found the following:
service | CPU | requests |
---|---|---|
gopher | 28:37 | 8360 |
gemini | 6:02 | 24345 |
How can a TCP service take more time than a TLS service? Especially when TLS runs over TCP? And when TLS generates six times the packets that TCP does?
There's some difference in how I handle TCP and TLS that's causing this. Profiling the code didn't reveal much as my attempts kept showing that gopher should be using less CPU than Gemini, so then it was just a matter of staring at code and thinking hard.
It wasn't easy. Yes, there are some differences between the TCP interface and the TLS interface, so it was a matter of going through the code, bringing them closer in implementation and seeing if that fixed things. I would fix the code, run a test, see it not make a difference, and try again. Eventually, I came across this bit of code:
```lua
-- ---------------------------------------------------------------------
-- It might be a bug *somewhere*, but on Linux, this is required to get
-- Unix domain sockets to work with the NFL driver.  There's a race
-- condition where writing data then calling close() may cause the
-- other side to receive no data.  This does NOT appear to happen with
-- TCP sockets, but this doesn't hurt the TCP side in any case.
-- ---------------------------------------------------------------------

while self.__socket.sendqueue and ios.__socket.sendqueue > 0 do
  nfl.SOCKETS:update(self.__socket,'w')
  coroutine.yield()
end
```
This bit of code wasn't in the TLS implementation, and as the comment says, it “shouldn't” hurt the TCP path, but hey, it was only needed for local (or Unix) domain sockets to begin with, so let me try removing that bit of code and run a test.
With the code in place:

service | CPU | real | user | sys |
---|---|---|---|---|
gopher | 18 | 46.726 | 13.971 | 4.059 |
gemini | 4 | 107.928 | 3.951 | 0.322 |
And with it removed:

service | CPU | real | user | sys |
---|---|---|---|---|
gopher | 0 | 52.403 | 0.635 | 0.185 |
gemini | 4 | 103.290 | 3.957 | 0.285 |
THAT'S more like it!
Now, about that code … this is my best guess as to what was going on.
The sendqueue field is a Linux-only field,
so I'm checking to see if it exists,
and then checking the value
(if only to avoid the “attempt to compare number with nil” Lua error on non-Linux systems).
This returns the number of bytes the kernel has yet to send,
so we switch the socket to trigger an event when we can write to it,
and yield.
Even worse,
the check isn't just a simple variable fetch,
but ends up being an ioctl() system call.
Twice!
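On Linux, that check presumably boils down to something like this (a sketch using the SIOCOUTQ ioctl; not the actual org.conman.net code):

```c
/* How many bytes are still sitting in the kernel's send queue for this
 * socket?  A sketch using Linux's SIOCOUTQ ioctl. */
#include <sys/ioctl.h>
#include <linux/sockios.h>   /* SIOCOUTQ */

static int send_queue_bytes(int fd)
{
  int pending;
  
  if (ioctl(fd,SIOCOUTQ,&pending) < 0)
    return -1;               /* not supported, or some other error */
  return pending;            /* bytes written but not yet sent     */
}
```

One system call per property lookup, and the loop condition above does two lookups per iteration.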
Now the “write” event code:
```lua
if event.write then
  if #ios.__output > 0 then
    local bytes,err = ios.__socket:send(nil,ios.__output)
    if err == 0 then
      ios.__wbytes = ios.__wbytes + bytes
      ios.__output = ios.__output:sub(bytes + 1,-1)
    else
      syslog('error',"socket:send() = %s",errno[err])
      nfl.SOCKETS:remove(ios.__socket)
      nfl.schedule(ios.__co,false,errno[err],err)
    end
  end
  
  if #ios.__output == 0 then
    nfl.SOCKETS:update(ios.__socket,'r')
    nfl.schedule(ios.__co,true)
  end
end
```
There's probably no data to be sent (ios.__output is empty),
so we immediately switch the socket back to trigger for readability,
then resume the coroutine handling the socket.
It goes back to the while loop,
sets the socket back to trigger on writability,
and keeps bouncing like that until the kernel buffer has been sent.
As I wrote, this fixed an issue I had with local domain sockets—it wasn't an issue with an IP socket, and in fact, was the wrong thing to do for an IP socket.
And as it is with these types of bugs, finding the root cause is not trivial, but the fix almost always is.