Saturday, June 01, 2024
Falettinme Be Mice Elf Ag
Bunny and I were in my car when this song from Sly and the Family Stone came on. On the screen was the title, “Thank You (Falettinme Be Mice Elf Ag)”. Bunny attempted to read it and was having a hard time with it; I suggested that perhaps XM Radio had used some form of audio transcription that went wrong somewhere, because Sly and the Family were clearly singing “thank you for letting me be myself.”
I was about to make a satirical post about it, but when I was looking up the song … well … that's the actual name! Okay, it's actually “Thank You (Falettinme Be Mice Elf Agin)” (the screen wasn't wide enough for the entire title). But still … that's the actual name of the song! Sheesh!
Monday, June 03, 2024
Just a simple matter of replacing a slow Lua function with a faster C function
I spent the past few days rewriting some Lua code into C. While I find LPEG to be convenient, it is not necessarily fast. Normally this isn't an issue but in this case, I was calling LPEG for each character in a blog post.
Fortunately,
it was fairly straightforward to port the code to C.
The code goes through the text one codepoint at a time.
If it's a whitespace character or a hyphen,
I mark the current position as a possible breakpoint for the text;
otherwise I ignore combining characters
(they don't count towards the line length).
Then,
when I reach past the number of characters I want for a line,
I copy out the string from the beginning of the “line” to the marked breakpoint
(and if there isn't a breakpoint,
there is no good place to break the line so I will break the line at the line length—not much else to do),
then mark the beginning of the next line and continue until the end of the text.
The hardest part was figuring out how to classify each character. In the end, I pull each Unicode codepoint out of the UTF-8 text and look through an array to classify the codepoint as whitespace, a hyphen or a combining character; if it isn't in the table, it's just a normal character.
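In case it helps to see the shape of that loop, here's a sketch in C. This is hypothetical code with a toy classification table, not the actual org.conman.string.wrapt implementation, and it assumes the input is valid UTF-8.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CH_NORMAL , CH_SPACE , CH_HYPHEN , CH_COMBINE } chclass__t;

/* tiny stand-in for the real classification table */
static chclass__t classify(uint32_t cp)
{
  if ((cp == ' ') || (cp == '\t'))      return CH_SPACE;
  if ((cp == '-') || (cp == 0x2010))    return CH_HYPHEN;  /* '-' and U+2010 HYPHEN     */
  if ((cp >= 0x0300) && (cp <= 0x036F)) return CH_COMBINE; /* combining diacritic block */
  return CH_NORMAL;
}

/* decode one codepoint of (assumed valid) UTF-8; return bytes consumed */
static size_t utf8_decode(const char *s,uint32_t *cp)
{
  const unsigned char *p = (const unsigned char *)s;
  
  if (p[0] < 0x80) { *cp = p[0]; return 1; }
  if (p[0] < 0xE0) { *cp = (uint32_t)(p[0] & 0x1F) <<  6 | (p[1] & 0x3F); return 2; }
  if (p[0] < 0xF0) { *cp = (uint32_t)(p[0] & 0x0F) << 12 | (uint32_t)(p[1] & 0x3F) << 6 | (p[2] & 0x3F); return 3; }
  *cp = (uint32_t)(p[0] & 0x07) << 18 | (uint32_t)(p[1] & 0x3F) << 12 | (uint32_t)(p[2] & 0x3F) << 6 | (p[3] & 0x3F);
  return 4;
}

static void wrap(const char *text,size_t width)
{
  size_t front = 0; /* byte offset where the current line starts       */
  size_t brk   = 0; /* byte offset just past the last breakpoint       */
  int    haveb = 0; /* did we see a breakpoint on this line?           */
  size_t len   = 0; /* characters on this line (combining don't count) */
  size_t i     = 0; /* current byte offset                             */
  
  while (text[i] != '\0')
  {
    uint32_t cp;
    size_t   n = utf8_decode(&text[i],&cp);
    
    switch(classify(cp))
    {
      case CH_SPACE:
      case CH_HYPHEN:  brk = i + n; haveb = 1; len++; break; /* possible breakpoint */
      case CH_COMBINE:                                break; /* zero width          */
      case CH_NORMAL:                          len++; break;
    }
    
    i += n;
    
    if (len > width)
    {
      /* no breakpoint?  nothing better to do than break at the limit */
      size_t end = haveb ? brk : i;
      printf("%.*s\n",(int)(end - front),&text[front]);
      front = end;
      i     = end; /* rescan anything past the breakpoint */
      haveb = 0;
      len   = 0;
    }
  }
  
  if (i > front)
    printf("%.*s\n",(int)(i - front),&text[front]);
}

int main(void)
{
  wrap("A line of text that is long enough to need wrapping somewhere.",20);
  return 0;
}
```

The actual function hands the broken lines back to Lua rather than printing them, and its classification table covers far more of Unicode than this stand-in does.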
As a sanity check, I reran the original profiling test:
service | lines of Lua code executed |
---|---|
gopher (original) | 457035 |
gopher (new) | 18246 |
gemini (just because) | 22661 |
Much better.
And most of the 457,035 lines of code being executed are now hidden behind C.
Now to make sure the code is actually faster,
I profiled the new wrapt() function:
```lua
local wraptx = wrapt
local function wrapt(...)
  local start = rdtsc()
  local res   = wraptx(...)
  local stop  = rdtsc()
  syslog('notice',"wrapt()=%d",stop-start)
  return res
end
```
with the decently sized request I used before (each line is a call to wrapt()):
Lua code | C code |
---|---|
43330 | 11810 |
43440 | 12000 |
45300 | 12220 |
48100 | 12020 |
48680 | 13690 |
49260 | 12650 |
54140 | 12270 |
54650 | 12460 |
58530 | 12130 |
59760 | 14180 |
61100 | 15480 |
65440 | 14970 |
67920 | 15810 |
68750 | 15310 |
69920 | 17170 |
69960 | 17780 |
70740 | 16510 |
75640 | 16750 |
78870 | 19170 |
83200 | 18190 |
87090 | 17290 |
89070 | 23360 |
91440 | 19560 |
101800 | 21520 |
102460 | 21060 |
103790 | 22180 |
106000 | 22400 |
106010 | 21870 |
112960 | 21160 |
115300 | 21870 |
115980 | 23130 |
118690 | 24980 |
122550 | 23960 |
122710 | 24550 |
127610 | 23830 |
129580 | 24670 |
130120 | 24930 |
140580 | 26570 |
141930 | 25210 |
157640 | 27050 |
168000 | 32250 |
Excellent! The new code is three to five times faster. Now to just sit back and see how the new code fares over the next few days.
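As an aside, the rdtsc() in that profiling wrapper is a small C binding to the CPU's time-stamp counter. A minimal sketch of such a binding, assuming x86 and GCC's __rdtsc() intrinsic (this is not the module I actually use), might look like:

```c
/* Minimal Lua binding for the x86 time-stamp counter.  A sketch, not
 * the actual module; assumes GCC or Clang on x86/x86-64. */
#include <x86intrin.h>
#include <lua.h>

static int Lrdtsc(lua_State *L)
{
  lua_pushinteger(L,(lua_Integer)__rdtsc()); /* raw cycle count */
  return 1;
}

int luaopen_rdtsc(lua_State *L)
{
  lua_pushcfunction(L,Lrdtsc);
  return 1;
}
```

Loaded with something like local rdtsc = require "rdtsc", it's cheap enough to call on every wrapt() invocation.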
Wednesday, June 05, 2024
I wonder who isn't getting the point
Yes, there are a few groups I follow on Discord. One of the groups has a link to another site I want to view and I click on it. I get:
Leaving Discord
This link is taking you to the following website
https://github.com/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Trust github.com links from now on
God, I just love XXXXXXX walled gardens! I just love how it's there to protect me. You mean … a link … goes to a non-Discord site? Oh my God! Thank you for warning me! My computer might have been pwned by Microsoft!
Sheesh.
Thursday, June 06, 2024
Stats on some optimized small internet servers
I restarted my Gopher and Gemini servers about 25 hours ago to ensure they're both running the new optimized code, so now let's check some results:
service | CPU utilization | requests | bytes |
---|---|---|---|
gopher | 2:42 | 935 | 1186591 |
gemini | 8:58 | 36394 | 249020812 |
That's more like it. The gopher server, running over plain TCP and getting about 1/40th the requests, is finally accumulating less CPU time than the Gemini server (and the only reason it's not even lower is that my gopher server has to deal with wrapping Unicode text).
I also find it amusing that Gemini, a protocol that has only been around for five years, is way more popular than gopher, a protocol that's been around for thirty-three years. I guess Gemini being the new shiny is the reason.
Wednesday, June 12, 2024
Just when I thought it was safe to run a gopher server
It appeared I had traded the problem of high CPU usage for a memory leak. Last night I saw:
```
USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START  TIME COMMAND
gopher   31875  0.2 13.3 2174076 2171960 ttyp4 S+   Jun09  7:08 lua port70.lua /home/spc/gopher/config.lua
```
(For the record—the VSZ and RSS values are the number of 4,096 byte blocks of memory, for various reasons beyond the scope of this post)
and just before lunch today:
```
USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START  TIME COMMAND
gopher    1572  0.2 11.1 1809672 1807644 ttyp4 S+   04:30  1:07 lua port70.lua /home/spc/gopher/config.lua
```
Not good. Especially when it's normally like:
```
USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START  TIME COMMAND
gopher   20711  0.2  0.0   10588    9000 ttyp4 S+   17:52  0:06 lua port70.lua /home/spc/gopher/config.lua
```
And then there was this in the logs:
```
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: coroutine thread: 0x8400404 dead: not enough memory
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: stack traceback:
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: [C]: in function 'org.conman.string.wrapt'
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: ...e/lua/5.4/org/conman/app/port70/handlers/blog/format.lua:34: in upvalue 'wrap_text'
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: ...e/lua/5.4/org/conman/app/port70/handlers/blog/format.lua:119: in upvalue 'run_flow'
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: ...e/lua/5.4/org/conman/app/port70/handlers/blog/format.lua:598: in function 'org.conman.app.port70.handlers.blog.format'
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: ...al/share/lua/5.4/org/conman/app/port70/handlers/blog.lua:227: in function <...al/share/lua/5.4/org/conman/app/port70/handlers/blog.lua:210>
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: (...tail calls...)
Jun 12 09:47:54 daemon err 71.19.142.20 gopher CRASH: thread: 0x8400404: port70.lua:238: in function <port70.lua:205>
```
Yeah, “not enough memory” while running the new C-based text wrapping function.
I added a few checks to see if wrapt() was updating the character indices correctly
(and to log if it wasn't),
then went for lunch.
Afterwards,
I checked and lo,
memory was still growing,
and none of the checks I added had triggered.
Hmmm. Time to take down the gopher server and debug some code.
Sure enough, I found the one entry that went off the rails. What followed was a few hours of troubleshooting to find out why that one entry (and only that entry) blew up. And it came down to a difference in semantics between Lua and C (no, it had nothing to do with C's 0-based indexing and Lua's 1-based indexing). Basically, the Lua code looked like:
```lua
local breakhere -- this is nil
...
if ctype == 'space' then
  breakhere = i -- this is now an integer
  ...
end

if breakhere then
  table.insert(result,s:sub(front,breakhere - 1))
  ...
end
breakhere = nil
```
In Lua,
only the values nil and false are considered “false”—a 0 value is “true” in Lua.
In C,
any 0 value is considered “false.”
I knew this when translating the code,
but it never occurred to me that a break point of a line could be at the start of a line.
Until it was at the start of a line in that one entry—the code went into an infinite loop trying to wrap text, and the memory usage grew until it consumed the machine.
Sigh.
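Here's the pitfall in miniature (hypothetical code, not the actual port70 source):

```c
/* In Lua, "if breakhere then" is false only for nil and false, so the
 * index itself can double as the "do we have a breakpoint?" flag.  In
 * C, an index of 0 (a perfectly legal breakpoint at the start of the
 * line) tests as false. */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
  size_t breakhere = 0;   /* breakpoint found at offset 0 */
  
  if (breakhere)          /* C: 0 is false, so this branch is skipped */
    puts("break the line at the breakpoint");
  else
    puts("no breakpoint?  keep scanning ... forever");
  
  /* one fix: carry an explicit flag instead of overloading the index */
  bool have_break = true; /* set wherever breakhere is set */
  
  if (have_break)
    puts("break the line at offset 0 after all");
  
  return 0;
}
```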
It was easy enough to fix once the problem was diagnosed. And maybe now things will get back to normal.
Monday, June 17, 2024
When a file doesn't exist, yet it's right there
For reasons, I'm downloading a file from several hundred different domains. The results start trickling in and I decide to take a look at one of them:
```
[spc]lucy:/tmp/cache>ls 04d.co
04d.co
[spc]lucy:/tmp/cache>more 04d.co
Error: File or directory not found!
[spc]lucy:/tmp/cache>
```
Um … what?
Okay. There could be an invisible character or two in the filename.
```
[spc]lucy:/tmp/cache>echo * | hex | more
00000000: 30 34 64 2E 63 6F 20 31 34 33 36 2E 6E 69 6E 6A 04d.co 1436.ninj
...
[spc]lucy:/tmp/cache>
```
Nope. But what could cause the shell not to find the file when the file is right there! The name shouldn't matter. But I can't view it.
```
[spc]lucy:/tmp/cache>cat 04d.co
Error: File or directory not found!
[spc]lucy:/tmp/cache>
```
Okay, what if I try to move it somewhere?
```
[spc]lucy:~/tmp>cp /tmp/cache/04d.co .
[spc]lucy:~/tmp>more 04d.co
Error: File or directory not found!
[spc]lucy:~/tmp>
```
I can list the file:
```
[spc]lucy:~/tmp>ls -l 04d.co
-rw-r--r-- 1 spc spc 37 Jun 17 22:25 04d.co
[spc]lucy:~/tmp>
```
Let me try one last thing …
```
[spc]lucy:~/tmp>hex 04d.co
00000000: 45 72 72 6F 72 3A 20 46 69 6C 65 20 6F 72 20 64 Error: File or d
00000010: 69 72 65 63 74 6F 72 79 20 6E 6F 74 20 66 6F 75 irectory not fou
00000020: 6E 64 21 0D 0A                                  nd!..
```
I was downloading these files from the Internet. This particular site didn't have the file. The contents of the file are the error message.
Head. Meet desk.
This reminds me of college.
Back then,
the command finger was popular,
and when using it,
the program would attempt to locate a file called .plan in the home directory and display it.
I made a joke .plan file that read:
```
Bus error - core dumped
You have new mail.
```
I can't count the number of times someone tried to finger me,
only to attempt to remove a non-existent core file and impulsively check their email.
Sunday, June 23, 2024
How does TLS use less CPU than plain TCP, part II
I noticed over the past couple of days that the CPU utilization was similar between the two services, but looking at the logs it seemed my gopher server got hit by several bots. So last night, just as the logs reset (a weekly job), I decided to restart the two services so as to match CPU utilization and requests. Twelve hours later and …
service | CPU | requests |
---|---|---|
gopher | 0:28 | 175 |
gemini | 0:26 | 1744 |
A week and a half after calling “Mission Complete,” and apparently it isn't. Even worse, my gopher server is using a bit more CPU than the Gemini server while getting one-tenth the number of requests. Something is off here.
I checked the requests, and both servers receive over 90% of their requests for my blog, so I'm confident it's something to do with that. Both use the same request framework, both use the same code to read blog entries, and both use the same code to parse the HTML, so it's not that. The only difference left is the formatting of each entry. The gopher server does wrap the text to 80 columns (not needed for the Gemini server), and there's quite a bit more code to deal with fixing the links … are those the causes? It's hard to determine what's causing the excessive CPU utilization.
I'm going to have to think about how to determine the issue.
Heck, I'm going to have to think about whether this is even worth determining. I mean, the CPU utilization is better, but it's still puzzling.
The case of the weird profiling results
Okay, I'm flummoxed.
After a bit of thought, I had an idea on how to determine where the CPU time is going on my gopher server—I have at least two working hypotheses that I can test, wrapping text and translating links. So, to test these, I will get a list of requests since the logs rotated earlier today, and then run them through the gopher server (a second one set up on my public server just for these tests) to get a baseline; then remove the link translations and run the requests again; then remove the text wrapping and run the requests a third time and see what I get.
For the first test, the expected result is that the test gopher server ends up with the same CPU utilization as the one that normally runs. It makes sense, right? The normal one has x CPU utilization with y requests, so if I feed the test gopher server the same requests (even though all the requests come from the same location), it should have the same CPU utilization. Or at least close enough to show that I'm on the right track.
When I pulled the requests out of the log files and checked the current gopher server, it had received 332 requests and racked up 0:47 in CPU time. I set up the test gopher server (the only changes to the configuration being a different log file and a different port, which meant I didn't have to run it as root). I made the 332 requests to my test gopher server and got a CPU time of 0:22.
What?
I reran this test a few times and got the same results each time—0:22.
That's less than half the time of the normal gopher server with the same number of requests. If it was one or two seconds off, hey, close enough. But half?
It's not making sense.
But I decided to continue and run the other tests. First, I removed the code that does the link translation, reran the requests and got a CPU time of 0:20.
Okay, that tells me two things—one, the link translations do take time; two, they don't take enough time to explain the CPU utilization of the gopher server. Maybe. On a whim, I decided to change the third test to not even bother with processing blog entries—I modified the blog entry handler to just return. Given that ¾ of the requests are to the blog, this should make it run much faster and use less CPU time.
I got a CPU time of 0:18.
I don't know what to think. I would have expected this to be 0:05 or 0:06, given that 75% of the requests would now generate no output. Something weird is going on.
It was time to take a step back. I went back to the original benchmark, but instead of counting lines executed, I decided to count Lua VM instructions for the decently sized request.
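(One way to get such a count, and I'm not claiming this is exactly how I instrumented it, is a count hook via the Lua debug API:)

```c
/* Count Lua VM instructions with a count hook.  A sketch of one way to
 * do it, not necessarily how these numbers were gathered. */
#include <lua.h>

static unsigned long vm_count;

static void count_hook(lua_State *L,lua_Debug *ar)
{
  (void)L;
  (void)ar;
  vm_count++;                    /* fires once per VM instruction */
}

void count_start(lua_State *L)
{
  vm_count = 0;
  lua_sethook(L,count_hook,LUA_MASKCOUNT,1);
}

unsigned long count_stop(lua_State *L)
{
  lua_sethook(L,NULL,0,0);       /* remove the hook */
  return vm_count;
}
```

The results: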
service | Lua VM instructions |
---|---|
gopher | 61622 |
gemini | 67401 |
Okay, the gopher server is clearly executing fewer Lua VM instructions than the Gemini server. Could I gain any insight from profiling at the C level? I had already done most of the work to profile both the gopher and Gemini servers. To make sure I got enough data, I ran the decently sized request three times for each server.
First, the gopher server:

% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
13.79 | 0.04 | 0.04 | 185781 | 0.00 | 0.00 | luaV_execute |
10.34 | 0.07 | 0.03 | 734588 | 0.00 | 0.00 | index2value |
6.90 | 0.09 | 0.02 | 398225 | 0.00 | 0.00 | luaS_new |
6.90 | 0.11 | 0.02 | 45028 | 0.00 | 0.00 | luaH_newkey |
3.45 | 0.12 | 0.01 | 1041939 | 0.00 | 0.00 | yymatchChar |
3.45 | 0.13 | 0.01 | 924394 | 0.00 | 0.00 | luaG_traceexec |
3.45 | 0.14 | 0.01 | 503889 | 0.00 | 0.00 | yyText |
3.45 | 0.15 | 0.01 | 260252 | 0.00 | 0.00 | luaD_precall |
3.45 | 0.16 | 0.01 | 245893 | 0.00 | 0.00 | mainpositionTV |
3.45 | 0.17 | 0.01 | 201753 | 0.00 | 0.00 | auxgetstr |
3.45 | 0.18 | 0.01 | 191931 | 0.00 | 0.00 | yy_S |
3.45 | 0.19 | 0.01 | 185373 | 0.00 | 0.00 | equalkey |
3.45 | 0.20 | 0.01 | 134088 | 0.00 | 0.00 | yyDo |
3.45 | 0.21 | 0.01 | 129432 | 0.00 | 0.00 | yy_CHAR |
3.45 | 0.22 | 0.01 | 101937 | 0.00 | 0.00 | reverse |
3.45 | 0.23 | 0.01 | 34759 | 0.00 | 0.00 | luaH_getn |
3.45 | 0.24 | 0.01 | 4473 | 0.00 | 0.00 | getfirst |
3.45 | 0.25 | 0.01 | 1625 | 0.01 | 0.01 | traverseproto |
3.45 | 0.26 | 0.01 | 834 | 0.01 | 0.01 | strcore_wrapt |
3.45 | 0.27 | 0.01 | 61 | 0.16 | 0.16 | checkloops |
3.45 | 0.28 | 0.01 | 2 | 5.00 | 5.00 | deletelist |
3.45 | 0.29 | 0.01 | | | | cclasscmp |
0.00 | 0.29 | 0.00 | 924377 | 0.00 | 0.00 | luaD_hook |
0.00 | 0.29 | 0.00 | 391575 | 0.00 | 0.00 | yymatchString |
0.00 | 0.29 | 0.00 | 358374 | 0.00 | 0.00 | luaH_getshortstr |
0.00 | 0.29 | 0.00 | 261889 | 0.00 | 0.00 | prepCallInfo |
0.00 | 0.29 | 0.00 | 261296 | 0.00 | 0.00 | luaD_poscall |
And the Gemini server:

% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
9.38 | 0.03 | 0.03 | 1011065 | 0.00 | 0.00 | luaG_traceexec |
9.38 | 0.06 | 0.03 | 1011056 | 0.00 | 0.00 | luaD_hook |
7.81 | 0.09 | 0.03 | 204707 | 0.00 | 0.00 | luaV_execute |
6.25 | 0.10 | 0.02 | 443861 | 0.00 | 0.00 | luaS_new |
6.25 | 0.12 | 0.02 | 396470 | 0.00 | 0.00 | luaH_getshortstr |
3.12 | 0.14 | 0.01 | 668980 | 0.00 | 0.00 | index2value |
3.12 | 0.14 | 0.01 | 391575 | 0.00 | 0.00 | yymatchString |
3.12 | 0.15 | 0.01 | 271008 | 0.00 | 0.00 | mainpositionTV |
3.12 | 0.17 | 0.01 | 243043 | 0.00 | 0.00 | luaD_precall |
3.12 | 0.17 | 0.01 | 242834 | 0.00 | 0.00 | moveresults |
3.12 | 0.18 | 0.01 | 217137 | 0.00 | 0.00 | ccall |
3.12 | 0.20 | 0.01 | 202203 | 0.00 | 0.00 | hookf |
3.12 | 0.20 | 0.01 | 129432 | 0.00 | 0.00 | yy_CHAR |
3.12 | 0.21 | 0.01 | 56698 | 0.00 | 0.00 | llex |
3.12 | 0.23 | 0.01 | 39543 | 0.00 | 0.00 | internshrstr |
3.12 | 0.23 | 0.01 | 30301 | 0.00 | 0.00 | luaM_malloc_ |
3.12 | 0.24 | 0.01 | 23821 | 0.00 | 0.00 | luaH_realasize |
3.12 | 0.26 | 0.01 | 5906 | 0.00 | 0.00 | luaV_concat |
3.12 | 0.27 | 0.01 | 4149 | 0.00 | 0.00 | GCTM |
3.12 | 0.28 | 0.01 | 1872 | 0.01 | 0.02 | yy_PCDATA |
3.12 | 0.28 | 0.01 | 928 | 0.01 | 0.01 | match |
3.12 | 0.29 | 0.01 | 787 | 0.01 | 0.01 | luaF_newLclosure |
3.12 | 0.30 | 0.01 | 595 | 0.02 | 0.02 | luaK_int |
3.12 | 0.32 | 0.01 | 59 | 0.17 | 0.17 | l_strcmp |
1.56 | 0.32 | 0.01 | 12 | 0.42 | 0.42 | luaV_finishOp |
0.00 | 0.32 | 0.00 | 1041939 | 0.00 | 0.00 | yymatchChar |
This is not easy to interpret.
As expected,
the Lua VM shows up in the top spots for both,
but nothing really stands out.
It is nice to see that yymatchChar and yymatchString
(both in the HTML parsing module)
are called the same number of times
(expected),
but the times are different.
The Lua VM function index2value is called a different number of times,
but that might be due to code differences.
I think the higher percentage of time in the gopher server might be due to taking less time overall?
For instance,
in the gopher server,
deletelist has a self ms/call of 5,
but on the Gemini server it has a self ms/call of 0.52—is the garbage collector being called more often in the gopher server?
Or is it that, with less time overall,
it shows up with a higher per-call time?
It's only called twice in both codebases.
I'm not sure how to interpret this.
Somehow, the gopher server is doing less work than the Gemini server, yet accumulating more CPU time than the Gemini server, despite getting only about 10% as many requests. I don't understand how that can be.
I think I'll leave things alone for now. I went several years with a not-optimized gopher server, so I think I can live with this new version for now.
Thursday, June 27, 2024
IT HAS BEGUN! (“It” being Christmas-themed advertising)
Last night at the Cracker Barrel, Bunny and I saw store displays nominally for Halloween. But nestled in the middle of the display was an evergreen tree with ornaments hanging off it. Try as you might, Cracker Barrel, there is no such thing as a Halloween Tree (no disrespect, Mr. Bradbury)—it's still a Christmas Tree.
And then today I saw what is, to me, the first official advertisement for Christmas—a trailer for a movie called “Red One,” a muddy looking CGI slugfest where Santa Claus has been kidnapped and needs to be saved.
The movie isn't even slated to be released until November! Couldn't they have at least waited until October to start advertising it? No?
Sigh.
Sunday, June 30, 2024
Four lines of code … it was four lines of code
I checked last night, and found the following:
service | CPU | requests |
---|---|---|
gopher | 28:37 | 8360 |
gemini | 6:02 | 24345 |
How can a TCP service take more time than a TLS service? Especially when TLS runs over TCP? And when TLS generates six times the packets that TCP does?
There's some difference in how I handle TCP and TLS that's causing this. Profiling the code didn't reveal much as my attempts kept showing that gopher should be using less CPU than Gemini, so then it was just a matter of staring at code and thinking hard.
It wasn't easy. Yes, there are some differences between the TCP interface and the TLS interface, so it was a matter of going through the code, bringing them closer in implementation and seeing if that fixed things. I would fix the code, run a test, see it not make a difference, and try again. Eventually, I came across this bit of code:
```lua
-- ---------------------------------------------------------------------
-- It might be a bug *somewhere*, but on Linux, this is required to get
-- Unix domain sockets to work with the NFL driver.  There's a race
-- condition where writing data then calling close() may cause the
-- other side to receive no data.  This does NOT appear to happen with
-- TCP sockets, but this doesn't hurt the TCP side in any case.
-- ---------------------------------------------------------------------

while self.__socket.sendqueue and ios.__socket.sendqueue > 0 do
  nfl.SOCKETS:update(self.__socket,'w')
  coroutine.yield()
end
```
This bit of code wasn't in the TLS implementation, and as the comment says, it “shouldn't” hurt the TCP path, but hey, it was only needed for local (or Unix) domain sockets to begin with, so let me try removing that bit of code and run a test.
With the code in place:

service | CPU | real | user | sys |
---|---|---|---|---|
gopher | 18 | 46.726 | 13.971 | 4.059 |
gemini | 4 | 107.928 | 3.951 | 0.322 |
And with it removed:

service | CPU | real | user | sys |
---|---|---|---|---|
gopher | 0 | 52.403 | 0.635 | 0.185 |
gemini | 4 | 103.290 | 3.957 | 0.285 |
THAT'S more like it!
Now, about that code … this is my best guess as to what was going on.
The sendqueue field is a Linux-only field,
so I'm checking to see if it exists,
and then checking the value
(if only to avoid the “attempt to compare number with nil” Lua error on non-Linux systems).
This returns the number of bytes the kernel has yet to send,
so we switch the socket to trigger an event when we can write to it,
and yield.
Even worse,
the check isn't just a simple variable fetch,
but ends up being an ioctl() system call.
Twice!
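On Linux, that check presumably boils down to something like this (a sketch using the SIOCOUTQ ioctl; not the actual org.conman.net code):

```c
/* How many bytes are still sitting in the kernel's send queue for this
 * socket?  A sketch using Linux's SIOCOUTQ ioctl. */
#include <sys/ioctl.h>
#include <linux/sockios.h>   /* SIOCOUTQ */

static int send_queue_bytes(int fd)
{
  int pending;
  
  if (ioctl(fd,SIOCOUTQ,&pending) < 0)
    return -1;               /* not supported, or some other error */
  return pending;            /* bytes written but not yet sent     */
}
```

One system call per property lookup, and the loop condition above does two lookups per iteration.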
Now the “write” event code:
```lua
if event.write then
  if #ios.__output > 0 then
    local bytes,err = ios.__socket:send(nil,ios.__output)
    if err == 0 then
      ios.__wbytes = ios.__wbytes + bytes
      ios.__output = ios.__output:sub(bytes + 1,-1)
    else
      syslog('error',"socket:send() = %s",errno[err])
      nfl.SOCKETS:remove(ios.__socket)
      nfl.schedule(ios.__co,false,errno[err],err)
    end
  end
  
  if #ios.__output == 0 then
    nfl.SOCKETS:update(ios.__socket,'r')
    nfl.schedule(ios.__co,true)
  end
end
```
There's probably no data to be sent (ios.__output is empty),
so we immediately switch the socket back to trigger for readability,
then resume the coroutine handling the socket.
It goes back to the while loop,
sets the socket back to trigger on writability,
and keeps bouncing like that until the kernel buffer has been sent.
As I wrote, this fixed an issue I had with local domain sockets—it wasn't an issue with an IP socket, and in fact, was the wrong thing to do for an IP socket.
And as it is with these types of bugs, finding the root cause is not trivial, but the fix almost always is.