The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Sunday, July 21, 2024

Fixing an Apache pthread error

After I fixed the performance issue, I started looking around for some other issues to handle, and boy, did I find some. Checking the error log from Apache I found:

[Fri Jul 12 15:04:01.762845 2024] [mpm_worker:alert] [pid 31979:tid 3924761520] (12)Cannot allocate memory: AH03142: ap_thread_create: unable to create worker thread
libgcc_s.so.1 must be installed for pthread_cancel to work
[Fri Jul 12 15:04:02.731958 2024] [core:notice] [pid 19646:tid 4149040832] AH00052: child pid 31979 exit signal Aborted (6)
[Fri Jul 12 15:04:02.735360 2024] [mpm_worker:alert] [pid 32021:tid 3924761520] (12)Cannot allocate memory: AH03142: ap_thread_create: unable to create worker thread
libgcc_s.so.1 must be installed for pthread_cancel to work
[Fri Jul 12 15:04:03.733536 2024] [core:notice] [pid 19646:tid 4149040832] AH00052: child pid 32021 exit signal Aborted (6)
[Fri Jul 12 15:04:03.736857 2024] [mpm_worker:alert] [pid 32063:tid 3924761520] (12)Cannot allocate memory: AH03142: ap_thread_create: unable to create worker thread
libgcc_s.so.1 must be installed for pthread_cancel to work
[Fri Jul 12 15:04:04.735368 2024] [core:notice] [pid 19646:tid 4149040832] AH00052: child pid 32063 exit signal Aborted (6)
[Fri Jul 12 15:04:04.738624 2024] [mpm_worker:alert] [pid 32105:tid 3924761520] (12)Cannot allocate memory: AH03142: ap_thread_create: unable to create worker thread
libgcc_s.so.1 must be installed for pthread_cancel to work
[Fri Jul 12 15:04:05.737141 2024] [core:notice] [pid 19646:tid 4149040832] AH00052: child pid 32105 exit signal Aborted (6)
[Fri Jul 12 15:04:05.740622 2024] [mpm_worker:alert] [pid 32147:tid 3924761520] (12)Cannot allocate memory: AH03142: ap_thread_create: unable to create worker thread
libgcc_s.so.1 must be installed for pthread_cancel to work
[Fri Jul 12 15:04:06.739077 2024] [core:notice] [pid 19646:tid 4149040832] AH00052: child pid 32147 exit signal Aborted (6)
[Fri Jul 12 15:04:06.742500 2024] [mpm_worker:alert] [pid 32189:tid 3924761520] (12)Cannot allocate memory: AH03142: ap_thread_create: unable to create worker thread
libgcc_s.so.1 must be installed for pthread_cancel to work
[Fri Jul 12 15:04:07.740898 2024] [core:notice] [pid 19646:tid 4149040832] AH00052: child pid 32189 exit signal Aborted (6)
[Fri Jul 12 15:04:07.744130 2024] [mpm_worker:alert] [pid 32231:tid 3924761520] (12)Cannot allocate memory: AH03142: ap_thread_create: unable to create worker thread
libgcc_s.so.1 must be installed for pthread_cancel to work
[Fri Jul 12 15:04:08.742689 2024] [core:notice] [pid 19646:tid 4149040832] AH00052: child pid 32231 exit signal Aborted (6)
[Fri Jul 12 15:04:08.745885 2024] [mpm_worker:alert] [pid 32274:tid 3924761520] (12)Cannot allocate memory: AH03142: ap_thread_create: unable to create worker thread
libgcc_s.so.1 must be installed for pthread_cancel to work
[Fri Jul 12 15:04:09.744445 2024] [core:notice] [pid 19646:tid 4149040832] AH00052: child pid 32274 exit signal Aborted (6)
[Fri Jul 12 15:04:09.747607 2024] [mpm_worker:alert] [pid 32316:tid 3924761520] (12)Cannot allocate memory: AH03142: ap_thread_create: unable to create worker thread
libgcc_s.so.1 must be installed for pthread_cancel to work

Thousands of such lines. And the weird thing is that libgcc_s.so.1 does exist on my system. I guess I missed that library when I installed Apache from source. I'm not even curious as to why this is an issue, nor why libgcc_s.so.1 is needed. At this point I'm like “give the system what it wants, not what I want to give it” (which I can't say in polite company). Running configure --help didn't show any obvious means of enabling the use of libgcc_s.so.1, and not wanting to dive deep into a maze of twisty Makefiles, all alike, I decided on the next best thing. I went to the Apache build directory, deleted the existing httpd binary and ran make. This gave me the final line used to build the executable:

/home/spc/apps/httpd-2.4.54/srclib/apr/libtool --silent --mode=link gcc -std=gnu99  -g -O2 -pthread -o httpd  modules.lo buildmark.o -export-dynamic server/libmain.la modules/core/libmod_so.la modules/http/libmod_http.la server/mpm/worker/libworker.la os/unix/libos.la -L/usr/local/lib -lpcre    /home/spc/apps/httpd-2.4.54/srclib/apr-util/libaprutil-1.la -lexpat /home/spc/apps/httpd-2.4.54/srclib/apr/libapr-1.la -luuid -lrt -lcrypt -lpthread -ldl

I then reran just this command, but with /lib/libgcc_s.so.1 added before the -pthread option—that is, the same libtool line with the library spliced in (reconstructed here from the command above):
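/home/spc/apps/httpd-2.4.54/srclib/apr/libtool --silent --mode=link gcc -std=gnu99  -g -O2 /lib/libgcc_s.so.1 -pthread -o httpd  modules.lo buildmark.o -export-dynamic server/libmain.la modules/core/libmod_so.la modules/http/libmod_http.la server/mpm/worker/libworker.la os/unix/libos.la -L/usr/local/lib -lpcre    /home/spc/apps/httpd-2.4.54/srclib/apr-util/libaprutil-1.la -lexpat /home/spc/apps/httpd-2.4.54/srclib/apr/libapr-1.la -luuid -lrt -lcrypt -lpthread -ldl

A quick check afterwards: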

[spc]brevard:~/apps/httpd-2.4.54>ldd .libs/httpd
	linux-gate.so.1 =>  (0xf77e1000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf77cf000)
	libpcre.so.1 => /usr/local/lib/libpcre.so.1 (0xf77b1000)
	libaprutil-1.so.0 => /usr/local/apache2/lib/libaprutil-1.so.0 (0xf778c000)
	libexpat.so.0 => /usr/lib/libexpat.so.0 (0xf776e000)
	libapr-1.so.0 => /usr/local/apache2/lib/libapr-1.so.0 (0xf7742000)
	libuuid.so.1 => /lib/libuuid.so.1 (0xf773f000)
	librt.so.1 => /lib/tls/librt.so.1 (0xf772a000)
	libcrypt.so.1 => /lib/libcrypt.so.1 (0xf76fc000)
	libpthread.so.0 => /lib/tls/libpthread.so.0 (0xf76ea000)
	libdl.so.2 => /lib/libdl.so.2 (0xf76e6000)
	libc.so.6 => /lib/tls/libc.so.6 (0xf75bb000)
	/lib/ld-linux.so.2 (0xf77e2000)

and it worked—libgcc_s.so.1 is linked in!

I installed the new Apache executable, restarted Apache and … it didn't crash! I then left it to run.

That was a week ago, and so far, so good. The error log is rotated weekly, and for the past week no such errors have appeared. Now if only I could nuke from orbit the crawlers sending in silly requests like /cgi-bin/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/%%32%65%%32%65/bin/sh (those are double-encoded “../” sequences—a path-traversal probe)—they just clutter up the error log with “invalid URI path” errors.

Sigh.

Thursday, July 04, 2024

Something something Independence something something Declaration something something Blow Stuff Up

It just now became the Fourth of July™ and earlier this evening (which was technically yesterday, Wednesday, July 3rd) some people were already illegally lighting off fireworks. I guess as a random quality test of the munitions they've acquired. So to them, and to you, I wish you a happy and safe National Blow Stuff Up Day!

[Amateurs: There's a reason professionals exist.]

And for the curious, the above image was taken by me 21 years ago. Yikes!

And it was 13 years ago when my hair nearly caught on fire, so when I say “be safe,” I mean it.

Sunday, June 30, 2024

Four lines of code … it was four lines of code

It was bugging me.

I checked last night, and found the following:

CPU utilization after a week
        CPU    requests
gopher  28:37  8360
gemini  6:02   24345

How can a TCP service take more time than a TLS service? Especially when TLS runs over TCP? And when TLS generates six times as many packets as TCP?

There's some difference in how I handle TCP and TLS that's causing this. Profiling the code didn't reveal much, as my attempts kept showing that gopher should be using less CPU than Gemini, so it was just a matter of staring at code and thinking hard.

It wasn't easy. Yes, there are some differences between the TCP interface and the TLS interface, so it was a matter of going through the code, bringing them closer in implementation and seeing if that fixed things. I would fix the code, run a test, see it not make a difference, and try again. Eventually, I came across this bit of code:

-- ---------------------------------------------------------------------
-- It might be a bug *somewhere*, but on Linux, this is required to get
-- Unix domain sockets to work with the NFL driver.  There's a race
-- condition where writing data then calling close() may cause the
-- other side to receive no data.  This does NOT appear to happen with
-- TCP sockets, but this doesn't hurt the TCP side in any case.
-- ---------------------------------------------------------------------

while ios.__socket.sendqueue and ios.__socket.sendqueue > 0 do
  nfl.SOCKETS:update(ios.__socket,'w')
  coroutine.yield()
end

This bit of code wasn't in the TLS implementation, and as the comment says, it “shouldn't” hurt the TCP path, but hey, it was only needed for local (or Unix) domain sockets to begin with, so let me try removing that bit of code and run a test.

Test before removing the code, 275 requests; time in seconds
  CPU real user sys
gopher 18 46.726 13.971 4.059
gemini 4 107.928 3.951 0.322
Test after removing the code, 275 requests; time in seconds
  CPU real user sys
gopher 0 52.403 0.635 0.185
gemini 4 103.290 3.957 0.285

THAT'S more like it!

Now, about that code … this is my best guess as to what was going on.

The sendqueue field is a Linux-only field, so I'm checking to see that it exists before checking its value (if only to avoid the “attempt to compare number with nil” Lua error on non-Linux systems). Reading it returns the number of bytes the kernel has yet to send; while there are any, we switch the socket to trigger an event when we can write to it, and yield. Even worse, each check isn't just a simple variable fetch, but ends up being an ioctl() system call. Twice!
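Had I kept the loop, a cheap mitigation would have been to hoist the read so each pass pays for one ioctl() instead of two—shown here purely to illustrate the cost, since the real fix was deleting the loop:

local pending = ios.__socket.sendqueue -- one ioctl(), not two
while pending and pending > 0 do
  nfl.SOCKETS:update(ios.__socket,'w')
  coroutine.yield()                    -- wait until the socket is writable
  pending = ios.__socket.sendqueue     -- refresh after being resumed
end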

Now the “write” event code:

if event.write then
  if #ios.__output > 0 then
    local bytes,err = ios.__socket:send(nil,ios.__output)
    if err == 0 then
      ios.__wbytes = ios.__wbytes + bytes
      ios.__output = ios.__output:sub(bytes + 1,-1) -- drop what was sent
    else
      syslog('error',"socket:send() = %s",errno[err])
      nfl.SOCKETS:remove(ios.__socket)
      nfl.schedule(ios.__co,false,errno[err],err)
    end
  end

  if #ios.__output == 0 then             -- nothing (left) to send?
    nfl.SOCKETS:update(ios.__socket,'r') -- trigger on readability again
    nfl.schedule(ios.__co,true)          -- and resume the coroutine
  end
end

There's probably no data to be sent (ios.__output is empty), so we immediately switch the socket back to trigger for readability, then resume the coroutine handling the socket. It goes back to the while loop, sets the socket back to trigger on writability, and the two sides keep bouncing back and forth like that until the kernel buffer has been sent.

As I wrote, this fixed an issue I had with local domain sockets—it wasn't an issue with an IP socket, and in fact, was the wrong thing to do for an IP socket.

And as it is with these types of bugs, finding the root cause is not trivial, but the fix almost always is.



Thursday, June 27, 2024

IT HAS BEGUN! (“It” being Christmas-themed advertising)

Last night at the Cracker Barrel, Bunny and I saw store displays nominally for Halloween. But nestled in the middle of the display was an evergreen tree with ornaments hanging off it. Try as you might, Cracker Barrel, there is no such thing as a Halloween Tree (no disrespect, Mr. Bradbury)—it's still a Christmas Tree.

And then today I saw what is, to me, the first official advertisement for Christmas—a trailer for a movie called “Red One,” a muddy-looking CGI slugfest in which Santa Claus has been kidnapped and needs to be saved.

The movie isn't even slated to be released until November! Couldn't they have at least waited until October to start advertising it? No?

Sigh.

Sunday, June 23, 2024

The case of the weird profiling results

Okay, I'm flummoxed.

After a bit of thought, I had an idea on how to determine where the CPU time is going on my gopher server—I have at least two working hypotheses that I can test: wrapping text and translating links. So, to test these, I will get a list of requests made since the logs rotated earlier today, then run them through the gopher server (a second instance set up on my public server just for these tests) to get a baseline; then remove the link translations and run the requests again; then remove the text wrapping and run the requests a third time and see what I get.
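Replaying the logged requests doesn't take much—something like this sketch (assuming LuaSocket; requests.txt holds one selector per line pulled from the logs, and 7070 is a hypothetical test port—this isn't my actual harness):

local socket = require "socket" -- LuaSocket

for selector in io.lines("requests.txt") do
  local conn = assert(socket.connect("localhost",7070))
  conn:send(selector .. "\r\n") -- a gopher request: selector, then CRLF
  conn:receive("*a")            -- drain the reply; the server closes when done
  conn:close()
end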

For the first test, the expected result is that the test gopher server shows the same CPU utilization as the one that normally runs. It makes sense, right? The normal one racks up x CPU utilization over y requests, so if I feed the test gopher server the same y requests (even though all the requests now come from the same location), it should show the same CPU utilization. Or at least close enough to show that I'm on the right track.

When I pulled the requests out of the log files and checked the current gopher server, it had received 332 requests and racked up 0:47 in CPU time. I set up the test gopher server (the only changes to the configuration being a different log file and a different port, which meant I didn't have to run it as root), made the same 332 requests to it, and got a CPU time of 0:22.

What?

I reran this test a few times and got the same results each time—0:22.

That's less than half the time of the normal gopher server with the same number of requests. If it were one or two seconds off, hey, close enough. But half?

It's not making sense.

But I decided to continue and run the other tests. First, I removed the code that does the link translation, reran the requests and got a CPU time of 0:20.

Okay, that tells me two things—one, the link translations do take time, but two, I don't think it's enough to explain the CPU utilization of the gopher server. Maybe. On a whim, I decided to change the third test to not even bother with processing blog entries—I modified the blog entry handler to just return. Given that ¾ of the requests are to the blog, this should make it run much faster and use less CPU time.

I got a CPU time of 0:18.

I don't know what to think. I would have expected this to be 0:05 or 0:06, given that the content for 75% of the requests would no longer be generated. Something weird is going on.

It was time to take a step back. I went back to the original benchmark, but instead of counting lines executed, I decided to count Lua VM instructions for the decently sized request.

Lua VM cycles to serve a request
gopher 61622
gemini 67401
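
Counting instructions like this is easy with Lua's count hook—something like this sketch (serve_request() is a hypothetical stand-in for whatever serves one request, not my actual code):

local count = 0
debug.sethook(function() count = count + 1 end,"",1) -- hook fires on every VM instruction
serve_request(selector) -- hypothetical: serve the decently sized request
debug.sethook()         -- remove the hook
print("Lua VM instructions:",count)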

Okay, the gopher server is clearly executing fewer Lua VM instructions than the Gemini server. Could I gain any insight from profiling at the C level? I had already done most of the work to profile both the gopher and Gemini servers. To make sure I got enough data, I ran the decently sized request three times against each server.

port70 (gopher server)—Each sample counts as 0.01 seconds.
% time  cumulative seconds  self seconds  calls  self ms/call  total ms/call  name
13.79 0.04 0.04 185781 0.00 0.00 luaV_execute
10.34 0.07 0.03 734588 0.00 0.00 index2value
6.90 0.09 0.02 398225 0.00 0.00 luaS_new
6.90 0.11 0.02 45028 0.00 0.00 luaH_newkey
3.45 0.12 0.01 1041939 0.00 0.00 yymatchChar
3.45 0.13 0.01 924394 0.00 0.00 luaG_traceexec
3.45 0.14 0.01 503889 0.00 0.00 yyText
3.45 0.15 0.01 260252 0.00 0.00 luaD_precall
3.45 0.16 0.01 245893 0.00 0.00 mainpositionTV
3.45 0.17 0.01 201753 0.00 0.00 auxgetstr
3.45 0.18 0.01 191931 0.00 0.00 yy_S
3.45 0.19 0.01 185373 0.00 0.00 equalkey
3.45 0.20 0.01 134088 0.00 0.00 yyDo
3.45 0.21 0.01 129432 0.00 0.00 yy_CHAR
3.45 0.22 0.01 101937 0.00 0.00 reverse
3.45 0.23 0.01 34759 0.00 0.00 luaH_getn
3.45 0.24 0.01 4473 0.00 0.00 getfirst
3.45 0.25 0.01 1625 0.01 0.01 traverseproto
3.45 0.26 0.01 834 0.01 0.01 strcore_wrapt
3.45 0.27 0.01 61 0.16 0.16 checkloops
3.45 0.28 0.01 2 5.00 5.00 deletelist
3.45 0.29 0.01       cclasscmp
0.00 0.29 0.00 924377 0.00 0.00 luaD_hook
0.00 0.29 0.00 391575 0.00 0.00 yymatchString
0.00 0.29 0.00 358374 0.00 0.00 luaH_getshortstr
0.00 0.29 0.00 261889 0.00 0.00 prepCallInfo
0.00 0.29 0.00 261296 0.00 0.00 luaD_poscall
GLV-1.12556 (Gemini server)—Each sample counts as 0.01 seconds.
% time  cumulative seconds  self seconds  calls  self ms/call  total ms/call  name
9.38 0.03 0.03 1011065 0.00 0.00 luaG_traceexec
9.38 0.06 0.03 1011056 0.00 0.00 luaD_hook
7.81 0.09 0.03 204707 0.00 0.00 luaV_execute
6.25 0.10 0.02 443861 0.00 0.00 luaS_new
6.25 0.12 0.02 396470 0.00 0.00 luaH_getshortstr
3.12 0.14 0.01 668980 0.00 0.00 index2value
3.12 0.14 0.01 391575 0.00 0.00 yymatchString
3.12 0.15 0.01 271008 0.00 0.00 mainpositionTV
3.12 0.17 0.01 243043 0.00 0.00 luaD_precall
3.12 0.17 0.01 242834 0.00 0.00 moveresults
3.12 0.18 0.01 217137 0.00 0.00 ccall
3.12 0.20 0.01 202203 0.00 0.00 hookf
3.12 0.20 0.01 129432 0.00 0.00 yy_CHAR
3.12 0.21 0.01 56698 0.00 0.00 llex
3.12 0.23 0.01 39543 0.00 0.00 internshrstr
3.12 0.23 0.01 30301 0.00 0.00 luaM_malloc_
3.12 0.24 0.01 23821 0.00 0.00 luaH_realasize
3.12 0.26 0.01 5906 0.00 0.00 luaV_concat
3.12 0.27 0.01 4149 0.00 0.00 GCTM
3.12 0.28 0.01 1872 0.01 0.02 yy_PCDATA
3.12 0.28 0.01 928 0.01 0.01 match
3.12 0.29 0.01 787 0.01 0.01 luaF_newLclosure
3.12 0.30 0.01 595 0.02 0.02 luaK_int
3.12 0.32 0.01 59 0.17 0.17 l_strcmp
1.56 0.32 0.01 12 0.42 0.42 luaV_finishOp
0.00 0.32 0.00 1041939 0.00 0.00 yymatchChar

This is not easy to interpret. As expected, the Lua VM shows up in the top spots for both, but nothing really stands out. It is nice to see that yymatchChar and yymatchString (both in the HTML parsing module) are called the same number of times in each (expected), but the times spent in them differ. The Lua VM function index2value is called a different number of times, but that might be due to code differences. I think the higher percentage of time in the gopher server might be due to its taking less time overall? For instance, in the gopher server, deletelist has a self ms/call of 5, but on the Gemini server it has a self ms/call of 0.52—is the garbage collector being called more often in the gopher server? Or is it that, with less time overall, it shows up with a higher per-call time? It's only called twice in both codebases. I'm not sure how to interpret this.

Somehow, the gopher server is doing less work than the Gemini server, yet accumulating more CPU time, despite getting about 10% as many requests as the Gemini server. I don't understand how that can be.

I think I'll leave things alone for now. I went several years with a not-optimized gopher server, so I think I can live with this new version for now.


How does TLS use less CPU than plain TCP, part II

I noticed over the past couple of days that the CPU utilization was similar between the two services, but looking at the logs, it seemed my gopher server got hit by several bots. So last night, just as the logs reset (a weekly job), I decided to restart the two services so as to match up CPU utilization and requests. Twelve hours later and …

CPU utilization after 12 hours
        CPU   requests
gopher  0:28  175
gemini  0:26  1744

A week and a half after calling “Mission Complete,” and apparently it's not. Even worse, my gopher server is using a bit more CPU than the Gemini server while getting one-tenth the number of requests. Something is off here.

I checked the requests, and both services receive over 90% of their requests for my blog, so I'm confident it's something to do with that. Both use the same request framework, both use the same code to read blog entries, and both use the same code to parse the HTML, so it's not that. The only difference left is the formatting of each entry. The gopher server does wrap the text to 80 columns (not needed for the Gemini server), and there's quite a bit more code to deal with fixing the links … are those the causes? It's hard to determine what's causing the excessive CPU utilization.
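
For a sense of the extra per-entry work the gopher path does, here's a naive greedy word-wrap (just the shape of the work—this is not the actual org.conman.string.wrapt code):

-- wrap s to at most `width` columns, breaking at spaces
local function wrap(s,width)
  local lines,line = {},""
  for word in s:gmatch("%S+") do
    if #line > 0 and #line + #word + 1 > width then
      lines[#lines + 1] = line -- current line is full; start a new one
      line = word
    elseif #line > 0 then
      line = line .. " " .. word
    else
      line = word
    end
  end
  if #line > 0 then lines[#lines + 1] = line end
  return table.concat(lines,"\n")
end

print(wrap("How can a TCP service take more time than a TLS service?",20))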

I'm going to have to think about how to determine the issue.

Heck, I'm going to have to think about whether this is even worth determining. I mean, the CPU utilization is better than it was, but it's still puzzling.

Monday, June 17, 2024

When a file doesn't exist, yet it's right there

For reasons, I'm downloading a file from several hundred different domains. The results start trickling in and I decide to take a look at one of them:

[spc]lucy:/tmp/cache>ls 04d.co
04d.co
[spc]lucy:/tmp/cache>more 04d.co
Error: File or directory not found!
[spc]lucy:/tmp/cache>

Um … what?

Okay. There could be an invisible character or two in the filename.

[spc]lucy:/tmp/cache>echo * | hex | more
00000000: 30 34 64 2E 63 6F 20 31 34 33 36 2E 6E 69 6E 6A 04d.co 1436.ninj
...
[spc]lucy:/tmp/cache>

Nope. But what could cause the shell not to find the file when the file is right there? The name shouldn't matter. But I can't view it.

[spc]lucy:/tmp/cache>cat 04d.co
Error: File or directory not found!
[spc]lucy:/tmp/cache>

Okay, what if I try to move it somewhere?

[spc]lucy:~/tmp>cp /tmp/cache/04d.co .
[spc]lucy:~/tmp>more 04d.co
Error: File or directory not found!
[spc]lucy:~/tmp>

I can list the file:

[spc]lucy:~/tmp>ls -l 04d.co
-rw-r--r--  1 spc spc 37 Jun 17 22:25 04d.co
[spc]lucy:~/tmp>

Let me try one last thing …

[spc]lucy:~/tmp>hex 04d.co
00000000: 45 72 72 6F 72 3A 20 46 69 6C 65 20 6F 72 20 64 Error: File or d
00000010: 69 72 65 63 74 6F 72 79 20 6E 6F 74 20 66 6F 75 irectory not fou
00000020: 6E 64 21 0D 0A                                  nd!..           

I was downloading these files from the Internet. This particular site didn't have the file. The contents of the file are the error message.

Head. Meet desk.

This reminds me of college. Back then, the finger command was popular; when someone used it on you, it would attempt to locate a file called .plan in your home directory and display it. I made a joke .plan file that read:

Bus error - core dumped

You have new mail.

I can't count the number of times someone tried to finger me, only to attempt to remove a non-existent core file and impulsively check their email.

Wednesday, June 12, 2024

Just when I thought it was safe to run a gopher server

It appeared I had traded the problem of high CPU usage for a memory leak. Last night I saw:

USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START TIME COMMAND
gopher   31875  0.2 13.3 2174076 2171960 ttyp4 S+   Jun09 7:08 lua port70.lua /home/spc/gopher/config.lua

(For the record—the VSZ and RSS values are the number of 4,096-byte blocks of memory, for various reasons beyond the scope of this post) and just before lunch today:

USER       PID %CPU %MEM     VSZ     RSS TTY   STAT START TIME COMMAND
gopher    1572  0.2 11.1 1809672 1807644 ttyp4 S+   04:30 1:07 lua port70.lua /home/spc/gopher/config.lua

Not good. Especially when it's normally like:

USER       PID %CPU %MEM   VSZ  RSS TTY   STAT START TIME COMMAND
gopher   20711  0.2  0.0 10588 9000 ttyp4 S+   17:52 0:06 lua port70.lua /home/spc/gopher/config.lua

And then there was this in the logs:

Jun 12 09:47:54	daemon	err	71.19.142.20	gopher	CRASH: coroutine thread: 0x8400404 dead: not enough memory
Jun 12 09:47:54	daemon	err	71.19.142.20	gopher	CRASH: thread: 0x8400404: stack traceback:
Jun 12 09:47:54	daemon	err	71.19.142.20	gopher	CRASH: thread: 0x8400404:  [C]: in function 'org.conman.string.wrapt'
Jun 12 09:47:54	daemon	err	71.19.142.20	gopher	CRASH: thread: 0x8400404:  ...e/lua/5.4/org/conman/app/port70/handlers/blog/format.lua:34: in upvalue 'wrap_text'
Jun 12 09:47:54	daemon	err	71.19.142.20	gopher	CRASH: thread: 0x8400404:  ...e/lua/5.4/org/conman/app/port70/handlers/blog/format.lua:119: in upvalue 'run_flow'
Jun 12 09:47:54	daemon	err	71.19.142.20	gopher	CRASH: thread: 0x8400404:  ...e/lua/5.4/org/conman/app/port70/handlers/blog/format.lua:598: in function 'org.conman.app.port70.handlers.blog.format'
Jun 12 09:47:54	daemon	err	71.19.142.20	gopher	CRASH: thread: 0x8400404:  ...al/share/lua/5.4/org/conman/app/port70/handlers/blog.lua:227: in function <...al/share/lua/5.4/org/conman/app/port70/handlers/blog.lua:210>
Jun 12 09:47:54	daemon	err	71.19.142.20	gopher	CRASH: thread: 0x8400404:  (...tail calls...)
Jun 12 09:47:54	daemon	err	71.19.142.20	gopher	CRASH: thread: 0x8400404:  port70.lua:238: in function <port70.lua:205>

Yeah, “not enough memory” while running the new C-based text-wrapping function. I added a few checks to see if wrapt() was updating the character indices correctly (and to log if it wasn't), then went to lunch. Afterwards, I checked, and lo, memory was still growing, and none of the checks I added had triggered.

Hmmm. Time to take down the gopher server and debug some code.

Sure enough, I found the one entry that went off the rails. What followed was a few hours of troubleshooting to find out why that one entry (and only that entry) blew up. It came down to a difference in semantics between Lua and C (no, it had nothing to do with C's 0-based indexing and Lua's 1-based indexing). Basically, the Lua code looked like:

local breakhere -- this is nil

...

if ctype == 'space' then
  breakhere = i -- this is now an integer
  ...
end

if breakhere then
  table.insert(result,s:sub(front,breakhere - 1))
  ...
end

breakhere = nil

In Lua, only the values nil and false are considered “false”—a 0 value is “true” in Lua. In C, any 0 value is considered “false.” I knew this when translating the code, but it never occurred to me that the break point of a line could be at the start of a line.

Until it was at the start of a line in that one entry—the C code treated that break point as “no break point,” went into an infinite loop trying to wrap the text, and the memory usage grew until it consumed the computer.
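
To make the trap concrete, a tiny illustration (hypothetical—not the actual wrapt() code):

local breakhere = 0
if breakhere then
  print("0 is true in Lua") -- this prints
end
-- whereas the analogous test in C takes the false branch:
--   int breakhere = 0;
--   if (breakhere) { /* never reached */ }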

Sigh.

It was easy enough to fix once the problem was diagnosed. And maybe now things will get back to normal.

Obligatory Picture

[“I am NOT a number, I am … a Q-CODE!”]

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion (e.g., https://boston.conman.org/2000/08). You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.