Wednesday, May 06, 2020
The root cause of a crash
It happened yet again, and this time I was able to get the stack traces! Woot! But as is the case with most bugs, the fix was laughably trivial once found. The trouble is always in finding it, and all one can hope for is to learn for next time.
So the normal pattern of requests came in, and the program crashed (only this time with the stack traces I was hoping to capture):
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/" bytes=2505 subject="" issuer=""
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0001" bytes=283 subject="" issuer=""
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0002" bytes=249 subject="" issuer=""
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0003" bytes=129 subject="" issuer=""
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0004" bytes=205 subject="" issuer=""
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0005" bytes=112 subject="" issuer=""
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0006" bytes=248 subject="" issuer=""
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0007" bytes=89 subject="" issuer=""
remote=XXXXXXXXXXXX status=20 request="gemini://gemini.conman.org/test/torture/0008" bytes=360 subject="" issuer=""
remote=XXXXXXXXXXXX status=59 request="" bytes=16 subject="" issuer=""

__gc stack traceback:
	[C]: in method 'close'
	/usr/local/share/lua/5.3/org/conman/nfl/tls.lua:126: in method 'close'
	GLV-1.12556.lua:288: in function <GLV-1.12556.lua:232>

write stack traceback:
	[C]: in method 'write'
	/usr/local/share/lua/5.3/org/conman/nfl/tls.lua:92: in function </usr/local/share/lua/5.3/org/conman/nfl/tls.lua:91>
	(...tail calls...)
	GLV-1.12556.lua:206: in upvalue 'reply'
	GLV-1.12556.lua:399: in function <GLV-1.12556.lua:232>
This series of requests exists to test Gemini client programs, and ever since I patched the memory leak, the program has always crashed after test #8 (first clue). This test sees if the client can properly handle relative links:
Title: Link-full path, parent directory, no text
Status: 20
Content-Type: text/gemini

At this point, if you are writing a client that manipulates URLs directly, you will need to look at RFC-3986, section 5.2.4 to handle these relative paths correctly. Things are about to get really messy. If you want to skip ahead to the MIME portion, select the second link instead of the first.

=> /test/../test/torture/0009
=> 0011
Heck, I used two different clients (one I wrote, and the Gemini Gateway), but I was never able to reproduce the error (second clue), so I never bothered to look deeper into this. It was only when I had the stack traces that I was able to track the issue down to this bit of code (there's a bit more here to show the context, but the actual bug is in the last block of code below):
if not loc.host then
  log(ios,59,"",reply(ios,"59\t",MSG[59],"\r\n"))
  ios:close()
  return
end

if loc.scheme ~= 'gemini'
or loc.host   ~= CONF.network.host
or loc.port   ~= CONF.network.port then
  log(ios,59,"",reply(ios,"59\t",MSG[59],"\r\n"))
  ios:close()
  return
end

-- ---------------------------------------------------------------
-- Relative path resolution is the domain of the client, not the
-- server.  So reject any requests with relative path elements.
-- ---------------------------------------------------------------

if loc.path:match "/%.%./" or loc.path:match "/%./" then
  log(ios,59,"",reply(ios,"59\t",MSG[59],"\r\n"))
  ios:close()
end
These are all error paths, and for each one, I close the connection and return from the handler—except for the case of a relative path, where the return is missing!
Head, meet desk.
The clues I had were pointing right at the problem,
but I just couldn't see it.
The reason I couldn't reproduce the issue is that the two clients I used were implemented properly,
so of course the check for a relative path wouldn't trigger.
And the reason why I never found this before was because the memory leak prevented the __gc()
method from running,
which meant the TLS context pointer would never be overwritten with NULL,
thus,
no crash.
Fixing that bug revealed this bug—one that has existed since I originally wrote the code.
A test by hand
(by running ./openssl s_client -connect gemini.conman.org:19651
and typing in a request with a relative path)
quickly revealed I had found the root cause of the crash.
Now,
it might occur to some to ask why I didn't write the Ltls_close()
function like this:
static int Ltls_close(lua_State *L)
{
  struct tls **tls = luaL_checkudata(L,1,TYPE_TLS);

  if (*tls == NULL)
  {
    lua_pushinteger(L,TLS_ERROR);
    return 1;
  }

  int rc = tls_close(*tls);
  if ((rc != TLS_WANT_POLLIN) && (rc != TLS_WANT_POLLOUT))
    Ltls___gc(L);
  lua_pushinteger(L,rc);
  return 1;
}
That is—to have the function check for a NULL
pointer?
Or to have all the routines in the TLS module check for a NULL
pointer?
That's a conscious decision on my part because defensive code hides real bugs
(and in this case,
it most definitely would have hidden the bug).
I came to this conclusion having read Writing Solid Code years ago
(and it was one of two books that radically changed how I approach programming—the other one being
Thinking Forth; it's well worth reading even if you don't program in Forth).
The book presents the case for why defensive code can hide bugs far better than I can here.
And even if it took a week to track down the bug,
it was worth it in the end.