The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Wednesday, May 06, 2020

The root cause of a crash

Found it!

It happend yet again and this time, I was able to get the stack traces! Woot! But as is the case with most bugs, the fix was laughably trivial once found. The trouble is always in finding it, and all one can hope for is to learn for next time.

So the normal pattern of requests came in and the program crashes (only this time with the stack traces I was hoping to capture):

remote=XXXXXXXX­XXXX status=20 request="gemini://gemini.conman.org/test/torture/" bytes=2505 subject="" issuer=""
remote=XXXX­XXXXXXXX­ status=20 request="gemini://gemini.conman.org/test/torture/0001" bytes=283 subject="" issuer=""
remote=XXXXXXXX­XXXX status=20 request="gemini://gemini.conman.org/test/torture/0002" bytes=249 subject="" issuer=""
remote=XXXX­XXXXXXXX­ status=20 request="gemini://gemini.conman.org/test/torture/0003" bytes=129 subject="" issuer=""
remote=XXXXXXXX­XXXX status=20 request="gemini://gemini.conman.org/test/torture/0004" bytes=205 subject="" issuer=""
remote=XXXX­XXXXXXXX­ status=20 request="gemini://gemini.conman.org/test/torture/0005" bytes=112 subject="" issuer=""
remote=XXXXXXXX­XXXX status=20 request="gemini://gemini.conman.org/test/torture/0006" bytes=248 subject="" issuer=""
remote=XXXX­XXXXXXXX­ status=20 request="gemini://gemini.conman.org/test/torture/0007" bytes=89 subject="" issuer=""
remote=XXXXXXXX­XXXX status=20 request="gemini://gemini.conman.org/test/torture/0008" bytes=360 subject="" issuer=""
remote=XXXX­XXXXXXXX­ status=59 request="" bytes=16 subject="" issuer=""
__gc stack traceback:  [C]: in method 'close'  /usr/local/share/lua/5.3/org/conman/nfl/tls.lua:126: in method 'close'  GLV-1.12556.lua:288: in function <GLV-1.12556.lua:232>
write stack traceback:  [C]: in method 'write'  /usr/local/share/lua/5.3/org/conman/nfl/tls.lua:92: in function </usr/local/share/lua/5.3/org/conman/nfl/tls.lua:91>  (...tail calls...)  GLV-1.12556.lua:206: in upvalue 'reply'  GLV-1.12556.lua:399: in function <GLV-1.12556.lua:232>

These series of requests exist to test Gemini client programs and since I patched the memory leak, the program has always crashed after test #8 (first clue). This test sees if the client can properly handle relative links:

Title: Link-full path, parent directory, no text
Status: 20
Content-Type: text/gemini

At this point, if you are writing a client that manipulates URLs directly,
you will need to look at RFC-3896, section 5.2.4 to handle these relative
paths correctly.  Things are about to get really messy.  If you want to skip
ahead to the MIME portion, select the second link instead of the first.

=> /test/../test/torture/0009
=> 0011

Heck, I used two different clients (one I wrote, and the Gemini Gateway), but I was never able to reproduce the error (second clue) so I never did bother to look deeper at this. It was only when I had the stack traces was I able to track the issue down to this bit of code (there's a bit more to see the context but the actual bug is in the last block of code below):

if not loc.host then
  log(ios,59,"",reply(ios,"59\t",MSG[59],"\r\n"))
  ios:close()
  return
end

if loc.scheme ~= 'gemini'
or loc.host   ~= CONF.network.host
or loc.port   ~= CONF.network.port then
  log(ios,59,"",reply(ios,"59\t",MSG[59],"\r\n"))
  ios:close()
  return
end

-- ---------------------------------------------------------------
-- Relative path resolution is the domain of the client, not the
-- server.  So reject any requests with relative path elements.
-- ---------------------------------------------------------------

if loc.path:match "/%.%./" or loc.path:match "/%./" then
  log(ios,59,"",reply(ios,"59\t",MSG[59],"\r\n"))
  ios:close()
end

These are error paths, and for each one, I close the connection and return from the handler, except for the case of a relative path!

Head, meet desk.

The clues I had were pointing right at the problem, but I just couldn't see it. The reason I couldn't reproduce the issue is that the two clients I used were implemented properly, so of course the check for a relative path wouldn't trigger. And the reason why I never found this before was because the memory leak prevented the __gc() method from running, which meant the TLS context pointer would never be overwritten with NULL, thus, no crash. Fixing that bug revealed this bug—one that has existed since I originally wrote the code.

A test by hand (by running ./openssl s_client -connect gemini.conman.org:19651 and typing in a request with a relative path) quickly revealed I had found the root cause of the crash.

Now, it might occur to some to ask why didn't I write the Ltls_close() function like this:

static int Ltls_close(lua_State *L)
{
  struct tls **luaL_checkudata(L,1,TYPE_TLS);
 
  if (*tls == NULL)
  {
    lua_pushinteger(TLS_ERROR);
    return 1;
  }
  
  int rc = tls_close(*(struct tls **)luaL_checkudata(L,1,TYPE_TLS));
  if ((rc != TLS_WANT_POLLIN) && (rc != TLS_WANT_POLLOUT))
    Ltls___gc(L);
  lua_pushinteger(L,rc);
  return 1;
}

That is—to have the function check for a NULL pointer? Or all the routines in the TLS module to check for a NULL pointer? That's a conscience decision on my part because defensive code hides real bugs (and in this case, it most definitely would have hidden the bug). I came to this conclusion having read Writing Solid Code years ago (and it was one of two books that radically changed how I approach programming—the other one being Thinking Forth; it's well worth reading even if you don't program in Forth). The book presents a solid case why defensive code can hide bugs much better than I can. And even if it took a week to track down the bug, it was worth it in the end.

Obligatory Picture

[The future's so bright, I gotta wear shades]

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.