The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Monday, May 04, 2020

On the plus side, it happens the same way each time; on the down side, it doesn't happen often

I'm still trying to find the cause of the crash. I made a small change (a switch from using lua_touserdata(), which can return NULL, to using luaL_checkudata(), which if it returns, will never return NULL; else it throws an error) which didn't help resolve the issue—the program still dumped core.

Both last week and today, it crashed in the same place in Ltls_write():

static int Ltls_write(lua_State *L)
{
  struct tls **tls  = luaL_checkudata(L,1,TYPE_TLS);
  size_t       len;
  char const  *data = luaL_checklstring(L,2,&len);
  
  lua_pushinteger(L,tls_write(*tls,data,len));
  return 1;
}

The variable tls was itself not NULL, but the value it pointed to was! There's only one place in the code where it's set to NULL—in the __gc metamethod (the bits with the Lua registry is to deal with callbacks from libtls which knows nothing about Lua):

static int Ltls___gc(lua_State *L)
{
  struct tls **tls = luaL_checkudata(L,1,TYPE_TLS);
  if (*tls != NULL)
  {
    lua_pushlightuserdata(L,*tls);
    lua_pushnil(L);
    lua_settable(L,LUA_REGISTRYINDEX);
    tls_free(*tls);
    *tls = NULL;
  }
  return 0;
}

This is only called when the object no longer has any references to it. I'm not aware of any place where this value gets overwritten (but I'm not discounting it outright quite yet). The crashes started after I fixed the leak, which turned out to be this function (this is the pre-fix version):

static int Ltls_close(lua_State *L)
{
  lua_pushinteger(L,tls_close(*(struct tls **)lua_touserdata(L,1)));
  return 1;
}

The leak happened because I left the reference to the object in the Lua registry, so the garbage collector would never reclaim the resource. The fix itself was easy:

static int Ltls_close(lua_State *L)
{
  int rc = tls_close(*(struct tls **)luaL_checkudata(L,1,TYPE_TLS));
  if ((rc != TLS_WANT_POLLIN) && (rc != TLS_WANT_POLLOUT))
    Ltls___gc(L);
  lua_pushinteger(L,rc);
  return 1;
}

Yes, it's slightly complicated by the fact that closing a TLS connection might require a few more packets of data between the two endpoints to close out the protocol, but that's taken into account by the framework I'm using. At least, I hope it is.

So my theory right now is that the __gc metamethod is being called before the connection is fully closed, but to confirm that, I somehow need to capture that scenario. It's consistent in that it seems to be the same client making the same requests causing the same crash. On the down side, I have no idea who runs the client, or where it is (other than an IP address). And if I make the same requests with my client, it doesn't crash, so I can't reproduce it.

And there's no telling when it will happen again.

I know it's just a small detail I'm missing on my side that's causing the issue, but I just can't seem to locate the detail. To that end, I modified Ltls___gc() and Ltls_write() to attach some traceback information to see if my theory that Ltls___gc() is being called too early. Lua makes this pretty easy to do. I modified Ltls___gc():

static int Ltls___gc(lua_State *L)
{
  struct tls **tls = luaL_checkudata(L,1,TYPE_TLS);
  if (*tls != NULL)
  {
    luaL_traceback(L,L,"__gc",0);
    lua_setuservalue(L,1);
    lua_pushlightuserdata(L,*tls);
    lua_pushnil(L);
    lua_settable(L,LUA_REGISTRYINDEX);
    tls_free(*tls);
    *tls = NULL;
  }
  return 0;
}

I get a traceback message and assign it to the object about to be cleaned. In Lua, a userdata goes through the garbage collection twice—once to clean the resources that Lua isn't aware of (that's what this function does) and the second time (in the next garbage collection cycle) to reclaim the memory Lua uses to store the userdata. Then in Ltls_write() I have:

static int Ltls_write(lua_State *L)
{
  struct tls **tls  = luaL_checkudata(L,1,TYPE_TLS);
  size_t       len;
  char const  *data = luaL_checklstring(L,2,&len);
  
  if (*tls == NULL)
  {
    lua_getuservalue(L,1);
    if (!lua_isnil(L,-1))
      syslog(LOG_NOTICE,"%s",lua_tostring(L,-1));
    luaL_traceback(L,L,"write",0);
    syslog(LOG_NOTICE,"%s",lua_tostring(L,-1));
    lua_pop(L,2);
    lua_pushinteger(L,-1);
  }
  else
    lua_pushinteger(L,tls_write(*tls,data,len));
  return 1;
}

This will (hopefully, knock on formica) give me two tracebacks—one when the garbage collection is called (or Ltls_close() is called) and then a write is attempted, and that will point me to the detail I'm missing.

The other theory is that I'm overwriting memory somewhere and that's a harder issue to track down.

Obligatory Picture

[It's the most wonderful time of the year!]

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2020 by Sean Conner. All Rights Reserved.