The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, February 04, 2025

Concurrency is tricky

As I was writing the previous entry I got the nagging feeling that something wasn't quite right with the code. I got distracted yesterday helping a friend bounce programming issues off me, but after that, I was able to take a good look at the code and figured out what I did wrong.

Well, not “wrong” per se, the code as it worked—it's just that it could fail catastrophically in the right conditions (or maybe wrong conditions, depending upon your view).

But first, a bit about how my network sever framework works. The core bit of code is this:

local function eventloop(done_f)
  if done_f() then return end

  -- calculate and handle timeouts
  -- each coroutine that timed out is
  -- scheduled to run on the RUNQUEUE,
  -- with nil and ETIMEDOUT.

  SOCKETS:wait(timeout)
  for event in SOCKETS:events() do
    event.obj(event)
  end

  while #RUNQUEUE > 0 do
    -- run each coroutine in the run queue until
    -- it eithers yields or returns (meaning
    -- it's finished running).
  end

  return eventloop(done_f)
end

Details are emitted full gory details here but in general, the event loop calls a passed in function to check if we need to shut down, then calculates a timeout value while checking for coroutines that registered a timeout. If any did, we add the coroutine to a run queue with nil and ETIMEDOUT to inform the resuming coroutine that it timed out. Then we scan a set of network sockets for activity with SOCKETS:wait() (on Linux, this ends up calling epoll_wait(); BSDs with kqueue() and most other Unix systems with poll()). We then call the handling function for each event. These can end up creating new coroutines and scheduling coroutines to run (these will be added to the run queue). And then for each coroutine in the run queue, we run it. Lather, rinse, repeat. Simple enough.

Now, on to the code I presented. This code registers a function to run when the given UDP socket recieves a packet of data, and schedules a number of coroutines waiting for data to run. This happens in the eventloop() function.

nfl.SOCKETS:insert(lsock,'r',function()
  local _,data,err = lsock:recv()
  if data then
    for co in pairs(clients) do
      nfl.schedule(co,data) -- problem
    end
  else
    syslog('error',"recv()=%s",errno[err])
  end
end)

I've noted a problematic line of code here.

And now the core of the routine to handle a TLS connection. This code yields to receive data, then writes the data to the TLS socket.

  while true do
    local data = coroutine.yield()
    if not data then break end
    local okay,errmsg = ios:write(data,'\n') -- <<< HERE
    if not okay then
      syslog('error',"tls:read() = %s",errmsg)
      break
    end
  end

I've marked where the root cause lies, and it's pretty subtle I think. The core issue is that ios:write() here could block, because the kernel output buffer is full and we need to wait for the kernel to send it. But the code that handles the UDP socket just assumes that the TLS coroutine is ready for more data. If ios:write() blocks and more UDP data comes on, the coroutine is prematurely resumed with the data, but that's just taken by the TLS thread as the write being successful, then yielding and then things get … weird, as the UDP side and the TLS side are now out of sync with each other. This, fortunately, hasn't trigger on me.

Yet.

It could, if too much was being logged to syslog. I wrote the following code to test it out:

#include <syslog.h>

#define MSG " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXUZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"

int main(void)
{
  int i;
  for (i = 0 ; i < 500 ; i++)
    syslog(LOG_DEBUG,"%3d " MSG MSG MSG,i);
  return 0;
}

And sure enough, the spice data stopped flowing.

What I needed to do was queue up the log messages to a given client, and only schedule it to run when it's waiting for more data. A few failed attempts followed—they were all based on scheduling the TLS thread when X number of messages were queued up (I tried one, then zero; neither worked). It worked much better by using a flag to indicate when the TLS coroutine wanted to be scheduled or not.

The UDP socket code is now:

nfl.SOCKETS:insert(lsock,'r',function()
  local _,data,err = lsock:recv()
  if data then
    for co,queue in pairs(clients) do
      table.insert(queue,data)
      if queue.ready then
        nfl.schedule(co,data)
      end
    end
  else
    syslog('error',"recv()=%s",errno[err])
  end
end)

The client list now contains a list of logs to send, along with a flag that the TLS coroutine sets indicating if it needs running or not. This takes advantage of Lua's tables which can have a hash part (named indices) and an array part, so we can include a flag in the queue.

And now the updated TLS coroutine:

local function client_main(ios)
  local function main()
    while #clients[ios.__co] > 0 do
      local data     = table.remove(clients[ios.__co],1)
      local okay,err = ios:write(data,'\n')
      if not okay then
        syslog('error',"tls:write()=%s",err)
        return
      end
    end
    
    clients[ios.__co].ready = true
    if not coroutine.yield() then
      return
    end
    clients[ios.__co].ready = false
    return main()
  end
  
  ios:_handshake()
  
  if ios.__ctx:peer_cert_issuer() ~= ISSUER then
    ios:close()
    return
  end
  
  syslog('info',"remote=%s",ios.__remote.addr)
  clients[ios.__co] = { ready = false }
  main()
  clients[ios.__co] = nil
  syslog('info',"remote=%s disconnecting",ios.__remote.addr)
  ios:close()
end

The core of the routine, the nested function main() does the real work here. When main() starts, the flag for queue readiness is false. It then runs through its input queue sending data to the client. Once that is done, it sets the queue readiness flag to true and then yields. Once it resumes, it sets the queue readiness flag to 'false' and (through a tail call) starts over again.

This ensures that logs are queued properly for delivery, and running the C test program again showed it works.

Obligatory Picture

Dad was resigned to the fact that I was, indeed, a landlubber, and turned the boat around yet again …

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer

No AI was used in the making of this site, unless otherwise noted.

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2025 by Sean Conner. All Rights Reserved.