Sunday, February 02, 2025
Artisanal code to solve an issue only I have
Update on Tuesday, February 4th, 2025
The code presented below has a bug that has since been fixed. The code linked to below contains the current, fixed code. That is all.
Now on with the original post …
I'm still using my ISP,
despite the repeated letters warning that my service will go away at some point.
But in the meantime,
they keep issuing a new IP address every so often, just to reiterate their dedication to serving up a dynamic IP address at no additional cost to me.
One of the casualties of their new policy is the monitoring of the system logs on my public server.
I used to send syslog
output from my public server to my development system at home,
just to make it easier to keep an eye on what's happening.
No more.
What I needed was a reverse-proxy type of thing—where the client
(my development machine)
connects to the server,
then the server sends a copy of the logs down the connection.
A separate program would be easier to write than modifying the existing syslog
daemon I'm using.
It was a simple matter of telling the syslog
daemon to forward a copy of all the logs to another program on the same system.
Then I just had to write that program.
To begin with,
I need to load some modules:
local syslog = require "org.conman.syslog"
local signal = require "org.conman.signal"
local errno  = require "org.conman.errno"
local tls    = require "org.conman.nfl.tls"
local nfl    = require "org.conman.nfl"
local net    = require "org.conman.net"
The nfl
module is my “server framework” for network based servers.
Each TCP or TLS connection will be run on its own Lua thread,
making the code easier to write than the typical “callback hell” that seems to be popular these days.
I still need to make some low-level network calls,
so I need the net
module as well.
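To give a feel for that coroutine-per-connection model, here's a toy sketch (not part of the actual program, and assuming the connection object supports the same lines()/write() calls used elsewhere in this post): a handler is just straight-line code, and the framework parks the coroutine whenever I/O would block.

-- hypothetical echo handler, only to illustrate the model; "ios" is the
-- connection object the framework hands to the handler
local function echo_main(ios)
  for line in ios:lines() do  -- the framework suspends us here until data arrives
    ios:write(line,'\n')      -- and here if the write would block
  end
  ios:close()
end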
On to the configuration:
local SYSLOG = "127.0.0.1" local HOST = "brevard.conman.org" local CERT = "brevard.conman.org.cert.pem" local KEY = "brevard.conman.org.key.pem" local ISSUER = "/C=US/ST=FL/O=Conman Laboratories/OU=Security Division/CN=Conman Laboratories CA/emailAddress=ca@conman.org" local clients = {}
I didn't bother with a configuration file.
This whole code base exists to solve an issue I have as simply as possible.
At this point,
a configuration file is overkill.
The SYSLOG
variable defines the address this server will use to accept output from syslog.
Due to the way my current syslog
daemon works,
the port number it uses to forward logs is hard coded,
so no need to specify the port.
I'm going to run this over TLS because,
why not?
The tls
module makes it easy to use,
and it will make authentication trivial for this program.
The CERT
and KEY
variables name the certificate and private key needed,
and these are generated by some scripts I wrote to play around with running my own simple certificate authority.
My server is set to accept certificates signed by my simple certificate authority,
which you can see in the definition of the ISSUER
variable.
The clients
variable tracks the clients that connect to collect syslog
output.
Even though I'll only ever have one client,
it's easy enough to keep them in a table.
local laddr = net.address(SYSLOG,'udp',514)
local lsock = net.socket(laddr.family,'udp')

lsock:bind(laddr)
nfl.SOCKETS:insert(lsock,'r',function()
  local _,data,err = lsock:recv()
  if data then
    for co in pairs(clients) do
      nfl.schedule(co,data)
    end
  else
    syslog('error',"recv()=%s",errno[err])
  end
end)
And now we create the local socket to receive output from syslog,
and then add the socket to a table of sockets the framework uses,
telling it to handle “read-ready” events.
The data is read and then for each thread
(Lua calls them “coroutines”)
in the clients
list,
we schedule said thread to run with the data received from syslog.
local okay,err = tls.listen(HOST,514,client_main,function(conf)
  conf:verify_client()
  return conf:keypair_file(CERT,KEY)
     and conf:protocols("tlsv1.3")
end)

if not okay then
  syslog('error',"tls.listen()=%s",err)
  os.exit(1,true)
end

signal.catch('int')
signal.catch('term')
nfl.server_eventloop(function() return signal.caught() end)
os.exit(0,true)
And before we get to the routine that handles the clients,
this is the code that creates a listening socket for TLS connections.
We configure the listening socket to require the client to send a certificate of its own
(this is half of the authentication routine),
set the certificate and key used to secure the connection,
and set the minimum protocol level.
There's some error checking,
setting up to catch some signals,
then we start the main loop of the framework,
which will terminate upon receiving a SIGINT
(interrupt)
or SIGTERM
(terminate).
And finally, the code that runs on each TLS connection:
local function client_main(ios)
  ios:_handshake()
  
  if ios.__ctx:peer_cert_issuer() ~= ISSUER then
    ios:close()
    return
  end
  
  syslog('info',"remote=%s",ios.__remote.addr)
  clients[ios.__co] = true
  
  while true do
    local data = coroutine.yield()
    if not data then break end
    local okay,errmsg = ios:write(data,'\n')
    if not okay then
      syslog('error',"tls:read() = %s",errmsg)
      break
    end
  end
  
  syslog('info',"remote=%s disconnecting",ios.__remote.addr)
  clients[ios.__co] = nil
  ios:close()
end
The handshake is required to ensure that the client certificate is fully sent before we can check the issuer of said certificate. This is the extent of my authentication—I check that the certificate is issued from my simple certificate authority and not just any random but valid certificate being presented. Yes, there is a chance someone could forge a certificate claiming to be from my simple certificate authority, but to get such a certificate, some real certificate authority would need to issue someone else a certificate that matches the issuer on my certificates. I'm not seeing that happening any time soon (and if that happens, there are bigger things I need to worry about).
Once I've authenticated the certificate,
I then pause the thread,
waiting for data from the UDP socket
(see above).
If there's no data,
then the client has dropped the connection and we exit out of the loop.
We then write the data from syslog
to the client and if that fails,
we exit out of the loop.
Once out of the loop, we close the connection and that's pretty much all there is to it.
Yes,
I realize that the calls to syslog()
will be sent to the syslog
daemon,
only to be passed back to this program,
but at least there's a log of this on the server.
I should also note that I do not attempt to track which logs have been sent and which haven't—that's a deliberate design decision on my part and I can live with missing logs on my development server. The logs are still recorded on the server itself so if it's important, I still have them, and this keeps this code simple.
The client code on my development server is even simpler:
local clock  = require "org.conman.clock"
local signal = require "org.conman.signal"
local tls    = require "org.conman.net.tls"
local net    = require "org.conman.net"

local SYSLOG = "192.168.1.10"
local HOST   = "brevard.conman.org"
local CERT   = "/home/spc/projects/CA/ca/intermediate/certs/sean.conner.cert.pem"
local KEY    = "/home/spc/projects/CA/ca/intermediate/private/sean.conner.key.pem"
Again, load the required modules, and configure the program. Much like the server, having a configuration file for this is way overkill, thus the above variables.
signal.catch('int')
signal.catch('term')

local addr = net.address(SYSLOG,'udp',514)
local sock = net.socket(addr.family,'udp')

connect(sock,addr)
The code sets up some signal handlers,
creates a socket to send the data to syslog
and calls the connect() function.
local function connect(sock,addr)
  local ios,err = tls.connect(HOST,514,function(conf)
    return conf:keypair_file(CERT,KEY)
       and conf:protocols("tlsv1.3")
  end)
  
  if not ios then
    io.stderr:write("Failure: ",err," retrying in a bit ...\n")
    clock.sleep(1)
  else
    io.stderr:write("\n\n\nConnected\n")
    main(ios,sock,addr)
  end
  
  if not signal.caught() then
    return connect(sock,addr)
  end
end
The connect()
function tries to connect to the server with the given certificates.
If it fails
(and I expect this to happen when I get reassigned an IP address)
it waits for a bit
and retries again.
If the connection succeeds though:
local function main(ios,sock,addr)
  for data in ios:lines() do
    if signal.caught() then
      ios:close()
      os.exit(0,true)
    end
    sock:send(addr,data)
  end
  ios:close()
end
The code just loops,
reading lines from the server and then sending them directly to the syslog
daemon.
If any error occurs
(like the IP address getting reassigned so the connection drops)
the loop ends,
we close the connection and return,
falling into the retry loop in the connect()
function.
In case anyone is interested, here's the source code for the server and the client.
And now some metacommentary on the artisanal code I just wrote
When I wrote the two programs to retrieve output from syslog
from my public server,
the thing I did not use was any AI program
(aka Cat)
to help with the design or the code.
It was a simple problem with a straightforward solution and it's sad to think that more and more programmers are reaching for Cat for even simple programs.
I wonder though—is the popularity of Cat because of business demands that incentivize quick hacks to get new features and/or bug fixes and disincentivize deep knowledge or methodical implementations? Because of the constant churn of frameworks du jour and languages with constantly changing implementations? Because of sprawling code bases that not a single person can understand as a whole? Because businesses want to remove expensive programmers who might say “no”?
Anyway …
I don't expect the code I wrote to be of use for anyone else. The issue I'm solving is probably unique to me (and to the death of the true peer-to-peer Internet but I digress). But I also feel that such simple programs, ones that can be thought of as “disposable” almost, are not popular these days.
Although I'll admit that could be just a bias I'm forming from some forums I hang out on. These programs are too simple, there's no need for Docker (which is what? A tar file with some required files for use with a custom program to get around the fact that shared libraries are a mess?) or Kubernetes (which is what? A Google project Google doesn't even use but convinced enough people it's required to run at Google Scale?). Yeah, there are a few dependencies but not the hundreds you would get from using something like NodeJS.
I don't know … I'm just … in a mood.
Tuesday, February 04, 2025
Concurrency is tricky
As I was writing the previous entry I got the nagging feeling that something wasn't quite right with the code. I got distracted yesterday helping a friend bounce programming issues off me, but after that, I was able to take a good look at the code and figured out what I did wrong.
Well, not “wrong” per se; the code worked as written. It's just that it could fail catastrophically under the right conditions (or maybe the wrong conditions, depending upon your view).
But first, a bit about how my network server framework works. The core bit of code is this:
local function eventloop(done_f)
  if done_f() then return end
  
  -- calculate and handle timeouts
  -- each coroutine that timed out is
  -- scheduled to run on the RUNQUEUE,
  -- with nil and ETIMEDOUT.
  
  SOCKETS:wait(timeout)
  for event in SOCKETS:events() do
    event.obj(event)
  end
  
  while #RUNQUEUE > 0 do
    -- run each coroutine in the run queue until
    -- it either yields or returns (meaning
    -- it's finished running).
  end
  
  return eventloop(done_f)
end
Details are omitted
(full gory details here)
but in general,
the event loop calls a passed-in function to check if we need to shut down,
then calculates a timeout value while checking for coroutines that registered a timeout.
If any did,
we add the coroutine to a run queue with nil
and ETIMEDOUT
to inform the coroutine, when it resumes, that it timed out.
Then we scan a set of network sockets for activity with SOCKETS:wait()
(on Linux, this ends up calling epoll_wait()
; BSDs with kqueue()
and most other Unix systems with poll()
).
We then call the handling function for each event.
These can end up creating new coroutines and scheduling coroutines to run
(these will be added to the run queue).
And then for each coroutine in the run queue,
we run it.
Lather,
rinse,
repeat.
Simple enough.
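For what it's worth, the run-queue step elided in the comments above looks roughly like this (a sketch only, with a made-up entry layout of { co = <coroutine> } plus the resume values in the array part; the real framework also deals with finished and errored coroutines):

-- hypothetical sketch of draining the run queue
while #RUNQUEUE > 0 do
  local entry = table.remove(RUNQUEUE,1)
  -- resume the coroutine with whatever values were scheduled for it
  local okay,err = coroutine.resume(entry.co,table.unpack(entry))
  if not okay then
    syslog('error',"coroutine error: %s",err)
  end
end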
Now,
on to the code I presented.
This code registers a function to run when the given UDP socket receives a packet of data,
and schedules a number of coroutines waiting for data to run.
This happens in the eventloop()
function.
nfl.SOCKETS:insert(lsock,'r',function()
  local _,data,err = lsock:recv()
  if data then
    for co in pairs(clients) do
      nfl.schedule(co,data) -- problem
    end
  else
    syslog('error',"recv()=%s",errno[err])
  end
end)
I've noted a problematic line of code here.
And now the core of the routine to handle a TLS connection. This code yields to receive data, then writes the data to the TLS socket.
while true do
  local data = coroutine.yield()
  if not data then break end
  local okay,errmsg = ios:write(data,'\n') -- <<< HERE
  if not okay then
    syslog('error',"tls:read() = %s",errmsg)
    break
  end
end
I've marked where the root cause lies,
and it's pretty subtle I think.
The core issue is that ios:write()
here could block,
because the kernel output buffer is full and we need to wait for the kernel to send it.
But the code that handles the UDP socket just assumes that the TLS coroutine is ready for more data.
If ios:write()
blocks and more UDP data comes in,
the coroutine is prematurely resumed with the data,
which the TLS thread takes as the write having succeeded;
it then yields again and things get … weird,
as the UDP side and the TLS side are now out of sync with each other.
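To see the shape of the problem outside the framework, here's a small self-contained sketch using plain Lua coroutines (not the actual server code): a coroutine that yields once to wait for data and once to wait for a blocked write, resumed one time too many.

-- the writer expects resume #1 to carry data and resume #2 to carry
-- the result of the (pretend) blocking write
local writer = coroutine.create(function()
  while true do
    local data = coroutine.yield()   -- wait for data
    print("writing",data)
    local status = coroutine.yield() -- pretend the write blocked here
    print("write finished?",status)
  end
end)

coroutine.resume(writer)               -- start it; it parks waiting for data
coroutine.resume(writer,"log line 1")  -- fine: taken as data
coroutine.resume(writer,"log line 2")  -- oops: taken as the write status,
                                       -- and this log line is never written

The second log line lands on the wrong yield, and from then on data and write statuses keep trading places.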
This,
fortunately,
hasn't triggered on me.
Yet.
It could,
if too much was being logged to syslog.
I wrote the following code to test it out:
#include <syslog.h>

#define MSG " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXUZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"

int main(void)
{
  int i;
  
  for (i = 0 ; i < 500 ; i++)
    syslog(LOG_DEBUG,"%3d " MSG MSG MSG,i);
  return 0;
}
And sure enough,
the spice data stopped flowing.
What I needed to do was queue up the log messages to a given client, and only schedule it to run when it's waiting for more data. A few failed attempts followed—they were all based on scheduling the TLS thread when X number of messages were queued up (I tried one, then zero; neither worked). It worked much better by using a flag to indicate when the TLS coroutine wanted to be scheduled or not.
The UDP socket code is now:
nfl.SOCKETS:insert(lsock,'r',function()
  local _,data,err = lsock:recv()
  if data then
    for co,queue in pairs(clients) do
      table.insert(queue,data)
      if queue.ready then
        nfl.schedule(co,data)
      end
    end
  else
    syslog('error',"recv()=%s",errno[err])
  end
end)
The client list now contains a list of logs to send, along with a flag that the TLS coroutine sets indicating if it needs running or not. This takes advantage of Lua's tables which can have a hash part (named indices) and an array part, so we can include a flag in the queue.
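In other words, each entry in clients looks something like this (hypothetical contents, just to show the shape):

-- log lines live in the array part, the scheduling flag in the hash part
local queue = { "first log line" , "second log line" }
queue.ready = false

print(#queue)       --> 2 (the length operator only counts the array part)
print(queue.ready)  --> false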
And now the updated TLS coroutine:
local function client_main(ios)
  local function main()
    while #clients[ios.__co] > 0 do
      local data = table.remove(clients[ios.__co],1)
      local okay,err = ios:write(data,'\n')
      if not okay then
        syslog('error',"tls:write()=%s",err)
        return
      end
    end
    
    clients[ios.__co].ready = true
    if not coroutine.yield() then
      return
    end
    clients[ios.__co].ready = false
    return main()
  end
  
  ios:_handshake()
  
  if ios.__ctx:peer_cert_issuer() ~= ISSUER then
    ios:close()
    return
  end
  
  syslog('info',"remote=%s",ios.__remote.addr)
  clients[ios.__co] = { ready = false }
  main()
  clients[ios.__co] = nil
  syslog('info',"remote=%s disconnecting",ios.__remote.addr)
  ios:close()
end
The core of the routine,
the nested function main(),
does the real work here.
When main()
starts,
the flag for queue readiness is false.
It then runs through its input queue sending data to the client.
Once that is done,
it sets the queue readiness flag to true
and then yields.
Once it resumes,
it sets the queue readiness flag to false and
(through a tail call)
starts over again.
This ensures that logs are queued properly for delivery, and running the C test program again showed it works.