Sunday, June 30, 2024
Four lines of code … it was four lines of code
I checked last night, and found the following:
| | CPU | requests |
|---|---|---|
| gopher | 28:37 | 8360 |
| gemini | 6:02 | 24345 |
How can a TCP service take more CPU time than a TLS service? Especially when TLS runs over TCP? And when TLS generates six times the packets that plain TCP does?
There's some difference in how I handle TCP and TLS that's causing this. Profiling the code didn't reveal much, as my attempts kept showing that gopher should be using less CPU than gemini, so it was just a matter of staring at the code and thinking hard.
It wasn't easy. Yes, there are some differences between the TCP interface and the TLS interface, so it was a matter of going through the code, bringing them closer in implementation and seeing if that fixed things. I would fix the code, run a test, see it not make a difference, and try again. Eventually, I came across this bit of code:
```lua
-- ---------------------------------------------------------------------
-- It might be a bug *somewhere*, but on Linux, this is required to get
-- Unix domain sockets to work with the NFL driver.  There's a race
-- condition where writing data then calling close() may cause the
-- other side to receive no data.  This does NOT appear to happen with
-- TCP sockets, but this doesn't hurt the TCP side in any case.
-- ---------------------------------------------------------------------

while ios.__socket.sendqueue and ios.__socket.sendqueue > 0 do
  nfl.SOCKETS:update(ios.__socket,'w')
  coroutine.yield()
end
```
This bit of code wasn't in the TLS implementation, and as the comment says, it “shouldn't” hurt the TCP path, but hey, it was only needed for local (or Unix) domain sockets to begin with, so let me try removing that bit of code and run a test.
With the code in place:

| | CPU | real | user | sys |
|---|---|---|---|---|
| gopher | 18 | 46.726 | 13.971 | 4.059 |
| gemini | 4 | 107.928 | 3.951 | 0.322 |
And with the code removed:

| | CPU | real | user | sys |
|---|---|---|---|---|
| gopher | 0 | 52.403 | 0.635 | 0.185 |
| gemini | 4 | 103.290 | 3.957 | 0.285 |
THAT'S more like it!
Now, about that code … this is my best guess as to what was going on.
The `sendqueue` field is a Linux-only field, so I'm checking to see if it exists, and then checking the value (if only to avoid the “attempt to compare number with nil” Lua error on non-Linux systems). This returns the number of bytes the kernel has yet to send, so we switch the socket to trigger an event when we can write to it, and yield. Even worse, the check isn't just a simple variable fetch, but ends up being an `ioctl()` system call. Twice!
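Even if the loop had to stay, the two `ioctl()` calls per pass could have been cut to one by reading the field once per iteration. A minimal sketch, assuming only the same `nfl` calls shown above:

```lua
-- Sketch only: read sendqueue once per pass, so each iteration costs
-- one ioctl() instead of two.
local pending = ios.__socket.sendqueue    -- nil on non-Linux systems
while pending and pending > 0 do
  nfl.SOCKETS:update(ios.__socket,'w')
  coroutine.yield()
  pending = ios.__socket.sendqueue        -- re-check after being resumed
end
```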
Now the “write” event code:
```lua
if event.write then
  if #ios.__output > 0 then
    -- try to flush the buffered output
    local bytes,err = ios.__socket:send(nil,ios.__output)
    if err == 0 then
      ios.__wbytes = ios.__wbytes + bytes
      ios.__output = ios.__output:sub(bytes + 1,-1)
    else
      syslog('error',"socket:send() = %s",errno[err])
      nfl.SOCKETS:remove(ios.__socket)
      nfl.schedule(ios.__co,false,errno[err],err)
    end
  end

  if #ios.__output == 0 then
    -- nothing left to send, so go back to read events and resume the coroutine
    nfl.SOCKETS:update(ios.__socket,'r')
    nfl.schedule(ios.__co,true)
  end
end
```
There's probably no data to be sent (`ios.__output` is empty), so we immediately switch the socket back to trigger on readability, then resume the coroutine handling the socket. The coroutine goes back to the `while` loop, sets the socket back to trigger on writability, and the two keep bouncing like that until the kernel buffer has finally been sent.
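To make that ping-pong concrete, here's a small self-contained toy (plain Lua, none of the actual `nfl` code) that models it: the coroutine asks for writability, the “write” handler finds nothing to send, flips back to readability and resumes the coroutine, and the pair spin until the pretend kernel queue drains.

```lua
-- Toy model of the read/write ping-pong; no real sockets involved.
local sendqueue = 4096        -- pretend bytes still queued in the kernel
local mode      = 'r'         -- what the socket is currently registered for
local passes    = 0

local co = coroutine.create(function()
  while sendqueue > 0 do
    mode = 'w'                -- "wake me when the socket is writable"
    coroutine.yield()
  end
end)

coroutine.resume(co)          -- start it; it yields almost immediately
while coroutine.status(co) ~= "dead" do
  -- the "write" event fires: nothing buffered, so flip back to reads
  -- and resume the coroutine right away
  mode      = 'r'
  passes    = passes + 1
  sendqueue = math.max(0,sendqueue - 512)  -- the kernel drains a little
  coroutine.resume(co)
end

print(passes)                 --> 8 round trips even for this tiny queue
```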
As I wrote, this fixed an issue I had with local domain sockets—it wasn't an issue with an IP socket, and in fact, was the wrong thing to do for an IP socket.
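Had the race still mattered for local connections, one option (and this is purely a sketch; the `family` field here is made up for illustration and isn't part of the real socket API) would have been to keep the drain loop but only run it for Unix domain sockets:

```lua
-- Hypothetical: only wait for the kernel send queue to drain on Unix
-- domain sockets, where the write()-then-close() race was actually seen.
-- The 'family' field is assumed for illustration, not the real API.
if ios.__socket.family == 'unix' then
  while ios.__socket.sendqueue and ios.__socket.sendqueue > 0 do
    nfl.SOCKETS:update(ios.__socket,'w')
    coroutine.yield()
  end
end
```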
And as it is with these types of bugs, finding the root cause is not trivial, but the fix almost always is.