Thursday, April 30, 2015
Some perils of handling signals in Lua
This is one of those “Oh, yeah, I didn't think that through, did I?” type of bug.
I wrote a signal module for Lua, which can handle both ANSI C and POSIX signals with largely the same API (the POSIX implementation has some additional functions defined).
Handling signals in Lua is not that straightforward because of the nature of signals: you are effectively writing multithreaded code.
You just can't call back into Lua from the signal handler
(while the Lua VM has no static data and each Lua state is isolated unto itself,
two threads sharing a Lua state can lead to problems).
The only Lua function you can safely call is lua_sethook(),
which installs a hook that the Lua VM will call at the next VM instruction
(it's typically used for debugging and signal handling).
That hook can then call back into Lua.
It is a bit convoluted
(the signal handler will call lua_sethook()
and return; the Lua VM will resume and then call the hook),
but it does allow you to write signal handlers in Lua:
signal.catch('windowchange',function() print("Wheeee! Our terminal just resized!") end)
and not have it blow up on you.
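To make the "record it now, run it later" shape concrete, here is a minimal sketch in plain C that leaves Lua out entirely. It is only an analogue of the lua_sethook() trick, not the code from my module: the names install_deferred(), dispatch_pending() and handled_count() are made up for this example, and the main loop stands in for the Lua VM calling the hook.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* All names here are hypothetical; this is not the org.conman.signal code. */

static volatile sig_atomic_t pending = 0;  /* set by the handler, cleared by the main loop */
static int handled = 0;                    /* how many times the deferred work has run */

/* The real signal handler: record the event and return immediately. */
static void on_signal(int sig)
{
    (void)sig;
    pending = 1;
}

/* Install on_signal() for `sig`; returns 0 on success, -1 on failure. */
int install_deferred(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* Called by the main loop at a safe point (the analogue of the VM hook).
   Returns 1 if deferred work was run, 0 if nothing was pending. */
int dispatch_pending(void)
{
    if (!pending)
        return 0;
    pending = 0;
    handled++;   /* stand-in for "call the user's handler" */
    return 1;
}

/* Number of times the deferred work has actually run. */
int handled_count(void)
{
    return handled;
}
```

After install_deferred(SIGUSR1) and raise(SIGUSR1), handled_count() is still 0; the work only happens once the main loop gets around to calling dispatch_pending(). Keep that detail in mind.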
So, with that in mind, I give you this code:
    local net    = require "org.conman.net"
    local clock  = require "org.conman.clock"
    local signal = require "org.conman.signal"

    local raddr = net.address("127.0.0.1",'udp','echo')
    local sock  = net.socket(raddr.family,'udp')

    -- once a second, send the current time to the echo server
    signal.catch('alarm',function()
      sock:send(raddr,tostring(clock.get()))
    end)

    clock.itimer(1)

    local previous = clock.get()

    -- wait for the echoed packets and report the timings
    while true do
      local _,data = sock:recv()
      local now    = clock.get()
      if data then
        local zen = tonumber(data)
        print(string.format("%.7f\t%.7f",now - zen,now - previous))
        previous = now
      end
    end
This is a UDP echo client program.
signal.catch() handles the alarm signal (SIGALRM) by sending a packet of data
(which is just the current time)
to the echo server.
clock.itimer() tells the kernel to send the alarm signal once a second.
So once a second,
our program receives the alarm signal and sends the current time.
Then,
in an infinite loop,
we just wait for packets to arrive
(which should be the packets we sent to the echo server—they're “echoed” back to us)
and we calculate how long the packet took round trip and how long it was from the previous packet.
The output looks like:
    0.0002971    1.0014961
    0.0003922    0.9999950
    0.0002851    0.9998930
    0.0003171    1.0000319
    0.0003910    0.9999740
    0.0002551    0.9998641
    0.0003359    1.0000808
The first column is the round trip time (in seconds) for the packet (around 3 to 4 ten thousandths of a second), and the second column is how long (in seconds) from the previous packet (a second, give or take a few ten thousandths).
But our call to sock:recv() is interrupted by the alarm signal.
Unfortunately, one side effect of signals is that they will interrupt “long running” system calls,
which are almost always system calls dealing with I/O, such as read() or write().
When such a call is interrupted,
the system call will return an error of EINTR.
We can see this if we change the code a bit:
    local net    = require "org.conman.net"
    local clock  = require "org.conman.clock"
    local signal = require "org.conman.signal"
    local errno  = require "org.conman.errno"

    local raddr = net.address("192.168.90.118",'udp',22222)
    local sock  = net.socket(raddr.family,'udp')

    signal.default('int')
    signal.catch('alarm',function()
      sock:send(raddr,tostring(clock.get()))
    end)

    clock.itimer(1)

    local previous = clock.get()

    while true do
      local _,data,err = sock:recv()
      local now        = clock.get()
      if data then
        local zen = tonumber(data)
        print(string.format("%.7f\t%.7f",now - zen,now - previous))
        previous = now
      else
        print(">>>",errno[err])
      end
    end
and when we run it:
    >>> Interrupted system call
    0.0003049    1.0015509
    >>> Interrupted system call
    0.0002320    0.9998269
    >>> Interrupted system call
    0.0002131    0.9999812
    >>> Interrupted system call
    0.0001860    0.9999728
    >>> Interrupted system call
    0.0002639    0.9999781
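The same thing can be seen one level down, without Lua in the picture. What follows is a rough sketch in C, under a few assumptions of mine: a pipe stands in for the socket, read() stands in for recvfrom(), and the function name read_interrupted_errno() is invented for the example. The handler is installed without asking for system calls to be restarted.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;   /* nothing to do; being delivered is enough to interrupt read() */
}

/* Blocks in read() on an empty pipe until SIGALRM arrives.
   Returns the errno left by read(), 0 if read() unexpectedly returned
   without an error, or -1 if the setup failed. */
int read_interrupted_errno(void)
{
    int fds[2];
    char byte;
    ssize_t n;
    int err;
    struct sigaction sa, old;
    struct itimerval once = { {0, 0}, {0, 100000} };  /* one shot, 100 ms from now */

    if (pipe(fds) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                      /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &sa, &old) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (setitimer(ITIMER_REAL, &once, NULL) != 0) {
        sigaction(SIGALRM, &old, NULL);
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    n = read(fds[0], &byte, 1);           /* nothing is ever written, so this blocks */
    err = (n < 0) ? errno : 0;

    sigaction(SIGALRM, &old, NULL);
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

On a POSIX system I would expect this to return EINTR, which is the error behind the “Interrupted system call” lines in the output above.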
With POSIX,
you can specify that for a given signal,
system calls are to be automatically restarted
(the SA_RESTART flag to sigaction())
so you can dispense with EINTR error handling.
And here's where we finally get to the “Oh, yeah, I didn't think that through, did I?” type of bug.
Not wanting the code to be interrupted by the alarm signal,
I changed the call to signal.catch() so it would restart any system calls:
signal.catch('alarm',function() sock:send(raddr,tostring(clock.get())) end,'restart')
When I ran the code, I got nothing! There was simply no output happening. It caught me by surprise and it took me several minutes to figure out what was happening (or rather, what wasn't happening):
- We enter sock:recv() (a Lua function), which ultimately does a system call to recvfrom();
- the alarm signal (SIGALRM) is raised;
- the kernel interrupts our current system call (recvfrom()) and calls the signal handler we installed;
- the signal handler calls lua_sethook() so that the Lua VM (when it resumes) will be stopped and our hook function (written in Lua) will execute;
- the kernel then resumes the system call (recvfrom()) that was previously interrupted.
And thus we get to the punchline:
the Lua VM doesn't resume because we're still in a system call!
As a result,
the signal handler written in Lua is never called,
so no packet is ever sent,
and we stay stuck in our system call (recvfrom())
waiting for some data that will never arrive.
D'oh!
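The trap can be reproduced in a few lines of C, provided the handler behaves the way the Lua module's handler does and only schedules work instead of doing it. This is a sketch under my own assumptions, not the module's code: a pipe stands in for the socket, the name ticks_before_read_returns() is invented, and the handler writes a byte on the third alarm purely as a watchdog so the demonstration ends instead of hanging forever like the real program did.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks;   /* alarms delivered so far */
static int wake_fd = -1;              /* write end of the pipe (watchdog only) */

/* A deferring handler: it does no real work itself, it only counts. */
static void on_alarm(int sig)
{
    int saved = errno;

    (void)sig;
    ticks++;
    if (ticks == 3) {                 /* watchdog so the demo terminates */
        char c = 'x';
        if (write(wake_fd, &c, 1) < 0) {
            /* nothing safe to do about it here */
        }
    }
    errno = saved;
}

/* Blocks in read() on an empty pipe while SIGALRM fires every 100 ms.
   Returns how many alarms had been delivered when read() finally
   returned, or -1 if the setup failed. */
int ticks_before_read_returns(int restart)
{
    int fds[2];
    char byte;
    ssize_t n;
    int seen;
    struct sigaction sa, old;
    struct itimerval every = { {0, 100000}, {0, 100000} };
    struct itimerval off   = { {0, 0}, {0, 0} };

    if (pipe(fds) != 0)
        return -1;
    wake_fd = fds[1];
    ticks = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = restart ? SA_RESTART : 0;
    if (sigaction(SIGALRM, &sa, &old) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    setitimer(ITIMER_REAL, &every, NULL);

    n = read(fds[0], &byte, 1);       /* the "long running" system call */
    (void)n;
    seen = ticks;

    setitimer(ITIMER_REAL, &off, NULL);
    sigaction(SIGALRM, &old, NULL);
    close(fds[0]);
    close(fds[1]);
    return seen;
}
```

Without restarting, I would expect ticks_before_read_returns(0) to come back after the first alarm, so a main loop gets its chance to run whatever was scheduled. With restarting, ticks_before_read_returns(1) only comes back once the watchdog writes a byte on the third alarm; take the watchdog away and it never comes back at all.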
If the Lua client above were written in C,
there would be no issue; clock_gettime() and sendto()
(the system calls underlying the Lua functions clock.get() and sock:send() respectively)
are safe to call from a signal handler.
I may not have been able to safely convert the time to text (since snprintf(),
the usual standard C way to convert numbers to text,
isn't documented as being safe to call in a signal handler), but sending the raw binary values would be okay in that case.
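For the curious, a C handler along those lines might look something like the sketch below. The assumptions are mine: a connected AF_UNIX datagram socketpair stands in for the UDP socket and the echo server (which is why sendto() is given no destination address), the function name recv_timestamp_from_handler() is invented, and the timer fires once instead of every second.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

static int send_fd = -1;   /* socket the handler sends on */

/* Does the real work inside the handler, using only async-signal-safe calls. */
static void on_alarm(int sig)
{
    struct timespec now;
    int saved = errno;

    (void)sig;
    if (clock_gettime(CLOCK_REALTIME, &now) == 0) {
        /* raw binary, no snprintf(); the socket is connected, so no address */
        if (sendto(send_fd, &now, sizeof now, 0, NULL, 0) < 0) {
            /* nothing safe to do about it here */
        }
    }
    errno = saved;
}

/* Blocks in recv() with restarting enabled until the handler's packet
   arrives. Returns the number of bytes received, or -1 on failure. */
long recv_timestamp_from_handler(struct timespec *out)
{
    int sv[2];
    long n;
    struct sigaction sa, old;
    struct itimerval once = { {0, 0}, {0, 100000} };  /* one shot, 100 ms from now */

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;
    send_fd = sv[1];

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;             /* restart the interrupted recv() */
    if (sigaction(SIGALRM, &sa, &old) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (setitimer(ITIMER_REAL, &once, NULL) != 0) {
        sigaction(SIGALRM, &old, NULL);
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    n = (long)recv(sv[0], out, sizeof *out, 0);

    sigaction(SIGALRM, &old, NULL);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

Here the restarted recv() is harmless, because by the time the kernel resumes it the handler has already sent the packet it is waiting for.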
But this isn't C, it's Lua. And what we have here is a type of leaky abstraction. That 20/20 hindsight is such a bastard.