Thursday, February 12, 2015
Sixteen hours to change three lines of code
The bug report was simple enough: “Project: Sippy-Cup” would, when it had to retransmit the data (remember, we're running over UDP), it would send an additional packet.
Sippy Cup | direction | Client |
---|---|---|
DATA | → | (packet dropped) |
DATA | → | (packet received) |
(packet recieved) | ← | ACK |
DATA | → |
So far, the bugs found in “Project: Sippy-Cup” have been easy to reproduce (thankfully!) and easy to fix once identified. This one was easy to reproduce … and took two days of intense effort to fix. The bug is not a show-stopper, and the client should (there's that word) handle the duplicate packet. My manager, S, was subtly hinting that I should put this on the backburner, but this is one of those problems that just irks me.
I added logging statements. I instrumented the code. I traced the
execution of the program line-by-line (easy enough to do within Lua as there's a pre-defined
method for hooking into the execution engine, and even though this didn't
reveal the root cause of the bug, it did reveal a few places where
some simple changes optimized the code (but that's for another post). For
example, here's all the code that's
executed when you run the luarocks
command, but fair
warning—it's nearly 7,000 lines long) and it became clear that
something was wrong with the timeout handling (what I didn't
mention the other day about scheduling
coroutines is how I handle timeouts, but alas, that's for another post
as well). The code is basically:
request = receive_packet_from_client() send_ack() send_request_to_db(request) -- ------------------------------------------------------------------ -- This sets things up to force the coroutine to resume in one second, -- returning a result of nil. -- ------------------------------------------------------------------ timeout(1) -- ---------------------------------------------------------------- -- the code that resumes this coroutine will pass in the result in -- coroutine.resume(), which is assigned to dbresult -- ---------------------------------------------------------------- dbresult = coroutine.yield() -- ------------------------------------------------------------------------ -- cancel a pending timeout, if any. If we did timeout, this is a NOP. If -- we haven't timed out, this prevents the timeout from triggering. -- ------------------------------------------------------------------------ timeout(0) if not dbresult then info = failed() else info = extract(dbresult) end for retries = 1 , 3 do send_information_to_client() timeout(1) ack = coroutine.yield() timeout(0) if ack then return end end
I was able to determine that the initial “timeout” was the request to the database engine timeout trigger (after hacking in a way to track the timeout requests) which meant the timeout code wasn't working properly. Once that was clear, I was able to locate the problem code:
-- ------------------------------------------------------------------------ -- TQUEUE contains references to coroutines that want a timeout. If we set -- the timeout to 0, this just sets the awake field to 0, which informs the -- coroutine scheduler not to run the coroutine when it "times out". -- ------------------------------------------------------------------------ function timeout(when) local f = coroutine.running() if when == 0 then if TQUEUE[co] then TQUEUE[co].awake = 0 -- oops, bug. See this post for details end else TQUEUE:insert(when,f) end end
Here we have an instance where a
static compiler would have flagged this immediately. I either need to
rename f
to co
, or change co
to
f
. A simple two line fix (and I really need to find a
code checker for Lua).
The other issue, now quickly found, was this bit of code:
local function handler(info,err,remote,message,startime) ... local ninfo,err,nremote,nmessage = coroutine.yield() timeout(0) if err == 0 then if not message.request.status then -- error else return nmessage end else ...
Even a static compiler would not have found this issue—I was checking the wrong variable! The code should read:
if not nmessage.request.status then
Bad me for using a bad name for a variable. No cookie for me!
Okay, this is a simple one line change. Total lines changed: 3.