The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Thursday, February 12, 2015

Sixteen hours to change three lines of code

The bug report was simple enough: “Project: Sippy-Cup” would, when it had to retransmit the data (remember, we're running over UDP), it would send an additional packet.

Data flow for “Project: Sippy-Cup”
Sippy CupdirectionClient
DATA(packet dropped)
DATA(packet received)
(packet recieved)ACK
DATA 

So far, the bugs found in “Project: Sippy-Cup” have been easy to reproduce (thankfully!) and easy to fix once identified. This one was easy to reproduce … and took two days of intense effort to fix. The bug is not a show-stopper, and the client should (there's that word) handle the duplicate packet. My manager, S, was subtly hinting that I should put this on the backburner, but this is one of those problems that just irks me.

I added logging statements. I instrumented the code. I traced the execution of the program line-by-line (easy enough to do within Lua as there's a pre-defined method for hooking into the execution engine, and even though this didn't reveal the root cause of the bug, it did reveal a few places where some simple changes optimized the code (but that's for another post). For example, here's all the code that's executed when you run the luarocks command, but fair warning—it's nearly 7,000 lines long) and it became clear that something was wrong with the timeout handling (what I didn't mention the other day about scheduling coroutines is how I handle timeouts, but alas, that's for another post as well). The code is basically:

request = receive_packet_from_client()
send_ack()
send_request_to_db(request)

-- ------------------------------------------------------------------
-- This sets things up to force the coroutine to resume in one second,
-- returning a result of nil.
-- ------------------------------------------------------------------

timeout(1)

-- ----------------------------------------------------------------
-- the code that resumes this coroutine will pass in the result in
-- coroutine.resume(), which is assigned to dbresult
-- ----------------------------------------------------------------

dbresult = coroutine.yield()

-- ------------------------------------------------------------------------
-- cancel a pending timeout, if any.  If we did timeout, this is a NOP.  If
-- we haven't timed out, this prevents the timeout from triggering.
-- ------------------------------------------------------------------------

timeout(0)

if not dbresult then
  info = failed()
else
  info = extract(dbresult)
end

for retries = 1 , 3 do
  send_information_to_client()
  timeout(1)
  ack = coroutine.yield()
  timeout(0)
  if ack then
    return
  end
end

I was able to determine that the initial “timeout” was the request to the database engine timeout trigger (after hacking in a way to track the timeout requests) which meant the timeout code wasn't working properly. Once that was clear, I was able to locate the problem code:

-- ------------------------------------------------------------------------
-- TQUEUE contains references to coroutines that want a timeout.  If we set
-- the timeout to 0, this just sets the awake field to 0, which informs the
-- coroutine scheduler not to run the coroutine when it "times out".  
-- ------------------------------------------------------------------------

function timeout(when)
  local f = coroutine.running()

  if when == 0 then
    if TQUEUE[co] then
      TQUEUE[co].awake = 0 -- oops, bug.  See this post for details
    end
  else
    TQUEUE:insert(when,f)
  end
end

Here we have an instance where a static compiler would have flagged this immediately. I either need to rename f to co, or change co to f. A simple two line fix (and I really need to find a code checker for Lua).

The other issue, now quickly found, was this bit of code:

local function handler(info,err,remote,message,startime)
  ... 
  local ninfo,err,nremote,nmessage = coroutine.yield()
  timeout(0)

  if err == 0 then
    if not message.request.status then
      -- error
    else
      return nmessage
    end
  else 
  ...

Even a static compiler would not have found this issue—I was checking the wrong variable! The code should read:

if not nmessage.request.status then

Bad me for using a bad name for a variable. No cookie for me!

Okay, this is a simple one line change. Total lines changed: 3.

Obligatory Picture

[“Only the highest fidelity images are used for identification purposes!”

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2018 by Sean Conner. All Rights Reserved.