The Boston Diaries

Thursday, February 05, 2015

Bug hunt

Today T, a fellow cow-order here at The Ft. Lauderdale Office of The Corporation, found a particularly nasty bug in “Project: Sippy-Cup”—two identical SIP packets caused the request to fail. It uses UDP for the network transport, so duplicate packets are a possibility (so are dropped packets, but that's already being handled properly), something that the code didn't handle properly.

Now, “Project: Sippy-Cup” is written in Lua. I only say this because of how I handle requests. The main loop of the program waits for network packets using poll() (or epoll_wait() if running under Linux; they're like calls to select() (which is all you ever see in Unix networking tutorials because everybody is basing the code off examples written in the early 90s but I digress) only less CPU intensive). When poll() returns, I either create a coroutine or resume a coroutine (depending on where we are in the request) to handle the request.

I did it this way, even though I have to manually switch coroutines, to make the processing of requests easier to follow:

function handle(request)
  if not validate(request)
    socket:send(request.sender,"400 Bad client!  No cookie for you!")
    return
  end
  
  socket:send(request.sender,"200 Good boy!  You're such a good boy!")
  
  local info    = lookup_details(request)	-- blocking call
  local contact = dns.lookup(request.contact)	-- blocking call
  
  socket:send(contact,info)
  local reply = socket:recv()			-- blocking call
  
  if not reply.okay then
    log("I messed up")
  end
end

rather than having all this logic spread across multiple functions where each function represents a step in the processing:

-- not shown, the code that receives packets and calls the appropriate
-- routine.

function handle_incoming(state)
  if not validate(state.request)
    socket:send(state.request.sender,"400 Bad client!  No cookie for you!")
    return 
  end
  
  socket:send(state.request.sender,"200 Good boy!  You're such a good boy!")
  return lookup_details -- next function to call
end

function lookup_details(state)
  state.info = -- code to send request
  return contact_dns
end

function contact_dns(state)
  state.contact = state.request.contact
  -- code to send a DNS request
  return got_dns_response
end

function got_dns_response
  socket:send(state.contact,state.info)
  return process_reply
end

function process_reply(state)
  if not state.reply.okay then
    log("I messed up")
  end
end

It's fairly straightforward—once I get a SIP packet, I check a list to see if I've already created a coroutine based on the unique ID included in the packet. If one does exist, I then resume that coroutine; otherwise, I create one. The coroutine will run until it either finishes, or a function will block (say, a DNS query), and the code that could block will add the coroutine to a list it maintains and yield back to the main code.

The main coroutine, at 50,000 feet:

poll() returns.
It's a SIP packet: process it:
1. Get the unique ID from the SIP packet.
2. Use the ID to check the list of SIP coroutines. If a coroutine does not exist, create one and add it to the list.
3. Resume the coroutine.
4. If the coroutine has ended, remove the reference from the list of SIP coroutines; otherwise
5. The coroutine has called coroutine.yield(), so keep the reference arond.
It's another type of packet (see below)
Go back to step 1.

Any routine that might block (and they're all network related) will put the coroutine on a list, and yield. So, when call dns.lookup(), that function does:

Do have have the data cached? If so, return it.
Otherwise, create the DNS request.
Send the DNS request to a DNS server.
Place the coroutine on a DNS list, indexed with the DNS request ID.
Call coroutine.yield(), which returns us to the main loop.
We resume here, so remove the coroutine from the DNS list.
Cache the data.
Return the data.

So, step 3 of the main coroutine contains these sub-steps:

Get the unique ID from the DNS packet.
Get the coroutine from the DNS list, indexed by the ID.
Resume that coroutine.

The lookup_detail() function (the handling of which would be step 4, which isn't shown) is similar to the DNS lookup. This all works and while the main driver code is a bit complex (and the complexity has to live somewhere), the actual processing code is nice and linear, making it easier to change the “business logic” in one place rather than have it scattered throughout the codebase.

Now we get to the bug—mishandling of duplicate packets. The first SIP packet comes in, has a unique ID, so a coroutine is created, we get to the call to lookup_details() and we yield, waiting for a reponse from our database engine (not a real database engine, but close enough). The duplicate SIP packet comes in. We already have a coroutine associated with this SIP ID, so re blindly resume it. The coroutine, freshly resumed, assumes the reply from the database has arrived and things go down hill from there (but only for that request; the program is still running and can handle new requests).

The solution is to realize we have a duplicate request and not resume the coroutine in that case.

And like most bugs, the fix is trivial—it's the hunt that's difficult.

Thursday, February 05, 2015

Bug hunt

Obligatory Picture

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer