Thursday, February 05, 2015
Bug hunt
Today T, a fellow cow-order here at The Ft. Lauderdale Office of The Corporation, found a particularly nasty bug in “Project: Sippy-Cup”—two identical SIP packets caused the request to fail. It uses UDP for the network transport, so duplicate packets are a possibility (so are dropped packets, but that's already being handled properly), something that the code didn't handle properly.
Now, “Project: Sippy-Cup” is written in Lua. I only say this because of how I handle
requests. The main loop of the program waits for network packets using
poll()
(or epoll_wait()
if running
under Linux; they're like calls to select()
(which is all you ever see in Unix networking tutorials because everybody is
basing the code off examples written in the early 90s but I digress) only
less CPU intensive).
When poll()
returns, I either create a coroutine or resume a
coroutine (depending on where we are in the request) to handle the
request.
I did it this way, even though I have to manually switch coroutines, to make the processing of requests easier to follow:
function handle(request) if not validate(request) socket:send(request.sender,"400 Bad client! No cookie for you!") return end socket:send(request.sender,"200 Good boy! You're such a good boy!") local info = lookup_details(request) -- blocking call local contact = dns.lookup(request.contact) -- blocking call socket:send(contact,info) local reply = socket:recv() -- blocking call if not reply.okay then log("I messed up") end end
rather than having all this logic spread across multiple functions where each function represents a step in the processing:
-- not shown, the code that receives packets and calls the appropriate -- routine. function handle_incoming(state) if not validate(state.request) socket:send(state.request.sender,"400 Bad client! No cookie for you!") return end socket:send(state.request.sender,"200 Good boy! You're such a good boy!") return lookup_details -- next function to call end function lookup_details(state) state.info = -- code to send request return contact_dns end function contact_dns(state) state.contact = state.request.contact -- code to send a DNS request return got_dns_response end function got_dns_response socket:send(state.contact,state.info) return process_reply end function process_reply(state) if not state.reply.okay then log("I messed up") end end
It's fairly straightforward—once I get a SIP packet, I check a list to see if I've already created a coroutine based on the unique ID included in the packet. If one does exist, I then resume that coroutine; otherwise, I create one. The coroutine will run until it either finishes, or a function will block (say, a DNS query), and the code that could block will add the coroutine to a list it maintains and yield back to the main code.
The main coroutine, at 50,000 feet:
poll()
returns.- It's a SIP packet: process it:
- Get the unique ID from the SIP packet.
- Use the ID to check the list of SIP coroutines. If a coroutine does not exist, create one and add it to the list.
- Resume the coroutine.
- If the coroutine has ended, remove the reference from the list of SIP coroutines; otherwise
- The coroutine has called
coroutine.yield()
, so keep the reference arond.
- It's another type of packet (see below)
- Go back to step 1.
Any routine that might block (and they're all network related) will put
the coroutine on a list, and yield. So, when call
dns.lookup()
, that function does:
- Do have have the data cached? If so, return it.
- Otherwise, create the DNS request.
- Send the DNS request to a DNS server.
- Place the coroutine on a DNS list, indexed with the DNS request ID.
- Call
coroutine.yield()
, which returns us to the main loop. - We resume here, so remove the coroutine from the DNS list.
- Cache the data.
- Return the data.
So, step 3 of the main coroutine contains these sub-steps:
- Get the unique ID from the DNS packet.
- Get the coroutine from the DNS list, indexed by the ID.
- Resume that coroutine.
The lookup_detail()
function (the handling of which would be
step 4, which isn't shown) is similar to the DNS lookup. This all works and while the main driver
code is a bit complex (and the complexity has to live somewhere),
the actual processing code is nice and linear, making it easier to change
the “business logic” in one place rather than have it scattered throughout
the codebase.
Now we get to the bug—mishandling of duplicate packets. The first
SIP packet comes in,
has a unique ID, so a coroutine is created, we get to the call to
lookup_details()
and we yield, waiting for a reponse from our
database engine (not a real database engine, but close enough). The
duplicate SIP packet
comes in. We already have a coroutine associated with this SIP ID, so re blindly resume
it. The coroutine, freshly resumed, assumes the reply from the database has
arrived and things go down hill from there (but only for that request; the
program is still running and can handle new requests).
The solution is to realize we have a duplicate request and not resume the coroutine in that case.
And like most bugs, the fix is trivial—it's the hunt that's difficult.