Wednesday, Debtember 23, 2020
I solved the issue, but I'm not sure what the issue was
It's a bug whose solution I can't explain.
So I have a Lua module that enables event-driven programming.
I also have a few modules that drive TCP and TLS connections.
To make it even easier,
I have a module that presents a Lua file-like interface to network connections—functions like obj:read()
and obj:write()
.
The previous version of this interface module, org.conman.net.ios
,
used LPEG to handle line-based IO requests,
as well as an extension to read headers from an Internet message.
Given the overhead of LPEG I thought I might try using the built-in pattern matching of Lua.
I reworked the code and a bench mark did show a decent and measurable improvement in speed and memory usage.
But the new code failed when transferring a sizable amount of data (about 6.7M) over TLS. It took about two days to track down the problem, and I still don't have a root cause. The code works, but I don't know why it works. And that bugs me.
To further complicate matters, the code did work when I download the data from a server I wrote (using the same Lua code as the client) but it would fail when I tried downloading the data from another server (different TLS implementation, different language, etc.).
I was able to eventually isolate the issue down one function in org.conman.net.ios
.
Here was the original code:
local function write(ios,...) if ios._eof then return false,"stream closed",-2 end local output = "" for i = 1 , select('#',...) do local data = select(i,...) if type(data) ~= 'string' and type(data) ~= 'number' then error("string or number expected, got " .. type(data)) end output = output .. data end return ios:_drain(output) end
It works,
but I didn't like that it first accumulated all the output first before writing it.
So when I rewrote org.conman.net.ios
,
I modified the function thusly:
local function write(ios,...) if ios._eof then return false,"stream closed",-2 end for i = 1 , select('#',...) do local data = select(i,...) if type(data) ~= 'string' and type(data) ~= 'number' then error(string.format("bad argument #%d to 'write' (string expected, got %s) end data = tostring(data) local okay,err,ev = ios:_drain(data) if not okay then syslog('error',"ios:_drain() = %s",err) return okay,err,ev end ios._wbytes = ios._wbytes + #data end return true end
Instead of accumulating the data into one large buffer, it outputs it piecemeal. To further confound things, this doesn't appear to have anything to do with reading, which is what I was having issues with.
The client only did one call to this function:
local okay,err = ios:write(location,"\r\n")
The request went out, I would start receiving data, but for some odd reason, the connection would just drop about 200K short of the full file (it was never a consistent amount either).
While the reading side was a different implementation,
the writing side didn't have to be different,
I just felt the second vesion to be a bit better,
and it shouldn't make a difference, right?
[There's that word! –Editor]
[What word? –Sean]
[“Should.” –Editor]
But regardless of my feelings about how that certainly can't be at fault,
I put the previous version of write()
back and lo'!
It worked!
…
I'm flummoxed!
I don't understand why the new version of write()
would cause the TLS connection to eventually fail,
but it did,
for whatever reason.
Weird.