Wednesday, January 29, 2020
“This is an invalid protocol because I can't open a file”
TS2 comes to my desk at the Ft. Lauderdale Office of the Corporation. “I'm running a load test of ‘Project: Cleese’ and it's not functioning. If I run a normal test, it runs fine.”
“Hmm … let me take a look.” I head back to TS2's desk. Sure enough, “Project: Cleese” is crashing under load. Well, not a hard crash—it is written in Lua and what's crashing are individual coroutines due to an uncaught error (the Lua equivalent of exceptions) where the only information being reported is “invalid protocol.” I have TS2 send me copies of the data files and script he's using to load test, and I'm able to reproduce the issue. It's an odd problem, because it appears to be crashing on this line of code:
local sock,err = net.socket(addr.family,'tcp')
I dive in, and isolate the issue to this bit of C code that's part of the net.socket() function:
if (getprotobyname_r(proto,&result,tmp,sizeof(tmp),&presult) != 0) return luaL_error(L,"invalid protocol");
Odd, “tcp” is a valid protocol, so I shouldn't be getting ENOENT, and the buffer used to store data is large enough (because normally it works fine) so I don't think I'm getting ERANGE. And that covers the errors that getprotobyname_r() is documented to return.
I add some logging to see what error I'm actually getting.
I'm getting “Too many open files” and it suddenly all makes sense.
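For illustration, here is a minimal sketch of that kind of logging in plain C. It is not the actual net.socket() code: the helper name lookup_protocol(), the 1,024-byte buffer, and the message format are all assumptions made for the example. The idea is simply to keep the value getprotobyname_r() returns and run it through strerror() instead of collapsing every failure into “invalid protocol.”

```c
#define _DEFAULT_SOURCE   /* exposes getprotobyname_r() in glibc */
#include <errno.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the original net.socket() code).
 * Looks up a protocol by name and, on failure, logs the actual
 * error instead of a generic message.
 * Returns 0 and stores the protocol number in *proto_out on success,
 * otherwise returns an errno-style code. */
int lookup_protocol(const char *name, int *proto_out)
{
    struct protoent  result;
    struct protoent *presult = NULL;
    char             tmp[1024];   /* assumed scratch buffer size */
    int              rc;

    rc = getprotobyname_r(name, &result, tmp, sizeof(tmp), &presult);

    /* A lookup that finds nothing may come back as 0 with a NULL
     * result, so treat that case as "not found". */
    if (rc == 0 && presult == NULL)
        rc = ENOENT;

    if (rc != 0) {
        fprintf(stderr, "getprotobyname_r(\"%s\") failed: %s (%d)\n",
                name, strerror(rc), rc);
        return rc;
    }

    *proto_out = presult->p_proto;
    return 0;
}
```

In a Lua binding, the same strerror(rc) text could presumably be handed to luaL_error() in place of the fixed string, so the coroutine's error carries the real cause.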
getprotobyname_r() is using some data file (probably /etc/protocols) to translate “tcp” to the actual protocol value, but it can't open the file because the program is out of available file descriptors.
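The failure mode is easy to reproduce in isolation. The sketch below is a hypothetical helper, errno_when_descriptors_run_out(), written for this example only: it temporarily lowers the soft descriptor limit, opens /dev/null until open() fails, and returns the errno from that failure. The cap value and the use of /dev/null are assumptions; any file-opening call made at that point, including one buried inside a library routine, should hit the same wall.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

#define MAX_TRACKED 256   /* assumed upper bound; keep cap below this */

/* Hypothetical helper: lower the soft RLIMIT_NOFILE to `cap`
 * (never raising it), open /dev/null until open() fails, then
 * clean up and restore the original limit.
 * Returns the errno of the failing call, or 0 if nothing failed
 * within MAX_TRACKED opens. */
int errno_when_descriptors_run_out(rlim_t cap)
{
    struct rlimit saved;
    struct rlimit lowered;
    int           fds[MAX_TRACKED];
    int           count   = 0;
    int           failure = 0;

    if (getrlimit(RLIMIT_NOFILE, &saved) != 0)
        return errno;

    lowered = saved;
    if (cap < lowered.rlim_cur)
        lowered.rlim_cur = cap;
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
        return errno;

    /* Each successful open() consumes one descriptor, just as each
     * network connection does. */
    while (count < MAX_TRACKED) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0) {
            failure = errno;
            break;
        }
        fds[count++] = fd;
    }

    /* Release everything and put the original limit back. */
    while (count > 0)
        close(fds[--count]);
    setrlimit(RLIMIT_NOFILE, &saved);

    return failure;
}
```

Calling errno_when_descriptors_run_out(32) on Linux should return EMFILE, which strerror() renders as “Too many open files.”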
“Project: Cleese” is out of file descriptors because each network connection counts as a file descriptor, and the test systems (Linux in this case) only allow 1,024 descriptors per process. It's easy enough to up that to some higher value (I did 65,536) and sure enough, the “Too many open files” error starts showing up where I expect it to.
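The post doesn't say how the limit was raised (a shell's ulimit -n is the usual route), but a process can also raise its own soft limit at startup. The following is a sketch under that assumption, with a hypothetical helper name raise_fd_limit(); it asks for a target such as 65,536 and settles for the hard limit if the target is out of reach, since an unprivileged process can't go past that.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft RLIMIT_NOFILE toward `wanted`,
 * clamped to the hard limit. Never lowers an already-higher limit.
 * Returns 0 on success, -1 if the limit could not be read or set. */
int raise_fd_limit(rlim_t wanted)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;

    /* The soft limit may not exceed the hard limit. */
    if (wanted > lim.rlim_max)
        wanted = lim.rlim_max;

    /* Already at or above the target: nothing to do. */
    if (lim.rlim_cur >= wanted)
        return 0;

    lim.rlim_cur = wanted;
    return setrlimit(RLIMIT_NOFILE, &lim) == 0 ? 0 : -1;
}
```

A call like raise_fd_limit(65536) early in main() would do it. Raising the limit only moves the ceiling, of course; a heavy enough load test will still reach it eventually.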
On the plus side, it's not my code. On the minus side, you have to love those leaky abstractions (and perhaps relying upon documentation a bit too much).