Wednesday, March 05, 2014
The Case Of The Missing Core Files
At work, I test the various components of “Project: Wolowizard.” These tests usually require
running multiple copies of a program on a single computer. I use Lua (with help from a module)
to start and monitor the programs being tested. The code starts N copies,
and if any of the programs crash, the reason is logged. It's fairly
straightforward code.
Now, one of the components of “Project: Wolowizard” was updated to support a new project (“Project: Sippy-Cup”) and that component is occasionally crashing on an assert, but the problem is: there are no core files to check.
And I've spent the past two days trying to figure out why there are no core files to check.
The first culprit: have we told the system not to generate core
files? Yup. The account under which the program runs (root) has a core
file size limit of zero bytes. There are a few ways to fix this, and I
picked what, to me, was the simplest solution: in the Lua script that
runs the programs, set the core file size to “unlimited.” And this is easy
enough to do:
    proc = require "org.conman.process"
    proc.limits.hard.core = "inf"
    proc.limits.soft.core = "inf"
Slight digression: you can set resource limits on everything from
maximum memory usage to core file size. Each limit has a hard value and a
soft value. Any process can lower its limits, and can raise a soft limit as
far as the hard limit, but only a process running as root can raise a hard
limit. Since the program I'm running is running as root, setting both the
hard and soft limits to “infinity” is easy.
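For comparison, here is roughly what the same thing looks like in plain C with getrlimit() and setrlimit(). This is a minimal sketch and not the code inside the Lua module; the function names enable_core_files() and unlimited_core_files() are made up for illustration.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file limit to the current
 * hard limit. This needs no privileges. Returns 0 on success, -1 on
 * error (with errno set by the failing call). */
int enable_core_files(void)
{
  struct rlimit limit;

  if (getrlimit(RLIMIT_CORE, &limit) < 0)
    return -1;

  limit.rlim_cur = limit.rlim_max;
  return setrlimit(RLIMIT_CORE, &limit);
}

/* Hypothetical helper: lift both the hard and soft limits to
 * "unlimited". Raising a hard limit normally only works as root. */
int unlimited_core_files(void)
{
  struct rlimit limit =
  {
    .rlim_cur = RLIM_INFINITY,
    .rlim_max = RLIM_INFINITY
  };

  return setrlimit(RLIMIT_CORE, &limit);
}
```

The first function is what an ordinary account can do; the second is the equivalent of the two Lua assignments and should fail with EPERM for a non-root process whose hard limit is lower.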
But there was still a disturbing lack of core files.
I checked the code of the Lua module I was using, and yes, I flubbed the parsing code. I made the fix, my tests showed I got the logic right, installed the updated module and still, no core files.
I did a bunch more tests and checked off the following reasons for the lack of core files: it wasn't because the program dropped permissions; it wasn't because the program couldn't write the core file in its current working directory; and the program is not setuid. It was clear there was something wrong with the module.
I was able to isolate the issue to the following:
    struct rlimit limit;
    lua_Number    ival;

    /* ... */

    if (lua_isnumber(L,3))
      ival = lua_tonumber(L,3);

    /* ... */

    if (ival >= RLIM_INFINITY)
      ival = RLIM_INFINITY;
    limit.rlim_cur = ival;
Now, lua_Number is of type double (a floating point value), and
limit.rlim_cur is some form of integer. ival was properly HUGE_VAL (the C
floating point equivalent of “infinity”) but limit.rlim_cur was 0.
But it worked on my home system just fine.
Then it dawned on me: my home system was a 32-bit system! That was the
system where I did the patch and initial test; the systems at work are all
64-bit systems. Some digging revealed that the definition of RLIM_INFINITY
on the 64-bit system was

    ((unsigned long int)(~0UL))

or in other words: the largest unsigned long integer value. And on a 64-bit system, an unsigned long integer is 64 bits in size.
I do believe I was bit by an IEEE 754 floating point implementation detail.
Lua treats all numbers as type double, and on modern systems, that means
IEEE 754 floating point. A double can store 53-bit integers without loss,
and on a 32-bit system, you can pass integer values into and out of a
double without issue (32 being less than 53). Because I did my initial
testing of the Lua module on a 32-bit system, there was no issue.
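The 53-bit claim is easy to check. The sketch below assumes IEEE 754 doubles with the default round-to-nearest mode; survives_double() is a name invented for this example. It reports whether an integer comes back unchanged after a trip through a double.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if value is unchanged after being stored
 * in a double and read back out. */
bool survives_double(uint64_t value)
{
  double d = (double)value;

  /* Values near UINT64_MAX round up to exactly 2^64, which no longer
   * fits in a uint64_t; converting that back is undefined in C, so
   * report failure instead of trying. */
  if (d >= 18446744073709551616.0)
    return false;

  return (uint64_t)d == value;
}
```

Under those assumptions, every 32-bit value and 2^53 itself survive, while 2^53 + 1 and UINT64_MAX do not.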
But on a 64-bit system … it gets interesting. Doing some empirical
testing, I found the largest integer value you can store into a double and
get something out is 18,446,744,073,709,550,591 (and what you get out is
18,446,744,073,709,549,568; I'll leave the reason for the discrepancy to
the reader); anything larger, you get zero back out.
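That experiment can be repeated with something like the following. It is a sketch that assumes IEEE 754 doubles and round-to-nearest, and the function names are made up here. It deliberately never converts a double of 2^64 or more back to an integer, because C leaves that conversion undefined; the zero above is what these particular systems happened to produce, not something the language promises.

```c
#include <stdint.h>

#define TWO_POW_64 18446744073709551616.0  /* exactly 2^64 as a double */

/* Hypothetical helper: scan down from UINT64_MAX for the largest
 * integer whose double representation is still below 2^64. */
uint64_t largest_value_below_2_64(void)
{
  uint64_t v = UINT64_MAX;

  while ((double)v >= TWO_POW_64)
    v--;

  return v;
}

/* Hypothetical helper: store an integer in a double and read it back.
 * Only call this with values for which (double)value < 2^64. */
uint64_t round_trip(uint64_t value)
{
  return (uint64_t)(double)value;
}
```

Calling round_trip(largest_value_below_2_64()) on such a system should give back the two numbers quoted above: the scan stops at 18,446,744,073,709,550,591, and the round trip returns 18,446,744,073,709,549,568.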
So, no wonder I wasn't getting any core files! I was inadvertently setting the core file size to zero bytes!
Sigh.
Off to fix the code …
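One possible shape for the fix, sketched here as an assumption rather than as what actually went into the module: do the range check in floating point, then assign the integer constant directly instead of sending RLIM_INFINITY through a double first. number_to_rlimit() is a hypothetical name, and the sketch assumes RLIM_INFINITY is the largest value an rlim_t can hold (true of the definition quoted above).

```c
#include <sys/resource.h>

/* Hypothetical helper: map a Lua number (a double) onto an rlim_t
 * without ever converting an out-of-range double to an integer. */
rlim_t number_to_rlimit(double value)
{
  /* NaN and anything not strictly positive become a limit of zero. */
  if (!(value > 0.0))
    return 0;

  /* (double)RLIM_INFINITY may round up past the largest rlim_t, so
   * anything at or above it, including HUGE_VAL, gets the integer
   * constant itself rather than a converted double. */
  if (value >= (double)RLIM_INFINITY)
    return RLIM_INFINITY;

  /* What is left is strictly below (double)RLIM_INFINITY and so fits. */
  return (rlim_t)value;
}
```

With this, limit.rlim_cur = number_to_rlimit(ival) would give RLIM_INFINITY for “inf” on both 32-bit and 64-bit systems, and ordinary sizes pass through unchanged.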