At The Enterprise, QA asked if they could have a tool that starts all our stuff up so they can do some performance tests (there are reasons they're asking for this, and why I agree with them, that go beyond the scope of this entry). I replied I would see what I could do; it can't be any harder than what I've done so far. And then I came across an interesting bug.
The program will take our existing test cases, generate all the data, and output a list of all the phone numbers so QA can use whatever they use to generate appropriate traffic. Then it will start up all the appropriate programs and just sit there, monitoring the processes so that if any of them stop, it stops the rest. And then QA can run whatever they run to inject requests into the maelstrom at whatever rate they see fit.
The bug in question: due to how the code was being written, I was slowly moving the code that catches two signals, SIGINT (the interrupt signal) and SIGCHLD (a child process has terminated), closer and closer to the start of the program (for various reasons not germane to this entry). At one point, the program was always stopping because it thought one of the programs being tested had crashed when it hadn't.
I was able to isolate it. This code worked:

    local tests = load_tests(arg)
    signal.catch('child')
    signal.catch('int')

but this code didn't:

    signal.catch('child')
    signal.catch('int')
    local tests = load_tests(arg)
I then had a look at load_tests() to see what in the world might be going on, when I saw this:
    os.execute("/bin/rm -rf dump/")

    -- other code

    local foo = io.popen("mkfoo lnp.dat","w")
    local bar = io.popen("mkbar sup.dat","w")

    -- other code
I was executing other programs to generate the data, and those child processes exiting were sending SIGCHLD signals that the program was not expecting.
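The failure mode is easy to reproduce outside Lua, since it's really a property of POSIX signals, not of the signal module. Here's a minimal C sketch (names are mine, not from the actual program) showing that once a SIGCHLD handler is installed, *any* child the process spawns, even a helper launched just to generate data, fires the handler when it exits:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static volatile sig_atomic_t got_sigchld = 0;

    static void on_sigchld(int sig)
    {
      (void)sig;
      got_sigchld = 1;   /* just note that the signal arrived */
    }

    int main(void)
    {
      /* Install the handler first---the position the Lua code ended up in
         after signal.catch('child') migrated to the top of the program. */
      struct sigaction sa;
      sa.sa_handler = on_sigchld;
      sa.sa_flags   = 0;
      sigemptyset(&sa.sa_mask);
      sigaction(SIGCHLD,&sa,NULL);

      /* Spawn a short-lived helper, analogous to the os.execute() and
         io.popen() calls inside load_tests(). */
      pid_t pid = fork();
      if (pid == 0)
        _exit(0);        /* child exits immediately */

      int status;
      waitpid(pid,&status,0);

      /* The helper's exit delivered SIGCHLD to us, exactly as the data
         generators did to the test driver. */
      printf("SIGCHLD handler ran: %s\n",got_sigchld ? "yes" : "no");
      return 0;
    }

Run it and the handler reports `yes`: the kernel doesn't distinguish between "a program under test died" and "a throwaway data-generation helper finished", so a monitor that treats every SIGCHLD as a crash will trip over its own setup code.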
Huh … leaking abstractions for the bugs!