Okay, what failed this time?

Thursday, January 19, 2012

I'm running the regression tests for “Project: Wolowizard” and about halfway through (around the two hour mark or so) they start failing. Sometimes expected results just aren't showing up. I'm freaking out a bit because of all the issues we've had in running these tests, only for them to start failing in yet a different way.

Now, a bit about how this all works—there are four computers involved; one runs the tests, injecting messages towards a mini-cluster of two machines, either of which (depending on which one gets the message) sends a message to the fourth machine, which does a bunch of processing (which may involve interaction with a simulated cell phone on the testing machine), then responds back to the mini-cluster, which then responds back to the testing machine.

Now, I can check the immediate results from the mini-cluster, but the actual data I'm interested in is logged via syslog, so I have that data forwarded to the testing machine and my code grovels through a log file for the actual data I want. And it's that data (or part thereof) that apparently isn't being logged, and thus, the tests are failing.

Now, it just so happens that the part of the test that's failing is the part dealing with the mini-cluster, and it looks like about half the tests are failing (hmm …).

I log into each of the two computers comprising the mini-cluster, and check /etc/syslog.conf, on the off chance that it changed. Nope. I then explain the problem to Bunny, standing (or rather, sitting) in as my cardboard programmer, when it hits me: I should check to see if syslogd is running.

Rats. It is.

The tests are still failing, and my shoes began to squeak.

Okay, just because syslogd is running doesn't necessarily mean it's running correctly. So I run logger -p local1.info FOO on each machine and yes, one of the machines is failing to forward the logs to the testing machine.
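That check can be made a little less ambiguous than a bare FOO. What follows is only a sketch, not what I actually ran: the helper names marker_seen and make_marker are made up for illustration, and the log file path in the comments is an assumption, since it depends on where the testing machine's syslogd writes local1 messages.

```shell
# Hypothetical helper: build a marker unique to this host and moment,
# so a stale FOO from an earlier run can't be mistaken for a fresh one.
make_marker() {
    echo "FWDTEST-$(uname -n)-$(date +%s)"
}

# Hypothetical helper: succeed if the marker appears in the given log file.
# -F treats the marker as a fixed string, -q suppresses output.
marker_seen() {
    grep -qF -- "$1" "$2"
}

# Intended use (not executed here; the log path is an assumption):
#   on each mini-cluster machine:
#       m=$(make_marker); echo "$m"; logger -p local1.info "$m"
#   on the testing machine, with the marker printed above:
#       marker_seen "FWDTEST-..." /var/log/local1.log && echo forwarded
```

A marker that names the sending host makes it obvious which of the two machines went quiet.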

Ahah!

I restart syslogd on that system, and lo! The log entries are getting through now.

You know, I expect there to be issues with the stuff I'm testing; what I don't expect is for the stuff that we didn't write to have issues (the Protocol Stack From Hell™ notwithstanding).

Okay, reset everything and start the regression test over again …

Update in the wee hours of the morning, Friday, January 20th, 2012

A bit over halfway through the regression tests, and the log files rotate. Aaaaaaaaaah! Okay, reset all the data, and start from the last failed test. That's easy, since I can specify which cases to run. That's hard, because I have to specify nearly 100 cases. That's easy, since I can use the Unix command seq to list them. That's hard, because the test cases aren't just numbers, but things like “1.b.77” and “1.c.18”, and while the shell supports command line expansion from a running program via the backtick (à la for i in `seq 34 77`; do echo 1.b.$i; done) I need to nest two such operations (echo `for i in `seq 34 77`;do echo 1.b.$i; done`) to specify the test cases from the command line, and plain backticks don't nest like that. Okay, I can create a temporary file that lists the test cases …
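For the record, a POSIX shell can nest command substitutions, just not with bare backticks. The $(...) form nests as written, and backticks nest if the inner pair is backslash-escaped. A minimal sketch, assuming a POSIX-style shell with seq available; the variable names cases and cases_bt are mine, and since the test runner's command line isn't shown here, the list is only echoed rather than handed to anything:

```shell
# Build the list 1.b.34 ... 1.b.77 with nested $(...) substitution.
cases=$(for i in $(seq 34 77); do echo "1.b.$i"; done)

# Same list with backticks: the inner pair must be backslash-escaped.
cases_bt=`for i in \`seq 34 77\`; do echo "1.b.$i"; done`

# Left unquoted on purpose, so the newline-separated list is word-split
# and printed on one line, ready to paste onto a command line.
echo $cases
```

The temporary file still has its uses: it survives a mistyped command, and the list can be edited by hand when the cases to rerun aren't one contiguous range.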

The Boston Diaries
