Wednesday, April 07, 2010
Dependencies and side effects
“Well, that's your problem,” I said, looking at my computer sitting there, powered off. Bunny had been unable to check her bank account and thinking our network connection was bad, powercycled the router and DSL unit, but was still unable to connect. The real issue was my computer being off—all the computers here at Chez Boca use my computer for DNS resolution.
“Yeah, something happened with the power, but I'm not sure what,” said
Bunny. And yes, it was odd; none of the clocks are blinking “12:00” and
from what it sounds like, she didn't hear any USP alarms (I think she would
have mentioned those if she did hear them) so something odd did happen. And
later (to get ahead of the story a bit) when I did check the UPS logs, I found (output
from syslogintr
):
/dev/log | apcupsd | daemon info | Apr 04 18:23:58 | 000.0,000.0,000.0,27.10,00.00,40.0,00.0,000.0,000.0,122.0,100.0,0 /dev/log | apcupsd | daemon info | Apr 04 18:29:00 | 000.0,000.0,000.0,26.60,00.00,40.0,00.0,000.0,000.0,120.0,100.0,1 /dev/log | apcupsd | daemon info | Apr 04 18:34:02 | 000.0,000.0,000.0,26.43,00.00,40.0,00.0,000.0,000.0,120.0,100.0,0 /dev/log | apcupsd | daemon info | Apr 04 18:39:05 | 000.0,000.0,000.0,26.27,00.00,40.0,00.0,000.0,000.0,120.0,100.0,1 /dev/log | apcupsd | daemon info | Apr 04 18:44:07 | 000.0,000.0,000.0,26.27,00.00,40.0,00.0,000.0,000.0,122.0,100.0,0 /dev/log | apcupsd | daemon info | Apr 04 18:49:10 | 000.0,000.0,000.0,26.27,00.00,40.0,00.0,000.0,000.0,121.0,100.0,1 /dev/log | apcupsd | daemon info | Apr 04 18:54:12 | 000.0,000.0,000.0,26.27,00.00,46.0,00.0,000.0,000.0,120.0,100.0,0 /dev/log | apcupsd | daemon info | Apr 04 18:59:15 | 000.0,000.0,000.0,26.27,00.00,45.0,00.0,000.0,000.0,120.0,100.0,1 /dev/log | apcupsd | daemon info | Apr 04 19:04:17 | 000.0,000.0,000.0,26.27,00.00,46.0,00.0,000.0,000.0,120.0,100.0,0 /dev/log | apcupsd | daemon info | Apr 04 19:09:20 | 000.0,000.0,000.0,26.10,00.00,46.0,00.0,000.0,000.0,122.0,100.0,1 /dev/log | apcupsd | daemon info | Apr 04 19:14:22 | 000.0,000.0,000.0,26.10,00.00,46.0,00.0,000.0,000.0,122.0,100.0,0 /dev/log | apcupsd | daemon info | Apr 04 19:19:24 | 000.0,000.0,000.0,26.10,00.00,46.0,00.0,000.0,000.0,121.0,100.0,1 /dev/log | apcupsd | daemon info | Apr 04 19:24:27 | 000.0,000.0,000.0,26.10,00.00,46.0,00.0,000.0,000.0,121.0,100.0,0 /dev/log | apcupsd | daemon info | Apr 04 19:29:29 | 000.0,000.0,000.0,26.10,00.00,46.0,00.0,000.0,000.0,122.0,100.0,1 /dev/log | apcupsd | daemon info | Apr 04 19:34:32 | 000.0,000.0,000.0,26.10,00.00,40.0,00.0,000.0,000.0,122.0,100.0,0 /dev/log | apcupsd | daemon info | Apr 04 19:39:34 | 000.0,000.0,000.0,26.10,00.00,40.0,00.0,000.0,000.0,121.0,100.0,1 /dev/log | apcupsd | daemon info | Apr 04 19:44:37 | 000.0,000.0,000.0,26.10,00.00,40.0,00.0,000.0,000.0,122.0,100.0,0 /dev/log | apcupsd | daemon info | Apr 04 19:50:36 | 000.0,000.0,000.0,26.10,00.00,40.0,00.0,000.0,000.0,121.0,099.0,1 /dev/log | apcupsd | daemon info | Apr 04 19:55:53 | 000.0,000.0,000.0,25.94,00.00,40.0,00.0,000.0,000.0,121.0,098.0,0 /dev/log | apcupsd | daemon info | Apr 04 20:00:56 | 000.0,000.0,000.0,25.94,00.00,40.0,00.0,000.0,000.0,122.0,098.0,1
So, starting just past 18:23:58 on April 4th something odd was happening to my UPS (which all the computers in The Home Office are hooked up to), causing the battery voltage (normally 27.1v) and battery charge (100.0 percent) to drop. And those values kept dropping until:
/dev/log | apcupsd | daemon info | Apr 07 12:31:05 | 000.0,000.0,000.0,25.60,00.00,36.0,00.0,000.0,000.0,119.0,084.0,0 /dev/log | apcupsd | daemon info | Apr 07 12:36:08 | 000.0,000.0,000.0,25.60,00.00,36.0,00.0,000.0,000.0,121.0,084.0,1 /dev/log | apcupsd | daemon info | Apr 07 12:41:10 | 000.0,000.0,000.0,25.60,00.00,36.0,00.0,000.0,000.0,119.0,084.0,0
The battery voltage fell to 25.6V and the battery charge to 84% and past that … well, nothing because the computers lost power at that point (or anywhere up til 12:46:12 when the next log should have appeared). So no real clues as to what happened with the power, but I digress—back to the story.
I hit the power button on my computer, it bitches about the disk being
corrupt (which I ignore, running a journaled filesystem) and when it gets to
starting up syslogd
(which is in reality my
syslogintr
) and klogd
(which logs messages from
the kernel to syslog()
) when it hangs.
Hmm, I thought. Perhaps I better let fsck
run,
just to be safe. Powercycle, hit 'Y' to let fsck
run for
about fifteen minutes, then watch as it hangs yet again on
syslogd/klogd
.
Now, I won't bore you with the next few hours (which basically involved
continously booting into single user mode and trying to puzzle out what was
going wrong) but in the end, the problem ended up being
syslogintr
.
Or rather, not with the actual C code in syslogintr
, but
with the Lua script it was running. It actually had to do with blocking
ssh
attempts via iptables
. See,
syslog/klogd
start up before the network is initialized (and by
extension, iptables
), and aparently, running
iptables
before the network and klogd
is
running really messes up the boot process in some odd way that locks the
system up (not that this is the first time I've seen such weird interactions
before—back in college I learned the hard way that under the right
circumstances (which happened all too often) screen
and IRC
under Irix4.0.5 would cause a kernel
panic, thus stopping the computer cold).
Once I figured that out, it was a rather simple process to
remove the ssh
blocking feature from the script, and now I'm
stuck with a weird dependency issue—or I can just remove entirely the
ssh
blocking code from syslogintr
(or at least the
Lua script it runs).
Sigh.
Notes on a conversation during an impromptu UPS test
“So what should the UPS do when the power goes out?” asked Bunny.
“It sounds an alarm, and I should have,” I said, turning to the keyboard and typing a command, “nine minutes of power.”
“Oh really?”
“Yes. Well, let's test it,” I said, getting up, and pulling the plug on the UPS. About five seconds later, it started beeping. “See?”
“Hmm … I see,” said Bunny. “And then what?”
“Well, it's enough time for either the power to come back up, or to shutdown the computers. You don't really need—”
Just then all our computers suddenly lost power.
“Oh, well that was interesting,” I said.
“I thought you said you had nine minutes.”
“Apparently, so did the UPS.”