Thursday, July 03, 2008
The Case of the Non-Mounted File System
I felt like I was in an episode of House.
I found myself at The Data Center, waiting for one of our customers, R, to show up so I could let him in (he forgot the access code). While there, I was attempting to extricate a KVM cable he could use when I pulled the wrong cable and unplugged a power strip.
The upshot: I took down some of R's equipment that wasn't having problems.
Sigh.
R shows up, and we check on the equipment that experienced the unplanned power outage. One of his Linux boxes was in trouble. It was running Asterisk and it had the most amusing problem: it kept core dumping on an illegal instruction and, upon crashing, would restart itself.
But in troubleshooting that problem, it became rather apparent something else was terribly wrong:
GenericUnixRootPrompt# df
Filesystem 1K-blocks Used Available Use% Mounted on
GenericUnixRootPrompt#
Nothing mounted, but I could still see files.
fdisk showed two partitions, /dev/hda1 and /dev/hda2. fsck worked fine on /dev/hda1 but failed on /dev/hda2 since it didn't know what type of filesystem was on it. Odder still, /dev/hda1 was the boot partition, containing only the kernel and related files required for the initial operating system boot, but yet, here I was, in a shell, running Unix commands like fsck and fdisk and more.
Yet fsck and even mount had no idea what type of filesystem was on /dev/hda2.
Yet, it must be the root filesystem, which I was currently using, because /dev/hda1 didn't have fsck, mount, more, much less /bin/bash.
Worse still, what I did have, including /tmp, was in “read-only” mode.
The Asterisk crashing problem would have to wait.
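For anyone who lands in the same spot (df shows nothing, yet files are clearly there), the kernel's own view of what is mounted is usually still available in /proc/mounts, even when the table df relies on is empty. Here is a minimal sketch of pulling the root entry out of that file and checking whether it is read-only. The sample text, the /dev/root device name and the ext3 type are assumptions for illustration; none of them come from R's box.

```python
# Sketch: find what is mounted at "/" and whether it is read-only.
# Each /proc/mounts line has the fields: device, mount point, type, options, dump, pass.

def find_mount(mounts_text, mount_point="/"):
    """Return (device, fstype, options) for the LAST entry at mount_point, or None.

    The last entry wins because later mounts cover earlier ones on the same path.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, where, fstype, options = fields[:4]
        if where == mount_point:
            found = (device, fstype, options.split(","))
    return found


def is_read_only(entry):
    """True if the entry's option list contains 'ro'."""
    return entry is not None and "ro" in entry[2]


# Hypothetical sample, NOT taken from the machine in this story.
SAMPLE_MOUNTS = """\
rootfs / rootfs rw 0 0
/dev/root / ext3 ro 0 0
proc /proc proc rw 0 0
"""

root = find_mount(SAMPLE_MOUNTS)
print(root, "read-only" if is_read_only(root) else "read-write")
```

On a live Linux system you would pass in the contents of /proc/mounts instead of the sample string.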
I was able to get the box on the network and back up everything to another system. While that was chugging along (it took about an hour), I realized that the system was somehow mounting /dev/hda2, otherwise there'd be nothing to back up. Checking /etc/fstab didn't help much:
GenericUnixRootPrompt# more /etc/fstab
# This file is edited by fstab-sync - see 'man fstab-sync' for details
/dev/hdb1 /media/cdrom auto user,noauto 0 0
GenericUnixRootPrompt#
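The thing to notice in that output is what is missing: there is no line whose mount point is /. A small sketch of that check, using the exact fstab contents shown above (the function names are mine, purely for illustration):

```python
# Sketch: does an fstab have an entry for a given mount point?
# fstab fields are: device, mount point, type, options, dump, pass.

def fstab_entries(fstab_text):
    """Return the non-comment, non-blank fstab lines, split into fields."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            entries.append(fields)
    return entries


def has_mount_point(fstab_text, mount_point="/"):
    """True if any entry mounts something at mount_point."""
    return any(fields[1] == mount_point for fields in fstab_entries(fstab_text))


# The fstab exactly as it appeared on the box.
FSTAB_FROM_POST = """\
# This file is edited by fstab-sync - see 'man fstab-sync' for details
/dev/hdb1 /media/cdrom auto user,noauto 0 0
"""

print(has_mount_point(FSTAB_FROM_POST))  # False: no root filesystem entry
```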
I then checked /boot/grub/grub.conf (since something was being mounted as the root filesystem) and found that the root partition wasn't /dev/hda2 but something like /dev/VolGroup00/LogGroup00. Using that, I was able to check and remount the filesystem as read/write. I was then able to add that to /etc/fstab, reboot the system and have it come up fine, thus saving R from having to nuke-n-pave the system. How /etc/fstab ended up without the root filesystem is something I don't know (but I suspect the system may have been trying to update that file when the power was cut; hey, it's as good a theory as anything), but at least the system was back up and running.
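The trick of asking the boot loader what the root device is can be sketched roughly as follows: scan grub.conf for kernel lines and pull out the root= argument. The grub.conf text below is a made-up sample (I didn't keep the real one), the ext3 type and the "defaults 1 1" fields are assumptions, and the resulting fstab line is only what such an entry might look like, not the line actually added to R's machine.

```python
# Sketch: recover the root device from the kernel command line in grub.conf.

def kernel_root_devices(grub_conf_text):
    """Return the value of every root= argument found on 'kernel' lines."""
    roots = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        if not words or words[0] != "kernel":
            continue  # this also skips GRUB's own "root (hd0,0)" line
        for word in words[1:]:
            if word.startswith("root="):
                roots.append(word[len("root="):])
    return roots


def fstab_line(device, fstype, options="defaults"):
    """Format a root filesystem entry; fstype and options are assumptions."""
    return f"{device} / {fstype} {options} 1 1"


# Hypothetical grub.conf for illustration only.
SAMPLE_GRUB_CONF = """\
default=0
timeout=5
title Linux
    root (hd0,0)
    kernel /vmlinuz ro root=/dev/VolGroup00/LogGroup00
    initrd /initrd.img
"""

for device in kernel_root_devices(SAMPLE_GRUB_CONF):
    print(fstab_line(device, "ext3"))
```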
That just left the little problem of Asterisk continuously dumping core on an illegal instruction. A recompile of the program (since R and I thought maybe the executable was corrupted) didn't solve the problem. A compile of the latest version didn't solve the problem either, but we did notice that there were a few modules for Asterisk installed that don't come with the default install of Asterisk. And one of those modules had pentium4-sse3 in the name.
I checked the box—it was a Pentium IV with SSE2, not a Pentium IV with SSE3.
That would definitely explain the crashing.
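Checking what a CPU supports on Linux comes down to reading the flags line in /proc/cpuinfo. One gotcha: Linux lists SSE3 there as "pni" (Prescott New Instructions), not "sse3", and "ssse3" is a different extension altogether. A minimal sketch, with a made-up cpuinfo excerpt standing in for the real box:

```python
# Sketch: check CPU feature flags as reported in /proc/cpuinfo.

def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_sse3(flags):
    """Linux reports SSE3 as 'pni'; set membership avoids matching 'ssse3'."""
    return "pni" in flags


# Hypothetical excerpt for an SSE2-only Pentium 4, for illustration only.
SAMPLE_CPUINFO = """\
processor\t: 0
model name\t: Intel(R) Pentium(R) 4 CPU
flags\t\t: fpu vme de pse tsc msr mmx fxsr sse sse2 ht
"""

flags = cpu_flags(SAMPLE_CPUINFO)
print("sse2" in flags, has_sse3(flags))
```

A module built for SSE3 running on a CPU like the sample one would hit an instruction the processor doesn't know, which is exactly the illegal instruction crash described here.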
It seems that R hired someone to install a particular codec for Asterisk and they grabbed the wrong version (or rather, the version for the wrong processor). The only reason Asterisk hadn't crashed before was that the module hadn't actually been loaded into Asterisk. Well, until the reboot, that is. We removed that module and Asterisk started up fine.
And then it was time to turn to the problem that R had come to The Data Center to investigate …