Saturday, March 18, 2006
A comedic series of setbacks
[The events described herein actually happened yesterday, but technically spilled over into today, so there you go. —Ed]
Well, that was certainly pleasant.
What was supposed to be a simple upgrade dominoed into a full scale fiasco.
But first, a bit of setup.
At The Company, we have seven name servers. Two are used by all the computers here to resolve DNS queries only; no configuration changes are required on these two machines, and therefore they are not part of this story.
Four of the name servers are authoritative name servers for our domains. These machines will only respond to queries on the domains we host; all other queries (like recursive DNS queries) are ignored.
To make things easier on us, the remaining name server actually hosts all the zone files and pushes them out to the four authoritative name servers (in effect, the four authoritative name servers are slaves of this one server, but the outside world will never see this server). Therefore, we can make changes to the zone files on one server and have the changes automatically pushed out.
That still leaves the problem of new zones being added (which is a sore point with me with regard to bind). When a new zone is added, not only does the configuration of the one master server need changing, but the configuration files of the four authoritative (aka “slave”) servers also require changing. While I have a script that will generate all five configuration files, we still need to copy four of the configuration files out to their respective servers.
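To make the per-zone duplication concrete, here is a minimal sketch of the kind of stanza each server needs for a new zone. It is not the actual generator script: the function name `zone_stanza`, the master address (192.0.2.1 is a documentation-only address) and the file paths are all assumptions made up for illustration.

```shell
#!/bin/sh
# Sketch: emit a BIND zone stanza for either the hidden master or a slave.
# MASTER_IP and the "master/" and "slave/" file paths are placeholders,
# not the layout used on the real servers.

MASTER_IP="192.0.2.1"

zone_stanza() {
    role=$1   # "master" or "slave"
    zone=$2   # e.g. example.com
    case "$role" in
        master)
            printf 'zone "%s" {\n\ttype master;\n\tfile "master/%s";\n};\n' \
                "$zone" "$zone"
            ;;
        slave)
            printf 'zone "%s" {\n\ttype slave;\n\tmasters { %s; };\n\tfile "slave/%s";\n};\n' \
                "$zone" "$MASTER_IP" "$zone"
            ;;
        *)
            echo "unknown role: $role" >&2
            return 1
            ;;
    esac
}

# Example: print the stanza a slave would need for one new zone.
zone_stanza slave example.com
```

Every new zone means one `master` stanza on the hidden server and one `slave` stanza on each of the four authoritative servers, which is why the configuration files have to be regenerated and redistributed each time.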
So I wanted to automate the copying of the new configuration files and restarting the name servers on all the authoritative name servers when the configuration files are created. And to do that, I needed to set up a trust mechanism so that the server that has all the zones can copy a configuration file and restart the nameservers without intervention. Easy enough to do with ssh.
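The copy-and-restart step could be sketched roughly as follows, assuming key-based ssh logins from the master to each slave are already in place. The host names, the `$host.conf` file naming, the remote path `/etc/named.conf` and the `named` init script location are all assumptions; with `DRY_RUN=1` (the default here) the commands are only printed, never executed.

```shell
#!/bin/sh
# Sketch: push each slave its generated config, then restart named there.
# Host names, file naming, remote path and init script are hypothetical.

SLAVES="ns1.example.com ns2.example.com ns3.example.com ns4.example.com"
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

push_config() {
    dir=$1   # directory holding one generated <host>.conf per slave
    for host in $SLAVES; do
        # If the copy fails, skip the restart and the remaining hosts.
        run scp "$dir/$host.conf" "root@$host:/etc/named.conf" || return 1
        run ssh "root@$host" /etc/rc.d/init.d/named restart || return 1
    done
}

push_config generated
```

Returning at the first failure is a deliberate choice in this sketch: a name server that never received its new file keeps running with the old, working configuration.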
But there were some … interoperability issues between the various machines with respect to their various instances of ssh (basically, scp (secure copy) didn't work due to protocol differences). Easily solved by installing the latest version of OpenSSH on each machine.
Well, actually, installing the latest version of zlib, then OpenSSL and then OpenSSH.
This master server, the one with all the zones, didn't need this upgrade, so I didn't bother with that. The four authoritative name servers, however, needed the upgrades. Now, I should mention at this point that the four authoritative name servers are all Cobalt RaQs—sure they're pretty old, but at 1U high and a low power consumption, they're fine for doing the dedicated task of resolving DNS queries.
The upgrade went smoothly on three of the machines—pretty much:
# cd zlib-1.2.3
# ./configure
# make
# make install
# cd ../openssl-0.9.8a
# ./configure
# make
# make install
# cd ../openssh-4.3p1
# ./configure
# make
# make install
# /etc/rc.d/init.d/sshd stop
# /etc/rc.d/init.d/sshd start
On the fourth machine (which happened to be the primary of our authoritative name servers) the make install of OpenSSH failed.
Which I didn't notice.
Oops.
The result of which was a borked program that refused to run, and no backup of the working version.
Oops.
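In hindsight, a small guard around the build steps would have caught this. The following is only a sketch under my own assumptions (the helper name `run_steps` is made up): it runs each step in order and stops at the first one that exits non-zero, so a failed `make install` can't scroll past unnoticed.

```shell
#!/bin/sh
# Sketch: run install steps in order and stop at the first one that fails.

run_steps() {
    for step in "$@"; do
        echo "==> $step"
        if ! sh -c "$step"; then
            echo "FAILED: $step" >&2
            return 1
        fi
    done
    echo "all steps succeeded"
}

# Demonstration with harmless commands: the second step fails, so the
# third is never run.
run_steps "true" "false" "echo never reached" || echo "stopped early"
```

For the real thing, the first step would be a backup of the working binary, something like `cp -p /usr/sbin/sshd /usr/sbin/sshd.bak` (that path is an assumption; it depends on where the existing sshd lives), followed by `./configure`, `make` and `make install`, with the sshd restart only attempted if everything before it succeeded.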
Somehow, I ended up being logged out of the machine. And without a working sshd there was no way I could log back into the machine.
Well, not easily.
You see, the Cobalt RaQs don't have video or keyboard ports. They're designed as servers—they don't really need such devices. They do, however, have a serial port you can log in through.
So I hook up a serial cable from a nearby server to the RaQ in question and that's when I got hit with Murphy's Law yet again—the serial login was disabled.
Hrm.
Okay. Take the machine out of the rack, take the drive out of the machine, hook it up to my workstation, change it so one can log in through the serial port, put the drive back in, power up the machine and log in through the serial port.
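With the drive mounted on another machine, whether serial logins are enabled can be checked with something like the sketch below. It assumes the usual SysV-init arrangement, where `/etc/inittab` starts a getty on the port and `/etc/securetty` lists the terminals root may log in on; the RaQ's actual files may well differ, and the function name and the `/mnt/raq` mount point are hypothetical.

```shell
#!/bin/sh
# Sketch: check whether a SysV-init system is set up for serial logins.
# Assumes the conventional inittab/securetty layout, which may not match
# what a given box actually uses.

serial_login_enabled() {
    inittab=$1
    securetty=$2
    tty=${3:-ttyS0}
    # An active (uncommented) inittab entry must start a getty on the port...
    grep -v '^[[:space:]]*#' "$inittab" | grep -q "getty.*$tty" || return 1
    # ...and the port must be listed in securetty for root to log in there.
    grep -qx "$tty" "$securetty" || return 1
}

# Example, with the drive mounted at a hypothetical /mnt/raq:
#   serial_login_enabled /mnt/raq/etc/inittab /mnt/raq/etc/securetty \
#       && echo "serial login looks enabled"
```

If the check fails, the fix is to add or uncomment the getty entry for the serial port in the inittab and add the port to securetty, then unmount the drive cleanly before moving it back.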
So I start to take the machine out of the rack when I get hit with Murphy's Law for a third time—one of the screws is stripped, so therefore I can't get it out of the rack.
Okay, now what?
I know that Linux (which is what runs on this Cobalt RaQ) can support a serial console. Maybe I can boot into single user mode and go from there.
Nice idea, but apparently the Linux kernel for these boxes doesn't support the serial console (as incredible as that may seem). Yes, I can see the shell prompt in single user mode, but everything I type just goes into the bit bucket (and this I try several times, with different arguments to the Linux kernel to try to get it to use a serial console). And each time I do this, I end up having to shut the machine down, which leaves the file system in an inconsistent state, requiring the use of fsck to fix.
Okay, so I really need to get the machine out of the rack. But how to do that? I'm looking at the situation when I get an idea: I'll attack the problem from a different angle. Literally! The Cobalt RaQ has two “wings” (one on either side) which are attached by screws, and it's these “wings” which are then screwed into the rack. I can get access to the screws holding the wing in place. So, I effectively remove the wing from the RaQ and it slides right out.
Then it's to my workstation. Open up the Cobalt RaQ, remove the drive, attach said drive to the external USB drive case, turn it on, run fsck on the drive, edit the configuration to allow logins from the serial port, umount the drive, power it down, remove it from the USB drive case, put it back into the RaQ, power it on
and—
—have it fail to boot.
You see, when I “fixed” the drive using fsck on my workstation, it marked the drive as being a newer version of the filesystem. Which the fsck on the Cobalt RaQ doesn't support (as part of the boot up sequence, it automatically checks the drives using fsck).
Murphy strikes again.
Okay, attach the drive to my workstation, copy over a newer version of fsck, move the drive back to the RaQ and power up—
—only to have it fail yet again. Apparently, the old version of
fsck used options that the new version of fsck doesn't like. So back to the workstation, modify the startup scripts to remove the options fsck is bitching about, try again, move the drive back to the workstation because I apparently edited the wrong startup scripts and try again only to find out I mucked it up again, so back to the workstation …
It was about half an hour of moving the drive back and forth before I got the RaQ to finally finish booting and to the point where I could log in successfully through the serial port.
Start the OpenSSH install from scratch.
# tar xzvf ../archive/openssh-4.3p1.tar.gz
# cd openssh-4.3p1
# ./configure
Only now configure failed!
Huh?
I check, and gcc failed with an internal error!
I try a quick C program and yup, gcc is totally borked now.
At this point, the only thing left is to reinstall the operating system.
Now, the installation procedure for the Cobalt RaQ 3s and 4s requires another PC, which you boot using a special CD. The PC in question must have a single network port and one (1) CD drive. Anything else will confuse the installation CD. Once this CD is booted, you then force the Cobalt RaQ to do a netboot. This PC will see the netboot request, and feed it an installation program which will install Linux on the Cobalt RaQ.
Because of the requirements of the installation CD, I have to use P's computer, as it's the one that fits them. But P's computer is on its last legs, sounding much like a dying diesel engine. But it's not dead yet.
Only it hung during the installation, having difficulty reading the CD.
I try it again. Same thing.
I turn off P's computer for half an hour. You know, let it cool down. Try it again. Same thing.
Smirk then suggested another computer in the office.
Same thing—it hangs.
It was then I remembered something from my past experiences with installing Cobalt RaQs: don't use the Cobalt RaQ 3 installation disks! They don't work. Use the Cobalt RaQ 4 installation disks instead (even if you are installing on a Cobalt RaQ 3, which I was).
That worked.
So now I had a fresh install of the operating system.
But no ssh.
Now, how to get files to the box … okay, the Cobalt RaQ has ftp. Okay, compile an FTP server on my workstation. Then use ftp to transfer zlib, openssl, openssh and bind to the Cobalt RaQ. Spend the next couple of hours compiling.
Finally had it back up and running and could finish the job I started some eight hours previously.
Blarg.
Success
Yes!
Every few months for the past year or two, I've been trying to make cornbread.
My Great Aunt (my Mom's Aunt) Freddie (yup, that is her real name) would make this wonderful cornbread and the thing was—I knew how she made it too. Using two packages of Martha White's Cotton Pickin' Cornbread™ Mix (just add water) baked in a hot skillet in the oven (you pre-heat the skillet).
Only that has never worked for me. Either they came out too flat (used too big of a skillet) or stuck like crazy to the skillet and the results were okay at best.
Tonight, I went ahead and made cornbread from scratch (flour, corn meal, baking powder, baking soda, salt, egg, buttermilk, shortening) and made sure the skillet (the 8″ one) was well greased.
It popped right out of the skillet and was the right thickness.
Woot!
And boy, did it go fast!