Wednesday, March 20, 2002
Well, that was pleasant
My colocation half drops off the face of the earth. I can still ping it; I
can traceroute to it, but any higher form of connection (TCP connections,
for example) just hangs. So the kernel is still running, but not the
userland stuff.
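For the curious, that "kernel up, userland stuck" state can be told apart from a dead network with a few lines of code. What follows is only a rough sketch, not what I actually ran that night: the function name probe() and its result labels are made up for illustration, and it assumes the service sends a banner first (as SSH and SMTP do). The kernel completes the TCP handshake by itself, so a connection that is accepted but never says anything points at starved userland rather than the network.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a TCP service (hypothetical helper, labels are my own).

    Returns one of:
      "refused"     - nothing is listening on the port
      "unreachable" - the connection attempt timed out or failed
      "silent"      - the kernel accepted the connection, but the
                      server process never sent a byte
      "closed"      - the peer closed the connection without data
      "talking"     - the server process sent at least one byte
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"
    except OSError:  # covers timeouts and routing errors
        return "unreachable"

    with conn:
        conn.settimeout(timeout)
        try:
            data = conn.recv(1)  # wait for the first banner byte
        except socket.timeout:
            return "silent"
        except OSError:
            return "unreachable"
        return "talking" if data else "closed"
```

Something like probe("66.33.1.143", 22) would be the call here (port 22 is my assumption); a result of "silent" would match what I was seeing, while "unreachable" would lean toward Mark's network theory.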
So I make plans on driving down to Ft. Lauderdale where the colocation
facility I use is located. Now, last I heard, you are supposed to call down
there to let them know you're coming, and since it's about midnight when
tower (the machine in question) stopped responding, I can understand that.
But Spring (who I have to pick up from work at 12:30 am) has a cell phone;
we can call as we drive down there.
Before I leave, Mark calls. He thinks it's a network problem at the facility. And he does have a point:
64 bytes from 66.33.1.143: icmp_seq=15 ttl=52 time=140.707 ms
64 bytes from 66.33.1.143: icmp_seq=16 ttl=52 time=155.107 ms
64 bytes from 66.33.1.143: icmp_seq=17 ttl=52 time=134.498 ms
64 bytes from 66.33.1.143: icmp_seq=18 ttl=52 time=141.993 ms
64 bytes from 66.33.1.143: icmp_seq=18 ttl=50 time=251.784 ms (DUP!)
64 bytes from 66.33.1.143: icmp_seq=18 ttl=48 time=362.700 ms (DUP!)
64 bytes from 66.33.1.143: icmp_seq=18 ttl=46 time=478.254 ms (DUP!)
64 bytes from 66.33.1.143: icmp_seq=18 ttl=44 time=584.002 ms (DUP!)
64 bytes from 66.33.1.143: icmp_seq=18 ttl=42 time=705.216 ms (DUP!)
64 bytes from 66.33.1.143: icmp_seq=18 ttl=40 time=816.821 ms (DUP!)
64 bytes from 66.33.1.143: icmp_seq=19 ttl=52 time=176.100 ms
And traceroutes also show some anomalies (note: Mark uses OpenBSD and its
ping prints duplicate ping packets. The one for Linux (which is what I use)
doesn't. But traceroute under both shows anomalies). So he does have a
point. I call down there and, with some minor hassle, get a trouble ticket
submitted.
Now, here is where I digress a bit. One of my friends and clients colocates
a server down there, and since Mark and I do work for him (more in the past,
still do stuff for him now) and since he doesn't use all his allotted
bandwidth, he allowed us to place tower down there along with his. So it's
really his account. Which is why I had a bit of a hassle getting a trouble
ticket submitted (Server id? Server password? I have to submit a trouble
ticket via the website? What?).
I go pick up Spring, and talk to Rob who works nights there about the problem and we both agreed it sounded more like a downed server than a network problem (although there are network problems). I'm not going through the problem I had last time and want to drive down there and reboot it myself. Spring didn't have a problem with that, although she didn't have her cell phone with her so we couldn't call ahead.
Oh well. We'll deal with getting in when we get there.
Half an hour later, I'm buzzing and knocking on the door. They're asking me questions from inside (which I can barely hear) and I'm shouting answers back to them (which they can barely hear). They finally open the door and let me in.
“You're supposed to fill out an On-Site Request Form on our webpage,” said the technician. “Then it'll be approved within two hours.”
“I wasn't aware of that,” I said. What I was thinking was, Two XXXXXXX hours? My server is down and I have to wait two XXXXXXXXXXXXXXXXXXXXXX hours to be approved? What's up with that?
“But you're supposed to fill out an On-Site Request Form and be approved,” said the technician. More bantering between the two of us. “Okay, let me get you a form to fill out.” He goes off. Spring and I sit down and wait.
Where Spring works, colocation customers have 24 hour access to their boxes and they don't have to fill out “On-Site Request Forms.” When I worked for the ISP we had colocation customers with 24 hour access and no “On-Site Request Forms” to fill out.
I can see perhaps calling ahead to inform them you are on the way. I cannot see filling out a form and waiting two hours for approval, especially since we live over half an hour away (by car), even more so because the server is down and a two-hour (minimum) outage is bad.
The technician wanders back in, hands me the forms, and disappears again.
I fill out the forms. I'm not requesting adding or removing equipment (which can only be done during regular business hours, by the way), just checking out a server.
Then we wait.
Wait.
Wait.
Hold on a sec—no. Wait.
Finally I am allowed back into the server room. Spring stays behind in the lobby, not wanting to further complicate things. The technician leads me through the maze of racks to my server and hooks up a monitor and keyboard (which are on a crash cart) to it. I then get to work.
Yup. Runaway process sucking up all memory and not even a Ctrl-Alt-Del will
bring this Linux system down (I'm able to check memory usage with
Shift-ScrollLock and processes with Ctrl-ScrollLock). Power cycle. I thought
I could skip the rather expensive fsck of the 17G hard drive by booting into
single user mode, but alas, I was wrong (I had wanted to remove any
possibility of the offending program running). I debate waiting around for
it to finish (about half an hour) but decide against that. I manage to
reboot it, this time normally, and leave.
Now, what I should have done is told one of the technicians there to let the system run because it takes about 30-40 minutes for it to check the disk. But there were no technicians around and I had my fill of the place for the night.
Spring and I leave. On the way home we stop and get a bite to eat. It's
almost 4:00 am when we're back home and tower still isn't back up.
What the …
I check the trouble ticket: as per their policy, the machine wasn't
responding to any of their monitoring software so of course they rebooted
it. Aarhglghlghahhhhhhhhhhhalg! The time of the last comment was about
3:40 am so tower should be nearly finished rebooting. I start pinging and as
soon as I get a response I start logging in, removing the offending program
and checking the system out.
Now, I had rebooted the machine twice. They ended up rebooting it four times.
Sigh.
My guess is that they're used to customers who don't manage their own servers and leave that stuff up to them. Not bad in and of itself but it certainly isn't what I expect and having to deal with their rules is a bit grating, but I really can't beat the price right now.