So the kernel is still running, not the userland stuff.
So I make plans to drive down to Ft. Lauderdale, where the colocation facility I use is located. Now, last I heard, you are supposed to call down there to let them know you're coming, and since it's about midnight when tower (the machine in question) stopped responding, I can understand that. But Spring (who I have to pick up from work at 12:30 am) has a cell phone; we can call as we drive down there.
Before I leave, Mark calls. He thinks it's a network problem at the facility. And he does have a point:
64 bytes from 220.127.116.11: icmp_seq=15 ttl=52 time=140.707 ms
64 bytes from 18.104.22.168: icmp_seq=16 ttl=52 time=155.107 ms
64 bytes from 22.214.171.124: icmp_seq=17 ttl=52 time=134.498 ms
64 bytes from 126.96.36.199: icmp_seq=18 ttl=52 time=141.993 ms
64 bytes from 188.8.131.52: icmp_seq=18 ttl=50 time=251.784 ms (DUP!)
64 bytes from 184.108.40.206: icmp_seq=18 ttl=48 time=362.700 ms (DUP!)
64 bytes from 220.127.116.11: icmp_seq=18 ttl=46 time=478.254 ms (DUP!)
64 bytes from 18.104.22.168: icmp_seq=18 ttl=44 time=584.002 ms (DUP!)
64 bytes from 22.214.171.124: icmp_seq=18 ttl=42 time=705.216 ms (DUP!)
64 bytes from 126.96.36.199: icmp_seq=18 ttl=40 time=816.821 ms (DUP!)
64 bytes from 188.8.131.52: icmp_seq=19 ttl=52 time=176.100 ms
And traceroutes also show some anomalies (note: Mark uses OpenBSD and its ping prints duplicate ping replies. The one for Linux (which is what I use) doesn't. But traceroute under both shows anomalies). So he does have a point. I call down there and, with some minor hassle, get a trouble ticket submitted.
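For anyone who wants to pull the duplicates out of a longer ping capture, here is a minimal sketch in Python. It is not something Mark or I ran that night. The function names `group_replies` and `duplicated` are made up for illustration, the regular expression assumes reply lines shaped like the ones in the capture, and the address `192.0.2.1` in the sample is a documentation placeholder rather than tower's real address. The sequence numbers, TTLs, and times are taken from the capture.

```python
import re
from collections import defaultdict

# Matches the interesting part of a reply line such as:
#   64 bytes from 192.0.2.1: icmp_seq=18 ttl=50 time=251.784 ms (DUP!)
# (assumed format; other ping implementations may print things differently)
REPLY = re.compile(
    r"icmp_seq=(?P<seq>\d+)\s+ttl=(?P<ttl>\d+)\s+time=(?P<ms>[\d.]+)\s*ms"
)


def group_replies(lines):
    """Group replies by sequence number as (ttl, ms) tuples, in arrival order."""
    replies = defaultdict(list)
    for line in lines:
        match = REPLY.search(line)
        if match:  # silently skip anything that is not a reply line
            replies[int(match["seq"])].append(
                (int(match["ttl"]), float(match["ms"]))
            )
    return dict(replies)


def duplicated(replies):
    """Keep only the sequence numbers that were answered more than once."""
    return {seq: seen for seq, seen in replies.items() if len(seen) > 1}


if __name__ == "__main__":
    # Placeholder address; seq/ttl/time values copied from the capture.
    sample = [
        "64 bytes from 192.0.2.1: icmp_seq=17 ttl=52 time=134.498 ms",
        "64 bytes from 192.0.2.1: icmp_seq=18 ttl=52 time=141.993 ms",
        "64 bytes from 192.0.2.1: icmp_seq=18 ttl=50 time=251.784 ms (DUP!)",
        "64 bytes from 192.0.2.1: icmp_seq=18 ttl=48 time=362.700 ms (DUP!)",
        "64 bytes from 192.0.2.1: icmp_seq=19 ttl=52 time=176.100 ms",
    ]
    for seq, seen in duplicated(group_replies(sample)).items():
        print(f"icmp_seq={seq} answered {len(seen)} times: {seen}")
```

In the capture, every duplicate of sequence 18 comes back with a TTL two lower than the one before it and roughly 110 ms later. Grouping by sequence number makes that sort of pattern easy to spot and to paste into a trouble ticket.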
Now, here is where I digress a bit. One of my friends and clients colocates a server down there, and since Mark and I do work for him (more in the past, still do stuff for him now) and since he doesn't use all his allotted bandwidth, he allowed us to place tower down there along with his. So it's really his account. Which is why I had a bit of a hassle getting a trouble ticket submitted (Server id? Server password? I have to submit a trouble ticket via the website? What?).
I go pick up Spring, and talk to Rob who works nights there about the problem and we both agreed it sounded more like a downed server than a network problem (although there are network problems). I'm not going through the problem I had last time and want to drive down there and reboot it myself. Spring didn't have a problem with that, although she didn't have her cell phone with her so we couldn't call ahead.
Oh well. We'll deal with getting in when we get there.
Half an hour later, I'm buzzing and knocking on the door. They're asking me questions from inside (which I can barely hear) and I'm shouting answers back to them (which they can barely hear). They finally open the door and let me in.
“You're supposed to fill out an On-Site Request Form on our webpage,” said the technician. “Then it'll be approved within two hours.”
“I wasn't aware of that,” I said. What I was thinking was, Two XXXXXXX hours? My server is down and I have to wait two XXXXXXXXXXXXXXXXXXXXXX hours to be approved? What's up with that?
“But you're supposed to fill out an On-Site Request Form and be approved,” said the technician. More bantering between the two of us. “Okay, let me get you a form to fill out.” He goes off. Spring and I sit down and wait.
Where Spring works, colocation customers have 24 hour access to their boxes and they don't have to fill out “On-Site Request Forms.” When I worked for the ISP we had colocation customers with 24 hour access and no “On-Site Request Forms” to fill out.
I can see perhaps calling ahead to inform them you are on the way. I cannot see filling out a form and waiting two hours for approval, especially since we live over half an hour away (by car), even more so because the server is down and a two hour (minimum) outage is bad.
The technician wanders back in, hands me the forms, and disappears again.
I fill out the forms. I'm not requesting to add or remove equipment (which can only be done during regular business hours, by the way), just checking out a server.
Then we wait.
Hold on a sec—no. Wait.
Finally I am allowed back into the server room. Spring stays behind in the lobby, not wanting to further complicate things. The technician leads me through the maze of racks to my server and hooks up a monitor and keyboard (which are on a crash cart) to it. I then get to work.
Yup. Runaway process sucking up all memory, and not even a Ctrl-Alt-Del will bring this Linux system down (I'm able to check memory usage with Shift-ScrollLock and processes with Ctrl-ScrollLock). Power cycle. I thought I could skip the rather expensive fsck of the 17G hard drive by booting into single user mode, but alas, I was wrong (I had wanted to remove any possibility of the offending program running). I debate waiting around for it to finish (about half an hour) but decide against that. I manage to reboot it, this time normally, and leave.
Now, what I should have done is told one of the technicians there to let the system run because it takes about 30-40 minutes for it to check the disk. But there were no technicians around and I had my fill of the place for the night.
Spring and I leave. On the way home we stop and get a bite to eat. It's almost 4:00 am when we get back home, and tower still isn't back up. What the …
I check the trouble ticket: as per their policy, the machine wasn't responding to any of their monitoring software, so of course they rebooted it. Aarhglghlghahhhhhhhhhhhalg! The time of the last comment was about 3:40 am, so tower should be nearly finished rebooting. I start pinging, and as soon as I get a response I log in, remove the offending program, and check the system out.
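The pinging-until-it-answers part could be left to a small loop instead of being done by hand. A rough sketch in Python, under a few assumptions: the system `ping` accepts `-c 1` to send a single echo request (the Linux and BSD versions do), the helper names `ping_once` and `wait_until_up` are invented here, and `tower.example` is a stand-in hostname.

```python
import subprocess
import time


def ping_once(host, timeout=5.0):
    """Send one echo request with the system ping; True if it was answered."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # give up if ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def wait_until_up(host, probe=ping_once, interval=10.0, max_tries=None,
                  sleep=time.sleep):
    """Probe host until it answers.

    Returns the number of attempts it took, or None if max_tries ran out.
    probe and sleep are parameters so the loop can be exercised without
    touching the network.
    """
    tries = 0
    while True:
        tries += 1
        if probe(host):
            return tries
        if max_tries is not None and tries >= max_tries:
            return None
        sleep(interval)


if __name__ == "__main__":
    # Hypothetical hostname; about ten minutes of trying at most.
    attempts = wait_until_up("tower.example", interval=10.0, max_tries=60)
    if attempts is None:
        print("still down")
    else:
        print(f"answered after {attempts} attempt(s)")
```

A reply to ping only says the kernel and network stack are up. As the start of this story shows, that does not mean userland is ready, so logging in may still take a few more tries.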
Now, I had rebooted the machine twice. They ended up rebooting it four times.
My guess is that they're used to customers who don't manage their own servers and leave that stuff up to them. Not bad in and of itself but it certainly isn't what I expect and having to deal with their rules is a bit grating, but I really can't beat the price right now.