So the kernel is still running, just not the userland stuff.
So I make plans on driving down to Ft. Lauderdale, where the colocation facility I use is located. Now, last I heard, you are supposed to call down there to let them know you're coming, and since it's about midnight when tower (the machine in question) stopped responding, I can understand that. But Spring (who I have to pick up from work at 12:30 am) has a cell phone; we can call as we drive down there.
Before I leave, Mark calls. He thinks it's a network problem at the facility. And he does have a point:
64 bytes from 126.96.36.199: icmp_seq=15 ttl=52 time=140.707 ms
64 bytes from 188.8.131.52: icmp_seq=16 ttl=52 time=155.107 ms
64 bytes from 184.108.40.206: icmp_seq=17 ttl=52 time=134.498 ms
64 bytes from 220.127.116.11: icmp_seq=18 ttl=52 time=141.993 ms
64 bytes from 18.104.22.168: icmp_seq=18 ttl=50 time=251.784 ms (DUP!)
64 bytes from 22.214.171.124: icmp_seq=18 ttl=48 time=362.700 ms (DUP!)
64 bytes from 126.96.36.199: icmp_seq=18 ttl=46 time=478.254 ms (DUP!)
64 bytes from 188.8.131.52: icmp_seq=18 ttl=44 time=584.002 ms (DUP!)
64 bytes from 184.108.40.206: icmp_seq=18 ttl=42 time=705.216 ms (DUP!)
64 bytes from 220.127.116.11: icmp_seq=18 ttl=40 time=816.821 ms (DUP!)
64 bytes from 18.104.22.168: icmp_seq=19 ttl=52 time=176.100 ms
And traceroutes also show some anomalies (note: Mark uses OpenBSD, and its ping reports duplicate reply packets; the one for Linux, which is what I use, doesn't. But traceroute under both shows anomalies). So he does have a point. I call down there and, with some minor hassle, get a trouble ticket submitted.
Now, here is where I digress a bit. One of my friends and clients colocates a server down there, and since Mark and I do work for him (more in the past, still do stuff for him now) and since he doesn't use all his allotted bandwidth, he allowed us to place tower down there along with his. So it's really his account. Which is why I had a bit of a hassle getting a trouble ticket submitted (Server id? Server password? I have to submit a trouble ticket via the website? What?).
I go pick up Spring, and talk to Rob who works nights there about the problem and we both agreed it sounded more like a downed server than a network problem (although there are network problems). I'm not going through the problem I had last time and want to drive down there and reboot it myself. Spring didn't have a problem with that, although she didn't have her cell phone with her so we couldn't call ahead.
Oh well. We'll deal with getting in when we get there.
Half an hour later, I'm buzzing and knocking on the door. They're asking me questions from inside (which I can barely hear) and I'm shouting answers back to them (which they can barely hear). They finally open the door and let me in.
“You're supposed to fill out an On-Site Request Form on our webpage,” said the technician. “Then it'll be approved within two hours.”
“I wasn't aware of that,” I said. What I was thinking was, Two XXXXXXX hours? My server is down and I have to wait two XXXXXXXXXXXXXXXXXXXXXX hours to be approved? What's up with that?
“But you're supposed to fill out an On-Site Request Form and be approved,” said the technician. More bantering between the two of us. “Okay, let me get you a form to fill out.” He goes off. Spring and I sit down and wait.
Where Spring works, colocation customers have 24 hour access to their boxes and they don't have to fill out “On-Site Request Forms.” When I worked for the ISP we had colocation customers with 24 hour access and no “On-Site Request Forms” to fill out.
I can see perhaps calling ahead to inform them you are on the way. I cannot see filling out a form and waiting two hours for approval, especially since we live over half an hour away (by car), even more so because the server is down and a two hour (minimum) outage is bad.
The technician wanders back in, hands me the forms, and disappears again.
I fill out the forms. I'm not requesting adding or removing equipment (which can only be done during regular business hours, by the way), just checking out a server.
Then we wait.
Hold on a sec—no. Wait.
Finally I am allowed back into the server room. Spring stays behind in the lobby, not wanting to further complicate things. The technician leads me through the maze of racks to my server and hooks up a monitor and keyboard (which are on a crash cart) to it. I then get to work.
Yup. A runaway process is sucking up all memory, and not even a Ctrl-Alt-Del will bring this Linux system down (I'm able to check memory usage with Shift-ScrollLock and processes with Ctrl-ScrollLock). Power cycle. I thought I could skip the rather expensive fsck of the 17G hard drive by booting into single user mode (I had wanted to rule out any possibility of the offending program running), but alas, I was wrong. I debate waiting around for it to finish (about half an hour) but decide against that. I manage to reboot it, this time normally, and leave.
Now, what I should have done is told one of the technicians there to let the system run because it takes about 30-40 minutes for it to check the disk. But there were no technicians around and I had my fill of the place for the night.
Spring and I leave. On the way home we stop and get a bite to eat. It's almost 4:00 am when we're back home, and tower still isn't back up. What the …
I check the trouble ticket. As per their policy, the machine wasn't responding to any of their monitoring software, so of course they rebooted it. Aarhglghlghahhhhhhhhhhhalg! The time of the last comment was about 3:40 am, so tower should be nearly finished rebooting. I start pinging, and as soon as I get a response I log in, remove the offending program, and check the system out.
Now, I had rebooted the machine twice. They ended up rebooting it four times.
My guess is that they're used to customers who don't manage their own servers and leave that stuff up to them. Not bad in and of itself but it certainly isn't what I expect and having to deal with their rules is a bit grating, but I really can't beat the price right now.
what was the offending program?
Glad you asked (wince).
It turned out to be mod_blog, the program that runs this very site.
A friend of mine (who for now wishes to remain anonymous) is interested in blogging and I said I would set him up with my system. So I create a site for him, copy over my existing template, modify the configuration for him, etc., etc. I then send in (via email) the first post to see if things work.
Well, that's where things didn't work.
The post itself was accepted and stored correctly.
Small digression: The Boston Diaries is primarily dynamic. You type in something like http://boston.conman.org/2002/3/4 and that page is generated on the fly. If I change the template, the change takes effect immediately. However, the main page, the one you get by going to http://boston.conman.org/, is not dynamic; it's actually a static page recreated whenever a new entry is posted. It doesn't have to be, but I figure that since this page is probably going to be loaded most often, I might as well cache a static copy to keep the system load down.
So, part of the process of accepting and storing an entry is the generation of the main page. Normally, it works fine. But not in this case.
Another small digression: the configuration file for mod_blog needs the starting date of the blog. There are cases where I need this information, and instead of wasting a lot of time walking backwards from today to find the first entry, I keep it as cached information in the configuration.
I had thought that I may have made a mistake in the starting date. No, I got the starting date correct. What I didn't get correct was handling the situation when a blog is actually started.
When writing the software, I had already been keeping entries. In fact, I think I had a month or so worth of entries when I started the code two years ago. And I was so focused on getting the URL processing correct, that I neglected to test some border cases. And I never got around to testing those cases since they didn't affect me.
Problems I found:
- Not handling the case when there are no entries.
- Not handling the case when there are fewer than X days worth of entries (where X is the number of days to display on the main page).
- Not handling the case when there are fewer than 15 entries (note—not the same thing as having 15 days worth of entries, and this is for the RSS file).
- And one or two cases of not checking to see if you've passed the first entry or most current entry.
Cases I should have handled (and tested for!) but neglected.
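A minimal sketch, in C, of the kind of checks the cases listed above call for. This is not mod_blog's actual code: `struct window`, `recent_window` and `step_entry` are hypothetical names made up for illustration, and the sketch assumes entries are simply numbered from 0 (the oldest) to total - 1 (the newest).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types and helpers for illustration; not taken from mod_blog. */
struct window {
    size_t first;   /* index of the oldest entry to show */
    size_t count;   /* how many entries to show          */
};

/* Pick the newest `want` entries out of `total`.  With fewer than `want`
   entries the window covers all of them; with none it is empty. */
struct window recent_window(size_t total, size_t want)
{
    struct window w;

    w.count = (want < total) ? want : total;
    w.first = total - w.count;
    assert(w.first + w.count == total);   /* window always ends at the newest entry */
    return w;
}

/* Move `delta` entries from `index` (negative means towards older entries)
   without walking past the first entry or the most current one. */
size_t step_entry(size_t index, long delta, size_t total)
{
    size_t back;
    size_t room;

    if (total == 0)
        return 0;                          /* nothing to step through */
    if (index >= total)
        index = total - 1;                 /* clamp a stray starting point */
    if (delta < 0) {
        back = (size_t)(-(delta + 1)) + 1; /* magnitude of delta, without overflowing */
        return (back > index) ? 0 : index - back;
    }
    room = total - 1 - index;
    return ((size_t)delta > room) ? total - 1 : index + (size_t)delta;
}
```

With 3 entries and a main page that wants 7, recent_window(3, 7) covers all three; with 20 entries and an RSS limit of 15, it covers entries 5 through 19. A brand new blog with no entries gets an empty window rather than a loop that never ends.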
I do need to really go through the code and clean it up.
The problem is that I'm a bit too close to the code. I'm working towards a specific goal (a new method of document storage, retrieval and reference), so the software is experimental, and the problems I was focusing on meant I missed some reliability details elsewhere, since hey, it works in my case.
And while some people have probably grabbed the software I doubt many, if any, are actually using the software since I'm not getting any feedback on the code itself. Okay, you do have to hunt around to find the link to the source code but it has been downloaded. And I'm sure it being written in C makes it all that much more popular.
But little did I expect software I wrote to crash a Unix server. The last time I saw userland software (an application) crash a Unix server was … oh … eight years ago I think.
Grey Matter, which seems to be one of the more popular blogging software packages out there. I downloaded it, partly out of curiosity and partly for a project I'm working on (and yes, I'm scoping out the competition).
Now, I can see why Grey Matter is popular: it installs very easily (I had it running in a few minutes), is template driven (so the output can look exactly like you want it to) and … it's in Perl.
Which means—if the web server allows CGI, you can run Grey Matter. There is no compiling. Just stick it in, make sure the location of Perl in the scripts is correct, and go.
I think that reason alone is why Perl is pretty much used everywhere on the World Wide Web.
Now, my software that runs this site is in C. One, I can't stand Perl. Never have. And thankfully, I've never had to maintain Perl code either. Two, tower (the server that runs this site) is a 486. A 33MHz 486. By today's standards the machine shouldn't be running, much less running a website. You can't even give 486-based machines away, which is sad, since they work. This site is proof. But anyway, this machine is slow, and running a blog written in Perl would be torture indeed.
Like Grey Matter. I'm doing my testing of it on a 120MHz machine and Grey Matter is SSS-LLL-OOO-WWW. Not quite painfully slow (painfully slow would be running it on tower) but too slow to be used by more than a few people on my local box here.
office and I notice that pineal, the SGI box I use, has crashed. Hard. The kernel crashed.
For a Unix system, that's bad.
Not knowing what happened, I bring the box back up and go about my business.
A few days later, it's crashed again.
This time though, I have a slight clue as to what might be going on. I had noticed one of the users of the box doing some odd things. Normally, I wouldn't care (seeing how this user was odd to begin with) but I couldn't help thinking that the odd thing this user was doing might have caused the machine to crash.
I, suspecting what happened, bring the box back up and go about my business.
A few days later, it's crashed again.
I bring it back up, get the odd user in question, and ask him to do exactly what he was doing just prior to the crash.
I watch as the machine crashes.
Now, I still don't know why it crashed, but at least I know what caused it to crash. It seems that the user in question logged in and ran a program called screen. screen is a program that allows you to have multiple command lines via a single login session (like a single terminal), and it keeps the session alive if you disconnect (heck, I used that program myself for those two reasons). Then he would log into IRC and have his IRC client log a channel to a file. He would then disconnect, leaving the IRC client running (because of screen) and logging a channel to a file. Doing both of those things would cause the system to crash.
Odd. Then again, he was an odd user.
So I basically banned the use of IRC on my box. He was the only user who used IRC and he had access to other systems with it, so it wasn't that big of a loss.
But it's odd the programs that can crash a Unix server.
bad enough I just found I lost my checkbook.
Interesting how I found something is lost. Funny how the English language works.
So now it's off to scour the Facility in the Middle of Nowhere to see if I can find a checkbook that I just found was lost.
My head hurts.
checkbook. Earlier I found I lost it. Now I found what I found I lost.
And my head doesn't hurt as much.