Thursday, Debtember 04, 2003
troubling server crashes
One of the servers I'm monitoring (and it happens to be the most critical of servers, go figure) has crashed every day for the past week on a 24.5-hour schedule. This is not good, especially since the machine in question is not a Windows system but a Linux system. The other admin and I (we're in a transition period as I take over) can't figure out what is causing the problem. The only major change this past week has been the installation of MySQL.
We're not sure what to make of the problem.
To help track it down, I installed Nagios, a framework of monitoring programs, on another server to watch the troublesome machine. It took a while to configure Nagios because the configuration file is complex: hosts, services, and contacts each get separate definitions, as do groupings of hosts, services, and contacts. That complexity means you can fine-tune the monitoring, though, and it's easy to add new hosts, services, or contacts once the initial configuration is complete.
I will also be rebooting the server in a few hours to see whether it always crashes around 8:00 in the morning or 24.5 hours after the last reboot. I doubt the crashes are due to the janitorial staff unplugging the computer to plug in their vacuum cleaner.
At least, I hope that's not the case.