Friday, April 14, 2000
“That which does not kill us, hurts like hell!”
Mark and I stopped off at Atlantic Internet after a dinner meeting to find one of the techs still there fiddling with his IOpener. We fiddled around with that, then I showed Mark one of the computers belonging to a customer I'm doing work for.
The IOpener is small. The server I showed Mark was not. This is a large machine, dual Pentium III with one gig of RAM (a gigabyte!) and some 30 gigabytes of RAID-5 storage (small these days, I know).
I'm doing some work for this customer and I had noticed that the 30G of storage wasn't mounted on the server. So, as long as I was there, might as well mount the RAID array. Mark, having a RAID array at home, was on hand to help with the consulting.
    megaraid: v107 (December 22, 1999)
    megaraid: found 0x101e:0x9010:idx 0:bus 0:slot 9:func 0
    scsi0 : Found a MegaRAID controller at 0xd810, IRQ: 17
    megaraid: [UF80:1.61] detected 1 logical drives
    scsi0 : AMI MegaRAID UF80 254 commands 16 targs 1 chans 8 luns
    scsi : 1 host.
    scsi0: scanning channel 1 for devices.
    scsi0: scanning virtual channel for logical drives.
      Vendor: MegaRAID Model: LD0 RAID5 35000R Rev: UF80
      Type: Direct-Access ANSI SCSI revision: 02
    Detected scsi disk sda at scsi0, channel 1, id 0, lun 0
    SCSI device sda: hdwr sector= 512 bytes. Sectors= 71680000 [35000 MB] [35.0 GB]
     sda: sda1 sda2 sda3 <sda5 sda6 sda7>
    (scsi1) <ADAPTEC AIC-7890/1 ULTRA2 SCSI HOST ADAPTER> found at PCI 12/0
    (scsi1) Wide Channel, SCSI ID=7, 32/255 SCBs
    (scsi1) Downloading sequencer code... 385 instructions downloaded
    scsi1 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.1.23/3.2.4
           <ADAPTEC AIC-7890/1 ULTRA2 SCSI HOST ADAPTER>
    scsi : 2 hosts.
From that, it looked like there were two disk controllers. The system was booting from SCSI, that much was apparent. What wasn't apparent was the location of the RAID system.
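As an aside, the size figures in the boot log above can be sanity-checked with a little arithmetic. This is only a sketch: the two constants are copied from the "SCSI device sda" line, and the function name is made up for illustration.

```python
# Figures copied from the "SCSI device sda" line of the boot log.
SECTOR_BYTES = 512
SECTORS = 71_680_000


def capacity_bytes(sectors: int, sector_bytes: int) -> int:
    """Total size of the device in bytes (hypothetical helper)."""
    return sectors * sector_bytes


total = capacity_bytes(SECTORS, SECTOR_BYTES)
mib = total // 2**20  # count in units of 1,048,576 bytes

print(f"{total:,} bytes = {mib} MB (in 2**20-byte units)")
```

The result agrees with the 35000 MB the kernel reported for the logical drive, so the whole array was showing up as the single disk sda.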
The BIOS POST also gave the impression of two controllers. We went into the RAID BIOS extension, initialized the RAID controller and drives, and then rebooted the system.
Turns out that the megaraid and the Adaptec SCSI controller are one and the same and that the system itself (it runs Linux) was booting off the RAID controller!
It is through our mistakes that we learn.
And it is through grovelling that we retain our customers.
Fortunately, the customer didn't lose any important data (the customer wasn't using it fully at the time), nor did he mind that much (“Next time, please consult with me before you do any irreparable configuration changes. Okay?”).
That, and I didn't like the way Linux was installed on the box to begin with.
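In hindsight, a quick look at which device holds the root filesystem would have caught the problem before anything got initialized. Here is a minimal sketch of that check, assuming a Linux box where /proc/mounts is available. The function name and the sample mount table are hypothetical; the post does not say which partition actually held the root filesystem.

```python
from typing import Optional


def root_device(mounts_text: str) -> Optional[str]:
    """Return the device mounted at "/" in /proc/mounts-style text."""
    device = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "/":
            device = fields[0]  # a later entry for "/" overrides an earlier one
    return device


# Hypothetical sample, shaped like /proc/mounts on a box that boots from sda.
SAMPLE = """\
/dev/sda1 / ext2 rw 0 0
none /proc proc rw 0 0
/dev/sda5 /usr ext2 rw 0 0
"""

print(root_device(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mounts. If the answer is a partition on the disk you are about to initialize, stop.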
While Mark and I were doing a fast recovery of a customer machine, we received a call from John, the paper millionaire of a dotcom company and former member of a Grateful Dead cover band, to say he couldn't get to his servers, located in the very same co-location facility we were currently at.
Mark goes over to John's machines. All servers are up, but he can't ping out. In fact, he can't get past the first hop. Mark then heads over to the core room, I remain in the co-location room, and we all get on a conference call.
Network seems okay—link light is on at both ends of the connection. No traffic. Jiggle the cord. Oh! A few packets. Then major lossage again. Repeat.
John is freaking out because he needs to be on a plane early and it's now 3:30 am or thereabouts. He finally conferences in the main sysadmin for Atlantic Internet because Mark and I can't figure out what's going on.
Neither can the sysadmin. Everything seems okay. Only there's no traffic. John, panicking, is yelling at Mark. Mark is yelling back at John not to panic. Meanwhile we can barely hear the sysadmin over the conference call. Pandemonium reigns.
I quickly grab the network analyzer they have (way too cool) and hook it to John's side of the connection. It lights up like a Christmas tree. Low utilization, high collisions and an even larger rate of errors. I then take the unit to the Atlantic Internet side. Nothing. Normal traffic from John's servers.
We then plug the network analyzer into the Cisco Catalyst 5000 which is serving as the main switch. Actually, it's more like three switched hubs than a real switch—there are 24 ports grouped into three sections. Each section is a hub, but switched between sections.
The network analyzer lights up like a Christmas tree.
The consensus seems to be that the Catalyst is hosed. It probably didn't survive a DoS attack a few days previously and was slowly going bad. So it was some quick work to rerun a few cables to nearby switches and remove the Catalyst from service.
Mark and I didn't leave the office until 5 am.
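The hunt above boiled down to moving the analyzer from one tap point to the next and seeing where the collisions and errors showed up. A rough sketch of that comparison follows. Everything in it is an assumption made for illustration: the counts are invented (no actual numbers were recorded), and the 5% threshold and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    tap: str         # where the analyzer was plugged in
    frames: int      # frames seen during the sample
    collisions: int
    errors: int


def is_suspect(r: Reading, max_bad_ratio: float = 0.05) -> bool:
    """Flag a tap point when collisions plus errors exceed the given share of frames."""
    if r.frames == 0:
        return False  # nothing observed, nothing to judge
    return (r.collisions + r.errors) / r.frames > max_bad_ratio


# Made-up numbers for illustration only.
readings = [
    Reading("customer side of the link", 1000, 200, 400),
    Reading("provider side of the link", 1000, 5, 1),
    Reading("main switch port", 1000, 250, 500),
]

suspects = [r.tap for r in readings if is_suspect(r)]
print(suspects)
```

With readings shaped like these, the noisy tap points are the ones that lead back to the same piece of gear, which is roughly how the Catalyst ended up as the prime suspect.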