Monday, February 07, 2005
Kludge works
The server that this site was being hosted on. The one that went down last month? Well, last week it was retrieved and left behind my cubicle with that neat Zen-like emptiness to it. I trotted it down to the data center at The Company, powered it up, and set it to constantly read from the hard drives, since I suspected those were the failure point.
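For anyone wanting to run a similar soak test, here is a minimal sketch of the idea in Python: read a file or block device from start to end, over and over, and note any I/O error the operating system reports. It is not the tool I actually used; the names read_pass and soak are made up for illustration, and the device path in the usage note is an assumption that will differ from system to system.

```python
import time


def read_pass(path, chunk_size=1024 * 1024):
    """Read `path` once from start to end.

    Returns (bytes_read, error), where error is None on a clean pass
    or the OSError raised while opening or reading.
    """
    total = 0
    try:
        # buffering=0 gives a raw, unbuffered file object
        with open(path, "rb", buffering=0) as dev:
            while True:
                chunk = dev.read(chunk_size)
                if not chunk:  # empty read means end of file/device
                    break
                total += len(chunk)
    except OSError as exc:
        return total, exc
    return total, None


def soak(path, passes=None, pause=0.0):
    """Repeat read_pass on `path`.

    passes=None loops until interrupted (Ctrl-C); otherwise it stops
    after that many passes. Returns a list of
    (pass_number, bytes_read_before_error, error) tuples.
    """
    errors = []
    count = 0
    while passes is None or count < passes:
        count += 1
        total, err = read_pass(path)
        if err is not None:
            errors.append((count, total, err))
            print(f"pass {count}: error after {total} bytes: {err}")
        time.sleep(pause)
    return errors
```

Something like soak("/dev/hda") run as root would loop until interrupted (the device name is only an example). A clean run only says the drives answered every read; it says nothing about the rest of the machine.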
Today I checked in on the server and found spewed all over the console window:
eth0: command 0x5800 did not complete! Status=0xffff
eth0: Resetting the Tx ring pointer.
NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, tx_status ff status ffff.
  diagnostics: net ffff media ffff dma ffffffff.
eth0: Transmitter encountered 16 collisions -- network cable problem?
eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
  Flags; bus-master 1, dirty 47985(1) current 48001(1)
  Transmit list ffffffff vs. f7cea240.
eth0: command 0x3002 did not complete! Status=0xffff
  0: @f7cea200 length 8000002a status 8000002a
  1: @f7cea240 length 8000002a status 0000002a
  2: @f7cea280 length 8000002a status 0000002a
  3: @f7cea2c0 length 8000002a status 0000002a
  4: @f7cea300 length 8000002a status 0000002a
  5: @f7cea340 length 8000002a status 0000002a
  6: @f7cea380 length 8000002a status 0000002a
  7: @f7cea3c0 length 8000002a status 0000002a
  8: @f7cea400 length 8000002a status 0000002a
  9: @f7cea440 length 8000002a status 0000002a
  10: @f7cea480 length 8000002a status 0000002a
  11: @f7cea4c0 length 8000002a status 0000002a
  12: @f7cea500 length 8000002a status 0000002a
  13: @f7cea540 length 8000002a status 0000002a
  14: @f7cea580 length 8000002a status 0000002a
  15: @f7cea5c0 length 8000002a status 8000002a
wait_on_irq, CPU 0:
irq: 0 [ 0 0 ]
bh: 1 [ 0 1 ]
Stack dumps:
CPU 1:00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace:
CPU 0:c1f31f2c 00000000 00000000 ffffffff 00000000 c0109a5d c029d094 00000000 f6a2c000 c0108d22 00000000 c02c5ce4 c02001a4 ecf7a000 c1f31f98 c0115621 ecf7a000 f6a2c368 c02c5ce4 c1f31f8c c1f30664 c1f30000 c011c5cf f6a2c000
Call Trace: [<C0109A5D>] [<C0108D22>] [<C02001A4>] [<C0115621>] [<C011C5CF>] [<C0125234>] [<C0125070>] [<C0105000>] [<C010588B>] [<C0125070>]
eth0, by the way, is the network interface.
The hard drives were still running the test I left them to, with no errors whatsoever. Dan, the network engineer, did admit to borrowing the network cable that was plugged into that server earlier this morning, and that was enough to trigger the slew of errors I saw (and I did thank him for doing that; otherwise I don't think I would ever have found what the problem might have been).
Well (and it's here I wish I had remembered to take pictures).
The rack-mounted case the system is in is not very tall, and certainly not tall enough to house a PCI card standing upright. So in one of the PCI slots was a special card with PCI slots on it, such that you could now mount cards parallel to the motherboard (instead of perpendicular). Only you couldn't use the existing mounting bracket on the PCI card in question. So the mounting bracket on the network card had been removed, and the card was held in place only by the friction between the card and the slot itself.
Not a stable connection.
Problems with this server only really started when some equipment was removed from the cabinet it was in, and I'm guessing the network cable was bumped just enough to make life interesting.
On the plus side, that means the server doesn't need to be replaced or even rebuilt (thankfully!).