The Boston Diaries

Monday, February 07, 2005

Kludge works

The server that this site was being hosted on, the one that went down last month? Well, last week it was retrieved and left behind my cubicle with that neat Zen-like emptiness to it. I trotted it down to the data center at The Company, powered it up, and set it to constantly read from the hard drives, as I suspected those were the failure point.
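For the curious, a continuous read test of that sort can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the command that was actually run on the server: the names `read_pass` and `read_forever` are made up for this example, and the path handed to them (a raw device or a large file) is whatever you want exercised.

```python
import sys


def read_pass(path, chunk_size=1024 * 1024):
    """Read `path` from start to end once and return the bytes read.

    A failing drive is expected to surface as an OSError from read().
    """
    total = 0
    # buffering=0 gives unbuffered reads, so each read() goes to the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file/device
                break
            total += len(chunk)
    return total


def read_forever(path):
    """Repeat full read passes until one of them raises an I/O error."""
    passes = 0
    while True:
        try:
            total = read_pass(path)
        except OSError as err:
            print(f"pass {passes + 1}: read error: {err}", file=sys.stderr)
            raise
        passes += 1
        print(f"pass {passes}: {total} bytes read, no errors")
```

Calling `read_forever()` on a path loops until a read fails, which is the whole point: leave it running and check back later. Reading a raw device generally requires root.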

Today I checked in on the server and found spewed all over the console window:

eth0: command 0x5800 did not complete! Status=0xffff
eth0: Resetting the Tx ring pointer.
NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, tx_status ff status ffff.
  diagnostics: net ffff media ffff dma ffffffff.
eth0: Transmitter encountered 16 collisions -- network cable problem?
eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
  Flags; bus-master 1, dirty 47985(1) current 48001(1)
  Transmit list ffffffff vs. f7cea240.
eth0: command 0x3002 did not complete! Status=0xffff
  0: @f7cea200  length 8000002a status 8000002a
  1: @f7cea240  length 8000002a status 0000002a
  2: @f7cea280  length 8000002a status 0000002a
  3: @f7cea2c0  length 8000002a status 0000002a
  4: @f7cea300  length 8000002a status 0000002a
  5: @f7cea340  length 8000002a status 0000002a
  6: @f7cea380  length 8000002a status 0000002a
  7: @f7cea3c0  length 8000002a status 0000002a
  8: @f7cea400  length 8000002a status 0000002a
  9: @f7cea440  length 8000002a status 0000002a
  10: @f7cea480  length 8000002a status 0000002a
  11: @f7cea4c0  length 8000002a status 0000002a
  12: @f7cea500  length 8000002a status 0000002a
  13: @f7cea540  length 8000002a status 0000002a
  14: @f7cea580  length 8000002a status 0000002a
  15: @f7cea5c0  length 8000002a status 8000002a

wait_on_irq, CPU 0:
irq:  0 [ 0 0 ]
bh:   1 [ 0 1 ]
Stack dumps:
CPU 1:00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
Call Trace:   

CPU 0:c1f31f2c 00000000 00000000 ffffffff 00000000 c0109a5d c029d094 00000000 
       f6a2c000 c0108d22 00000000 c02c5ce4 c02001a4 ecf7a000 c1f31f98 c0115621 
       ecf7a000 f6a2c368 c02c5ce4 c1f31f8c c1f30664 c1f30000 c011c5cf f6a2c000 
Call Trace:    [<C0109A5D>] [<C0108D22>] [<C02001A4>] [<C0115621>] [<C011C5CF>]
  [<C0125234>] [<C0125070>] [<C0105000>] [<C010588B>] [<C0125070>]

eth0, by the way, is the network interface.
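With that much output scrolling past, it can help to tally the distinct complaints rather than read every line. The following is a rough sketch, assuming the console output has been saved as text; the `PATTERNS` table and the `tally` function are hypothetical names for this example, and the patterns cover only the message wording visible in the dump above.

```python
import re
from collections import Counter

# Patterns taken from the message wording in the console dump above.
PATTERNS = {
    "command did not complete": re.compile(r"command 0x[0-9a-f]+ did not complete"),
    "transmit timed out": re.compile(r"transmit timed out"),
    "collisions": re.compile(r"encountered \d+ collisions"),
    "interrupt not delivered": re.compile(r"Interrupt posted but not delivered"),
}


def tally(lines, iface="eth0"):
    """Count how often each known complaint appears for one interface."""
    counts = Counter()
    for line in lines:
        if iface not in line:
            continue  # ignore lines that do not name the interface
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Feeding it the lines of a saved log (for example, `tally(open(path))` for some hypothetical `path`) gives a quick count per message type, which makes it easier to see that everything points at the network interface and nothing at the drives.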

The hard drives were still running the test I left them to, with no errors whatsoever. Dan, the network engineer, did admit to borrowing the network cable that was plugged into that server earlier this morning, and that was enough to trigger the slew of errors I saw (and I did thank him for doing that; otherwise I don't think I would have ever found what the problem might have been).

Well (and it's here I wish I had remembered to take pictures).

The rack-mounted case the system is in is not very tall, and certainly not tall enough to house a PCI card standing upright. So in one of the PCI slots was a special card with PCI slots on it, such that you could mount cards parallel to the motherboard (instead of perpendicular). Only you couldn't use the existing mounting bracket on the PCI card in question. So the mounting bracket on the network card had been removed, leaving the card held in place only by the friction between the card and the slot itself.

Not a stable connection.

Problems with this server only really started when some equipment was removed from the cabinet it was in, and I'm guessing the network cable was bumped just enough to make life interesting.

On the plus side, that means the server doesn't need to be replaced or even rebuilt (thankfully!).
