Friday, October 16, 2009
“Red Alert!” “Where? I don't see any lerts around here!”
Yesterday felt like Friday, mainly because The Weekly Meeting™ was canceled at the last minute, thus I didn't have to leave The Home Office.
But it was still a weekday today when I got a call rather early from Smirk. “Sean!” he yelled through the phone into my ear, “not only do we have a customer down, but we have network traces that show it's our equipment that's down!” I could hear the Klaxons™ blaring in the background.
I mumbled something about calling him back, wandered over to The Home Office, and started a Level Five Diagnostic Program poking into the routers.
Oct 16 11:06:59.490 EDT: %RSP-3-ERROR: MD error 0080000080000000 -Traceback= 60385460 60385B24 60385C48 60386614 603513F4
Oct 16 11:06:59.494 EDT: %RSP-3-ERROR: SRAM parity error (bytes 0:7) 80 -Traceback= 60385538 60385B24 60385C48 60386614 603513F4
Oct 16 11:06:59.494 EDT: %VIP2 R5K-3-MSG: slot1 VIP-3-SVIP_CYBUSERROR_INTERRUPT: A Cybus Error occured.
Oct 16 11:06:59.498 EDT: %VIP2 R5K-1-MSG: slot1 CYASIC Error Interrupt register 0xC
Oct 16 11:06:59.502 EDT: %VIP2 R5K-1-MSG: slot1 Parity Error internal to CYA
Oct 16 11:06:59.506 EDT: %VIP2 R5K-1-MSG: slot1 Parity Error in data from CyBus
Oct 16 11:06:59.514 EDT: %VIP2 R5K-1-MSG: slot1 CYASIC Other Interrupt register 0x100
Oct 16 11:06:59.518 EDT: %VIP2 R5K-1-MSG: slot1 QE HIGH Priority Interrupt
Oct 16 11:06:59.522 EDT: %VIP2 R5K-1-MSG: slot1 QE RX HIGH Priority Interrupt
Oct 16 11:06:59.526 EDT: %VIP2 R5K-1-MSG: slot1 CYBUS Error Cmd/Addr 0x8001A00
Oct 16 11:06:59.530 EDT: %VIP2 R5K-1-MSG: slot1 MPUIntfc/PacketBus Error register 0x0
Oct 16 11:06:59.534 EDT: %VIP2 R5K-3-MSG: slot1 VIP-3-SVIP_PMAERROR_INTERRUPT: A PMA Error occured.
Oct 16 11:06:59.538 EDT: %VIP2 R5K-1-MSG: slot1 PA Bay 0 Upstream PCI-PCI Bridge, Handle=0
Oct 16 11:06:59.542 EDT: %VIP2 R5K-1-MSG: slot1 DEC21050 bridge chip, config=0x0
Oct 16 11:06:59.546 EDT: %VIP2 R5K-1-MSG: slot1 (0x00):dev, vendor id = 0x00011011
Oct 16 11:06:59.550 EDT: %VIP2 R5K-1-MSG: slot1 (0x04):status, command = 0x02800147
Oct 16 11:06:59.554 EDT: %VIP2 R5K-1-MSG: slot1 (0x08):class code, revid = 0x06040002
Oct 16 11:06:59.562 EDT: %VIP2 R5K-1-MSG: slot1 (0x0C):hdr, lat timer, cls = 0x00010000
Oct 16 11:06:59.566 EDT: %VIP2 R5K-1-MSG: slot1 (0x18):sec lat,cls & bus no = 0x08010100
Oct 16 11:06:59.570 EDT: %VIP2 R5K-1-MSG: slot1 (0x1C):sec status, io base = 0x22807020
Oct 16 11:06:59.574 EDT: %VIP2 R5K-1-MSG: slot1 Received Master Abort on secondary bus
Oct 16 11:06:59.578 EDT: %VIP2 R5K-1-MSG: slot1 (0x20):mem base & limit = 0x01F00000
Oct 16 11:06:59.582 EDT: %VIP2 R5K-1-MSG: slot1 (0x24):prefetch membase/lim = 0x0000FE00
Oct 16 11:06:59.586 EDT: %VIP2 R5K-1-MSG: slot1 (0x3C):bridge ctrl = 0x00030000
Oct 16 11:06:59.590 EDT: %VIP2 R5K-1-MSG: slot1 (0x40):arb/serr, chip ctrl = 0x00100000
Oct 16 11:06:59.594 EDT: %VIP2 R5K-1-MSG: slot1 (0x44):pri/sec trgt wait t. = 0x00000000
Oct 16 11:06:59.598 EDT: %VIP2 R5K-1-MSG: slot1 (0x48):sec write attmp ctr = 0x00FFFFFF
Oct 16 11:06:59.606 EDT: %VIP2 R5K-1-MSG: slot1 (0x4C):pri write attmp ctr = 0x00FFFFFF
Oct 16 11:06:59.610 EDT: %VIP2 R5K-1-MSG: slot1 PA Bay 1 Upstream PCI-PCI Bridge, Handle=1
Oct 16 11:06:59.614 EDT: %VIP2 R5K-1-MSG: slot1 DEC21050 bridge chip, config=0x0
Oct 16 11:06:59.618 EDT: %VIP2 R5K-1-MSG: slot1 (0x00):dev, vendor id = 0x00011011
Oct 16 11:06:59.622 EDT: %VIP2 R5K-1-MSG: slot1 (0x04):status, command = 0x02800147
Oct 16 11:06:59.626 EDT: %VIP2 R5K-1-MSG: slot1 (0x08):class code, revid = 0x06040002
Oct 16 11:06:59.630 EDT: %VIP2 R5K-1-MSG: slot1 (0x0C):hdr, lat timer, cls = 0x00010000
Oct 16 11:06:59.634 EDT: %VIP2 R5K-1-MSG: slot1 (0x18):sec lat,cls & bus no = 0x08020200
Oct 16 11:06:59.638 EDT: %VIP2 R5K-1-MSG: slot1 (0x1C):sec status, io base = 0x2280F0A0
Oct 16 11:06:59.642 EDT: %VIP2 R5K-1-MSG: slot1 Received Master Abort on secondary bus
Oct 16 11:06:59.650 EDT: %VIP2 R5K-1-MSG: slot1 (0x20):mem base & limit = 0x03F00200
Oct 16 11:06:59.654 EDT: %VIP2 R5K-1-MSG: slot1 (0x24):prefetch membase/lim = 0x0000FE00
Oct 16 11:06:59.658 EDT: %VIP2 R5K-1-MSG: slot1 (0x3C):bridge ctrl = 0x00030000
Oct 16 11:06:59.662 EDT: %VIP2 R5K-1-MSG: slot1 (0x40):arb/serr, chip ctrl = 0x00100000
Oct 16 11:06:59.666 EDT: %VIP2 R5K-1-MSG: slot1 (0x44):pri/sec trgt wait t. = 0x00000000
Oct 16 11:06:59.670 EDT: %VIP2 R5K-1-MSG: slot1 (0x48):sec write attmp ctr = 0x00FFFFFF
Oct 16 11:06:59.674 EDT: %VIP2 R5K-1-MSG: slot1 (0x4C):pri write attmp ctr = 0x00FFFFFF
Oct 16 11:06:59.678 EDT: %VIP2 R5K-3-MSG: slot1 VIP-3-SVIP_RELOAD: SVIP Reload is called.
Oct 16 11:06:59.690 EDT: %VIP2 R5K-3-MSG: slot1 VIP-3-SYSTEM_EXCEPTION: VIP System Exception occurred sig=22, code=0x0, context=0x60A8D368
Oct 16 11:07:01.714 EDT: %DBUS-3-CXBUSERR: Slot 1, CBus Error
Oct 16 11:07:01.714 EDT: %DBUS-3-DBUSINTERRSWSET: Slot 1, Internal Error due to VIP crash
Oct 16 11:07:01.718 EDT: %RSP-3-ERROR: End of MEMD error interrupt processing -Traceback= 60385BF0 60385C48 60386614 603513F4
Oct 16 11:07:01.842 EDT: %DBUS-3-CXBUSERR: Slot 1, CBus Error
Oct 16 11:07:01.842 EDT: %DBUS-3-DBUSINTERRSWSET: Slot 1, Internal Error due to VIP crash
Oct 16 11:07:05.599 EDT: %CBUS-3-CMDTIMEOUT: Cmd timed out, CCB 0x5800FF20, slot 0, cmd code 2 -Traceback= 603E3AF8 603E3FA4 603DC230 603D9F70 603A1E8C 6034462C 602FA6F0 603241C8 603241B4
Oct 16 11:07:07.623 EDT: %LINK-3-UPDOWN: Interface FastEthernet0/0, changed state to down
Oct 16 11:07:08.687 EDT: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/0, changed state to down
Oct 16 11:07:16.527 EDT: %RSP-3-RESTART: cbus complex
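If I wanted a quick sense of just how often the thing was falling over, a small script over a saved copy of the log would do it. Here's a rough sketch, not anything we actually ran: the file name and the signature strings are just assumptions based on the dump above.

#!/usr/bin/env python
# Rough sketch: count how often each crash signature from the dump above
# shows up in a saved copy of the router's log.  "router.log" is a
# hypothetical capture file, not anything on The Company's boxes.

from collections import Counter

SIGNATURES = (
    "%RSP-3-ERROR",               # SRAM parity / MEMD errors on the RSP
    "SVIP_CYBUSERROR_INTERRUPT",  # CyBus errors on the VIP in slot 1
    "SVIP_RELOAD",                # the VIP card reloading itself
    "%DBUS-3-CXBUSERR",           # CBus errors reported for slot 1
)

counts = Counter()
with open("router.log") as log:
    for line in log:
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1

for sig, n in counts.most_common():
    print("%5d  %s" % (n, sig))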
I've never seen that happen to a Cisco router before.
I called Smirk back. “Smirk, you better get G [our Cisco consultant —Editor] on the phone to diagnose this issue. I'm out of my league.”
And thus began a few hours of scrambling to get a replacement router for the customer. By the time I got onsite with a temporary replacement, I was told it was too late to do the swap (the current router was still limping along, crashing about every half hour) and that they were planning on taking down the network at 11:00 tomorrow, so I could do the replacement then.
Wonderful!
Worse, this isn't the first time this router has had problems. A few weeks ago we had what appeared to be a similar issue and ended up replacing one of the interfaces (the errors weren't nearly as scary then). I remarked at the time that I had never seen a Cisco router go bad: I've been working with Smirk at The Company for five years now and this is a first, and I never came across a bad Cisco router when I worked at a webhosting company in the late 90s, nor at either of the two ISPs I worked for (one in the mid 90s, one around the turn of the century). Smirk has also told the customer, multiple times since then, that they need a redundant router, but the request never made it past a certain level of management.
Sigh.