Friday, January 13, 2006
Monday, Part II
As if Tuesday wasn't bad enough (so were Wednesday and Thursday, but those involved things outside of work), I arrive at The Office to find one machine spewing out a sustained 20Mbps of traffic, and another one not responding at all.
For the one vomiting over the network (a customer's colocated box), I logged in and saw the following errors:
eth0: Transmit error, Tx status register 82.
Probably a duplex mismatch. See Documentation/networking/vortex.txt
  Flags; bus-master 1, dirty -1756485798(10) current -1756485798(10)
  Transmit list 00000000 vs. f7ed8480.
  0: @f7ed8200 length 80000042 status 00010042
  1: @f7ed8240 length 8000002e status 0001002e
  2: @f7ed8280 length 800005d6 status 000105d6
  3: @f7ed82c0 length 80000548 status 00010548
  4: @f7ed8300 length 80000036 status 00010036
  5: @f7ed8340 length 8000003e status 0001003e
  6: @f7ed8380 length 80000042 status 00010042
  7: @f7ed83c0 length 800005ea status 000105ea
  8: @f7ed8400 length 800005ea status 000105ea
  9: @f7ed8440 length 8000004e status 8001004e
  10: @f7ed8480 length 800005d6 status 000105d6
  11: @f7ed84c0 length 800005d6 status 000105d6
  12: @f7ed8500 length 800005d6 status 000105d6
  13: @f7ed8540 length 800005d6 status 000105d6
  14: @f7ed8580 length 800005d6 status 000105d6
  15: @f7ed85c0 length 80000042 status 00010042
Over and over again, for days, ever since it was rebooted last week. So I check Documentation/networking/vortex.txt:
Transmit error, Tx status register 82
This is a common error which is almost always caused by another host on the same network being in full-duplex mode, while this host is in half-duplex mode. You need to find that other host and make it run in half-duplex mode or fix this host to run in full-duplex mode.
As a last resort, you can force the 3c59x driver into full-duplex mode with
options 3c59x full_duplex=1
but this has to be viewed as a workaround for broken network gear and should only really be used for equipment which cannot autonegotiate.
I know Dan the network engineer prefers not to have auto-negotiation as he's seen problems with it, so I call him up and ask him to set the switch port the machine is plugged into to 100Mbps auto-negotiate. Once he did that:
eth0: Setting full-duplex based on MII #24 link partner capability of 4101.
And the traffic immediately cleared up. I suspect that when the box was first plugged in, the switch port was set to auto-negotiate, and that some time afterwards Dan set the port not to negotiate; since the negotiation had already happened, no errors were reported. It was only after the box was rebooted last week that the problem cropped up.
The second box was also a colocated machine, and the customer there just told me to power cycle it. Instead of just power cycling the machine, I hooked the crash cart up to it, just to see if any error messages were on the screen.
Only the screen was blank.
Couldn't get it to unblank at all.
The machine must be hosed, so I power cycled it.
And that's when I saw the other video interface.
Sigh.
The rest of the day was pretty much like that: just as I cleared one problem, another would pop up. All day I was busy with problems.
It was like Monday Tuesday all over again …
More updates on the tarpit
LaBrea is actually logging about half a gig a day. Over a 24 hour period (from about 6 am Thursday to 6 am today) I'm tarpitting 82,359 connections across 2,059 unique IP addresses (24,252 connections from a single IP address). And while the number of network ports being accessed has increased a bit, it's the Microsoft-specific ports that are still the most popular targets (with 72% of the scans):
Port # | Port description | # connections |
---|---|---|
139 | NetBIOS Session Service | 24941 |
445 | Microsoft-DS Service | 23013 |
1433 | Microsoft SQL Server | 6772 |
4899 | Remote Administration | 5620 |
135 | Microsoft-RPC service | 4722 |
80 | Hypertext Transfer Protocol | 3697 |
8080 | Hypertext Transfer Protocol—typical alternative port | 1686 |
7212 | (unknown) | 1683 |
8000 | (unknown) | 1471 |
10000 | (some web based control panels use this port) | 951 |
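The 72% figure can be checked against the table with a few lines of Python. This is only a sketch: it assumes the "Microsoft-specific" ports are 135, 139, 445 and 1433 (a reading of the port descriptions, not something spelled out above), and the names `connections` and `microsoft_share` are made up for the illustration.

```python
# Connection counts copied from the table above (port -> connections).
connections = {
    139: 24941, 445: 23013, 1433: 6772, 4899: 5620, 135: 4722,
    80: 3697, 8080: 1686, 7212: 1683, 8000: 1471, 10000: 951,
}

# Every tarpitted connection in the 24 hour period, not just the top ten ports.
TOTAL_CONNECTIONS = 82359

# Assumption: these four ports are the ones counted as Microsoft-specific.
MICROSOFT_PORTS = (135, 139, 445, 1433)


def microsoft_share(counts, total, ports=MICROSOFT_PORTS):
    """Return the percentage of all connections aimed at the given ports."""
    return 100.0 * sum(counts[port] for port in ports) / total


if __name__ == "__main__":
    print(f"{microsoft_share(connections, TOTAL_CONNECTIONS):.1f}%")
```

Under that assumption the four ports add up to 59,448 of the 82,359 connections, which rounds to the 72% quoted.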
The program I'm using to generate the stats is written in Perl, and it took about 4 hours to run over a day's worth of data (the machine that does the tarpitting isn't the fastest machine we have, but it's more than enough to dedicate to just running LaBrea). I definitely want to write a program to process LaBrea data in real time.
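As a rough idea of what such a program might look like, here is a minimal single-pass sketch in Python (not the Perl script mentioned above). It assumes each connection record contains the text `SRC_IP SRC_PORT -> DST_IP DST_PORT`; that pattern is an assumption and would need adjusting to whatever the LaBrea log actually contains. The function name `tally` is likewise hypothetical.

```python
import re
import sys
from collections import Counter

# Assumption: a connection record contains "SRC_IP SRC_PORT -> DST_IP DST_PORT".
# Adjust this pattern to match the real log lines.
CONNECTION = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3}) (\d+) -> (\d{1,3}(?:\.\d{1,3}){3}) (\d+)"
)


def tally(lines):
    """Count connections per destination port and per source address in one pass."""
    by_port = Counter()
    by_source = Counter()
    for line in lines:
        match = CONNECTION.search(line)
        if match is None:
            continue  # not a connection record, skip it
        src_ip, _src_port, _dst_ip, dst_port = match.groups()
        by_port[int(dst_port)] += 1
        by_source[src_ip] += 1
    return by_port, by_source


if __name__ == "__main__":
    # Reads the log line by line from standard input, so memory use depends on
    # the number of distinct ports and addresses, not on the size of the log.
    by_port, by_source = tally(sys.stdin)
    print(f"{sum(by_port.values())} connections from {len(by_source)} addresses")
    for port, count in by_port.most_common(10):
        print(f"{port} | {count}")
```

Because it only keeps two counters and reads one line at a time, the same loop could be fed from a pipe instead of a finished log file, though printing running totals as the data arrives would take a bit more work.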