The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, Debtember 08, 2009

The woodpeckers are coming

If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.

Sad to say that's the first thing that came to mind at the end of tonight's (or rather, this morning's) adventures.

Around midnight the Data Center In Boca Raton fell off the face of the Internet. I caught it just as it happened (checking things out on the new router I installed at a customer site some six hours earlier) and by the time I left a voice mail message to our upstream and talked to Smirk (he called as I was leaving the voice mail message), the Data Center In Boca Raton was back on the Internet.

Shortly after that, I was scanning the logs from snmptrapd (I have all our routers sending SNMP traps to a central server) I got fed up with seeing stuff like:

2009-12-06 06:28:08 XXXXXXXXXXXXXXXXXXXXXXXXX [XXXXXXXXXXXXXX]:
SNMPv2-MIB::sysUpTime.0 = Timeticks: (125022780) 14 days, 11:17:07.80
SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::mib-2.14.16.2.10
SNMPv2-SMI::mib-2.14.1.1 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.7.1.1 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.7.1.2 = INTEGER: 0 
SNMPv2-SMI::mib-2.14.10.1.3 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.16.1.3 = INTEGER: 4
SNMPv2-SMI::mib-2.14.4.1.2 = INTEGER: 1 
SNMPv2-SMI::mib-2.14.4.1.3 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.4.1.4 = IpAddress: XXXXXXXXXXXXXX

(only on one line). It makes it hard to figure out what the heck the router is complaining about and I wanted to change the format the MIBs to make them easier to read. I changed the command line options to snmptrapd only to get:

/usr/sbin/snmptrapd: symbol lookup error: /usr/lib/libnetsnmpmibs.so.5:
undefined symbol: netsnmp_TCPIPv6Domain

Mind you, it took a good ten minutes of scratching my head over why /etc/init.d/snmptrapd start wasn't before trying to run it at the command line.

All I know—it was running fine a few minutes before, but not now. I guess something changed in the 130 days since the server rebooted (my guess: a new version of snmptrapd without a corresponding new version of some library—did I mention I hate package managers?). No problem, as I had a locally installed copy in /usr/local/sbin/snmptrapd I could use.

I rebooted the server (it's a virtual server—takes less than a minute) when I noticed some odd issues with syslogd.

Okay, I'm not running the default syslog that comes with the distribution—no, I've been testing a homegrown syslog (which I will get around to talking about—it's quite cool) and it was basically hanging when starting up (enough that some program called minilogd was starting up, even though I have no XXXXXXX clue as to what is starting it—I can't find any reference to it in the startup scripts).

Eventually, I figure out it's blocking on a DNS lookup (I'm relaying syslog traffic to a centralized server, but that's, as Alton Brown says, is another show), which is odd, because DNS hasn't been an issue.

I check, and I see I'm only using one of the two DNS resolvers we have.

I can't resolve.

I can ping the DNS server from the server I'm on.

I can ssh to the DNS server from the server I'm on.

I just can't resolve DNS queries.

Now, the DNS resolver and the server I'm on are both virtual servers.

On the same physical computer.

The other resolver?

That's a virtual server on another physical computer and yes, I can resolve fine using that (so I set the default DNS resolver to be the one that is working while I try to troubleshoot the current issue that shouldn't be happening).

We used to have an issue with some virtual servers using that virtual DNS resolver, but I thought we had that licked months ago.

Maybe it's back?

I check iptables everywhere and no … should be fine.

A couple of hours go by.

I've finally isolated the issue—the resolver itself can't resolve.

But the other one can.

It was then I noticed some odd messages being logged to syslog and coming from our monitoring system:

HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;14;
	(No Information Returned From Host Check)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;15;CRITICAL 
	- Host Unreachable (XXXXXXXXXXXXX)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;16;CRITICAL 
	- Host Unreachable (XXXXXXXXXXXXX)

Hmm … our monitoring system in Charlotte can't reach our resolver … okay, let's do a traceroute from Charlotte to the resolver and—

OH XXXXX XXXXXXX XXXXX ON A XXXXXXX XXXX XXXXX! No wonder I'm having DNS issues—the netblock the resolvers are on isn't being announced! WXXX TXX FXXX

That little outtage around midnight? Apparently our upstream's upstream had a slightly larger issue and couldn't route (what turned out to be) a few of our netblocks. We do have multiple connections to the Internet, but … well … it's a long story, but basically, just running BGP isn't enough—no, we have to send authorization emails to have the other provider to announce our routes that normally go through the one that had (and was still having) issues.

Okay, so the problem(s) at hand. The fact that the netblock our DNS resolvers were on weren't being announced would explain why the one resolver couldn't even resolve using itself; the other resolver probably had a larger working DNS cache and never had to send a query.

I swear, the number of “moving parts” a modern networked computer has to deal with is amazing, and it's amazing it works at all as well as it does, when it does. But man, when it breaks, it breaks and it's a bitch to troubleshoot (especially when you're doing it remotely—why even suspect the network in such a case?).

Obligatory Picture

[It's a study in contrasts—digital camera contrasts]

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2018 by Sean Conner. All Rights Reserved.