If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.
Sad to say, that's the first thing that came to mind at the end of tonight's (or rather, this morning's) adventures.
Around midnight the Data Center In Boca Raton fell off the face of the Internet. I caught it just as it happened (checking things out on the new router I installed at a customer site some six hours earlier), and by the time I had left a voice mail message for our upstream and talked to Smirk (he called as I was leaving the voice mail message), the Data Center In Boca Raton was back on the Internet.
Shortly after that, I was scanning the logs from
(I have all our routers sending SNMP traps to a central server) I got fed up with seeing
2009-12-06 06:28:08 XXXXXXXXXXXXXXXXXXXXXXXXX [XXXXXXXXXXXXXX]: SNMPv2-MIB::sysUpTime.0 = Timeticks: (125022780) 14 days, 11:17:07.80 SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::mib-18.104.22.168.10 SNMPv2-SMI::mib-22.214.171.124 = IpAddress: XXXXXXXXXXXXXX SNMPv2-SMI::mib-126.96.36.199.1 = IpAddress: XXXXXXXXXXXXXX SNMPv2-SMI::mib-188.8.131.52.2 = INTEGER: 0 SNMPv2-SMI::mib-184.108.40.206.3 = IpAddress: XXXXXXXXXXXXXX SNMPv2-SMI::mib-220.127.116.11.3 = INTEGER: 4 SNMPv2-SMI::mib-18.104.22.168.2 = INTEGER: 1 SNMPv2-SMI::mib-22.214.171.124.3 = IpAddress: XXXXXXXXXXXXXX SNMPv2-SMI::mib-126.96.36.199.4 = IpAddress: XXXXXXXXXXXXXX
(only on one line). It makes it hard to figure out what the heck the
router is complaining about, and I wanted to change the format of the MIBs to make them easier to
read. I changed the command line options to
snmptrapd, only to get
/usr/sbin/snmptrapd: symbol lookup error: /usr/lib/libnetsnmpmibs.so.5: undefined symbol: netsnmp_TCPIPv6Domain
Mind you, it took a good ten minutes of scratching my head over why
/etc/init.d/snmptrapd start wasn't working before I tried running it at
the command line.
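Independent of whatever formatting options snmptrapd itself offers, a small filter can make a one-line trap like the one above readable by putting each variable binding on its own line. What follows is only a sketch in Python, not something from the original setup: the function name `split_trap_line`, the sample host, and the sample OIDs are made up for illustration, and it assumes the log layout shown above (timestamp, host, bracketed address, then a run of `name = Type: value` pairs).

```python
import re

# A varbind starts with "MODULE::name = "; split on the whitespace before it.
# An OID that appears as a *value* (after "OID: ") is not followed by " = ",
# so it stays attached to its own varbind.
VARBIND_START = re.compile(r"\s+(?=[\w-]+::\S+ = )")


def split_trap_line(line):
    """Return (header, varbinds) for a one-line snmptrapd log entry."""
    header, sep, body = line.partition("]: ")
    if not sep:
        # Not in the assumed layout; hand the line back untouched.
        return line.strip(), []
    return header + "]", VARBIND_START.split(body.strip())


if __name__ == "__main__":
    # Hypothetical sample: host, address, and OIDs are placeholders.
    sample = (
        "2009-12-06 06:28:08 router.example [192.0.2.1]: "
        "SNMPv2-MIB::sysUpTime.0 = Timeticks: (125022780) 14 days, 11:17:07.80 "
        "SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::mib-1.2.3 "
        "SNMPv2-SMI::mib-4.5.6 = INTEGER: 4"
    )
    header, varbinds = split_trap_line(sample)
    print(header)
    for varbind in varbinds:
        print("    " + varbind)
```

This does nothing about translating the numeric OIDs into names; it only breaks the line up so the individual values can be picked out by eye.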
All I know—it was running fine a few minutes before, but not now. I
guess something changed in the 130 days since the server rebooted (my guess:
a new version of
snmptrapd without a corresponding new version
of some library—did I mention I hate package managers?). No problem, as I
had a locally installed copy in
I rebooted the server (it's a virtual server—takes less than a minute)
when I noticed some odd issues with
Okay, I'm not running the default
syslog that comes with the
distribution. No, I've been testing a homegrown one (which
I will get around to talking about—it's quite cool) and it was basically
hanging when starting up (enough that some program called
minilogd was starting up, even though I have no XXXXXXX clue as to what is starting it—I can't find any
reference to it in the startup scripts).
Eventually, I figure out it's blocking on a DNS lookup (I'm relaying
syslog traffic to a
centralized server, but that, as Alton Brown says, is another show),
which is odd, because DNS
hasn't been an issue.
I check, and I see I'm only using one of the two DNS resolvers we have.
I can't resolve.
I can ping the DNS server from the server I'm on.
I can ssh to the DNS server from the server I'm on.
I just can't resolve DNS queries.
Now, the DNS resolver and the server I'm on are both virtual servers.
On the same physical computer.
The other resolver?
That's a virtual server on another physical computer and yes, I can resolve fine using that (so I set the default DNS resolver to be the one that is working while I try to troubleshoot the current issue that shouldn't be happening).
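The check described above, asking each resolver directly rather than going through whatever the system is configured to use, can be sketched with nothing but the Python standard library. This is an illustration under assumptions, not the commands actually used that night: the function names are invented here, the two resolver addresses are documentation placeholders (the real ones are not given), and the query is a plain recursive A lookup over UDP.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record, recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels)
    return header + qname + b"\x00" + struct.pack("!HH", 1, 1)  # A, IN


def reply_ok(reply, query_id=0x1234):
    """True if the reply matches our id, is a response, has RCODE 0,
    and carries at least one answer record."""
    if len(reply) < 12:
        return False
    rid, flags, _qd, ancount = struct.unpack("!HHHH", reply[:8])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return rid == query_id and is_response and rcode == 0 and ancount > 0


def resolver_answers(resolver_ip, name="example.com", timeout=3.0):
    """Send one query straight to resolver_ip; False on timeout or error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (resolver_ip, 53))
            reply, _addr = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return False
    return reply_ok(reply)


if __name__ == "__main__":
    # Placeholder addresses; substitute the two resolvers being compared.
    for ip in ("192.0.2.53", "198.51.100.53"):
        print(ip, "answers" if resolver_answers(ip) else "does NOT answer")
```

A resolver that answers ping and ssh but fails this check is reachable and simply not resolving, which is exactly the confusing state described here.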
We used to have an issue with some virtual servers using that virtual DNS resolver, but I thought we had that licked months ago.
Maybe it's back?
iptables everywhere and no … should be fine.
A couple of hours go by.
I've finally isolated the issue—the resolver itself can't resolve.
But the other one can.
It was then I noticed some odd messages being logged to
syslog and coming from our monitoring system:
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;14; (No Information Returned From Host Check)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;15;CRITICAL - Host Unreachable (XXXXXXXXXXXXX)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;16;CRITICAL - Host Unreachable (XXXXXXXXXXXXX)
Hmm … our monitoring system in Charlotte can't reach our resolver …
okay, let's do a
traceroute from Charlotte to the resolver
OH XXXXX XXXXXXX XXXXX ON A XXXXXXX XXXX XXXXX! No wonder I'm having DNS issues—the netblock the resolvers are on isn't being announced! WXXX TXX FXXX‽
That little outage around midnight? Apparently our upstream's upstream had a slightly larger issue and couldn't route (what turned out to be) a few of our netblocks. We do have multiple connections to the Internet, but … well … it's a long story, but basically, just running BGP isn't enough; no, we have to send authorization emails to have the other provider announce our routes that normally go through the one that had (and was still having) issues.
Okay, so the problem(s) at hand. The fact that the netblock our DNS resolvers were on wasn't being announced would explain why the one resolver couldn't even resolve using itself; the other resolver probably had a larger working DNS cache and never had to send a query.
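The cache theory is a guess, but the reasoning behind it is easy to show with a toy model: two resolvers, both cut off from the outside world, differing only in what they already have cached. Everything in this Python sketch is hypothetical (the class, the names, the addresses), and it ignores TTLs and everything else a real resolver does.

```python
class CachingResolver:
    """Toy resolver: answers from cache, otherwise must ask upstream."""

    def __init__(self, upstream_reachable, cache=None):
        self.upstream_reachable = upstream_reachable
        self.cache = dict(cache or {})

    def resolve(self, name):
        if name in self.cache:
            # Cache hit: no packet ever leaves the machine, so an
            # unannounced netblock makes no difference.
            return self.cache[name]
        if not self.upstream_reachable:
            # Cache miss: replies from the outside world have no route back.
            raise LookupError("cannot reach upstream servers for " + name)
        raise NotImplementedError("a real upstream query would happen here")


if __name__ == "__main__":
    warm = CachingResolver(False, cache={"www.example.com": "192.0.2.10"})
    cold = CachingResolver(False)

    print(warm.resolve("www.example.com"))  # served from cache
    try:
        cold.resolve("www.example.com")
    except LookupError as err:
        print("cold resolver:", err)
```

Both resolvers have the same broken path to the Internet; only the one with the empty cache shows it, which would match one resolver failing while the other looked perfectly healthy.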
I swear, the number of “moving parts” a modern networked computer has to deal with is amazing, and it's amazing it works at all as well as it does, when it does. But man, when it breaks, it breaks and it's a bitch to troubleshoot (especially when you're doing it remotely—why even suspect the network in such a case?).