Reading my email at work, I see this one from E, my boss:
[Our department] just observed a momentary connectivity problem with the [city 1] data center. This was caused by the [city 2] circuit bouncing and traffic being momentarily routed through [city 3]. The circuit restored itself and appears normal at this time, however the [primary] NOC has opened [a trouble ticket] and is monitoring this circuit for additional problems.
I don't know why this upset me so much. I mean, the other email I got tonight was much worse than this (again, critical of the work done on third shift, like what else is new) but I think it has something to do with a misconception of just how the Internet works.
The closest city mentioned is well over 600 miles away, on The Company backbone (technically, we're on The Company backbone, but being way down here in South Florida, it's more of a spur than a backbone. A fast spur yes, but a spur nonetheless). And on a network the size we have here (even the network here in Boca Raton is a sight to see and I don't think it's the largest one The Company has) some transitory glitches are to be expected. I mean, why else have redundant routes?
I think what bugs me is that most of our clients (heck, even our fellow employees) don't realize that IP, the layer the rest of TCP/IP rides on, is an unreliable (read: best effort) protocol and there are no guarantees on delivery of individual packets. And that if the routers are configured correctly (and we have some pretty sharp network engineers working here; I mean, The Company does run a backbone here), one router going down won't affect connectivity, since there are redundant routes throughout the routing mesh. Okay, it may take a few minutes, but that's to be expected as routing tables restabilize.
Let me say that again: it may take a few minutes!
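To make the redundant routes point concrete, here is a minimal sketch in Python. The four-router mesh and its names (boca, city1, city2, city3) are made up for illustration and are not The Company's actual topology, and a plain breadth-first search stands in for whatever routing protocol the real routers run. It only shows the end state once the routing tables have settled, not the few minutes it takes to get there.

```python
from collections import deque

def shortest_path(links, src, dst, down=frozenset()):
    """Return a fewest-hops path from src to dst, skipping routers in `down`.

    Returns None when no path exists.
    """
    if src in down or dst in down:
        return None
    previous = {src: None}          # router -> router we reached it from
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the back-pointers to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        # sorted() only makes the result deterministic for this demo.
        for neighbor in sorted(links.get(node, ())):
            if neighbor not in previous and neighbor not in down:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical mesh: two independent ways to get from boca to city1.
MESH = {
    "boca": {"city2", "city3"},
    "city2": {"boca", "city1"},
    "city3": {"boca", "city1"},
    "city1": {"city2", "city3"},
}

normal = shortest_path(MESH, "boca", "city1")
rerouted = shortest_path(MESH, "boca", "city1", down={"city2"})
print(normal)    # path while everything is up
print(rerouted)  # path while city2 is down
```

With city2 out of the picture the traffic simply goes through city3 instead, which is the whole reason for paying for the second circuit.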
Like email delivery. We get calls from customers complaining that it's taking fifteen minutes for email to get through. Thankfully I'm not on the phones, lest I be tempted to tell these people that email is not an instant messaging protocol and that it's still faster than the alternative, the snail mail postal service. Depending upon network congestion, it may take several hours for email to get through.
I'm fond of saying “Email is not FTP.” I may have to amend that to be: “Email is not FTP nor IM!”
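A rough sketch of why email can lag like that: mail is store-and-forward, so when the receiving server can't be reached, the sending server queues the message and tries again later. The function name and the numbers below are assumptions for illustration (a fixed 30 minute retry interval and a five day give-up time, loosely following the guidance in the SMTP specification); real mail servers use their own retry schedules.

```python
def delivery_delay(outage_minutes, retry_minutes=30, give_up_minutes=5 * 24 * 60):
    """Minutes until the first delivery attempt that succeeds.

    Assumes the receiving server is unreachable for `outage_minutes`
    after the message is queued, and that the sender retries on a fixed
    interval (an assumption; real schedules vary). Returns None if the
    sender gives up and bounces the message first.
    """
    elapsed = 0
    while elapsed <= give_up_minutes:
        if elapsed >= outage_minutes:
            return elapsed          # this attempt gets through
        elapsed += retry_minutes    # wait in the queue, then try again
    return None

print(delivery_delay(0))             # no outage at all
print(delivery_delay(10))            # a ten minute glitch
print(delivery_delay(10 * 24 * 60))  # a ten day outage
```

Under these assumptions a ten minute glitch on the far end turns into a thirty minute delay, because nothing happens until the next retry comes around.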
This is also related to an incident that happened a few days ago. Just as 1st shift was coming in, we lost connectivity with a machine we manage over in Europe. Doing a traceroute showed the loss of connectivity happening within Cable and Wireless, about three hops into their network. My thought—okay, it's not us, just inform the techs that we can't reach the machine in question and it's outside our hands.
Yet JM, a 1st shift worker, came in and opened a trouble ticket with our primary NOC, even though the outage was in a different backbone, several hops inside. For some reason that bothered me too, I think for similar reasons. Given the amount of email we get about outages of routers all along our backbone, you'd think our networking staff would have enough work just on our equipment; why bother adding to the case load about other companies' equipment? But alas, JM did it anyway, as part of the CYA attitude around here.