Saturday, January 07, 2017
“Every time the shipper takes away a pallet from the shipping room, the server times out within two seconds.”
Last year at my job we had a pretty severe problem that was just as unexplainable.
The day after an unscheduled closing (a hurricane), I started getting calls from users complaining about database connection timeouts. Since I had a very simple network with fewer than 32 nodes and barely any bandwidth in use, it was quite scary that I could ping the database server for 15-20 minutes and then get "request timed out" for about 2 minutes. I had performance monitors etc. running on the server and was pinging it from multiple sources. Pretty much every machine except the server was able to talk to the others constantly. I tried to isolate a faulty switch or a bad connection, but there was no way to explain the random yet periodic failures.
I asked my coworker to observe the lights on a switch in the warehouse while I ran trace routes and unplugged different devices. After 45-50 minutes on the walkie-talkie with him saying "ya it's down, ok it's back up," I asked if he noticed any patterns. He said, "Yeah… I did. But you're going to think I'm nuts. Every time the shipper takes away a pallet from the shipping room, the server times out within 2 seconds." I said "WHAT???" He said "Yeah. And the server comes back up once he starts processing the next order."
Via Hacker News, chime comments on The case of the 500-mile email
This is every bit as amusing as the 500-mile email, and it shows that some problems can be very hard to track down, especially when the root cause isn't in the code at all.
I'm fortunate in that I've never had to debug such issues.
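The key move in the story is correlating the server's outages with timestamps of real-world events. That part is easy to automate with a small up/down logger; here's a minimal sketch in Python, assuming a Linux-style ping(1) and a placeholder server address (neither is from the original story):

#!/usr/bin/env python3
# A hypothetical up/down logger -- not the tooling from the story.
# It pings a host once a second and prints a timestamped line whenever
# the host changes state, so outages can be lined up against other events.
# Assumes a Linux-style ping(1); HOST is a placeholder address.

import subprocess
import time
from datetime import datetime

HOST = "192.168.0.10"   # placeholder for the database server's address
INTERVAL = 1            # seconds between probes

def host_is_up(host):
    """Send one ICMP echo request; treat a non-zero exit or a hang as 'down'."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

previous = None
while True:
    up = host_is_up(HOST)
    if up != previous:
        state = "UP" if up else "DOWN"
        print(f"{datetime.now().isoformat(timespec='seconds')}  {HOST} is {state}", flush=True)
        previous = up
    time.sleep(INTERVAL)

Run something like this on a couple of machines at once and you get the same "it's down, ok it's back up" log the coworker was reading off the switch lights, with timestamps you can line up against the shipping schedule.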