Friday, January 11, 2013
Did I repeat myself? Did I repeat myself? Did I repeat myself? Let me fix that.
“Did you mean to post four copies of your post?” asked Bunny.
“No, why?”
“Because there are four copies of your post, that's why,” said Bunny.
“How do you link to the post in question like that when you speak?” I asked. She shook her head and walked away. “Bunny?”
So yes, there was a glitch last night. The only way for four copies to show up like that would be for four copies to be sent via email (which is my preferred interface). And in checking the logs, I did indeed see four copies show up:
Jan 11 05:03:58 brevard postfix/local: 084412EB72E4: to=<XXXXXXXXXXXXXXXXXXXXX>, relay=local, delay=3, status=deferred (Command died with signal 6: "/home/spc/web/sites/boston.conman.org/htdocs/boston.cgi --config /home/spc/web/sites/boston.conman.org/journal/boston.cnf --email --cmd new". Command output: *** glibc detected *** double free or corruption (fasttop): 0x09cf8d38 *** )
Jan 11 05:27:10 brevard postfix/local: 084412EB72E4: to=<XXXXXXXXXXXXXXXXXXXXX>, relay=local, delay=1395, status=deferred (Command died with signal 6: "/home/spc/web/sites/boston.conman.org/htdocs/boston.cgi --config /home/spc/web/sites/boston.conman.org/journal/boston.cnf --email --cmd new". Command output: *** glibc detected *** double free or corruption (fasttop): 0x08646d38 *** )
Jan 11 06:00:29 brevard postfix/local: 084412EB72E4: to=<XXXXXXXXXXXXXXXXXXXXX>, relay=local, delay=3394, status=deferred (Command died with signal 6: "/home/spc/web/sites/boston.conman.org/htdocs/boston.cgi --config /home/spc/web/sites/boston.conman.org/journal/boston.cnf --email --cmd new". Command output: *** glibc detected *** double free or corruption (fasttop): 0x0934bd38 *** )
Jan 11 07:07:09 brevard postfix/local: 084412EB72E4: to=<XXXXXXXXXXXXXXXXXXXXX>, relay=local, delay=7394, status=sent (delivered to command: /home/spc/web/sites/boston.conman.org/htdocs/boston.cgi --config /home/spc/web/sites/boston.conman.org/journal/boston.cnf --email --cmd new)
The fact that four copies showed up means that mod_blog crashed when exiting. I find it disturbing that I've been running the latest version since the 1st and this is the first time it crashed. Not only that, but it crashed three times in a row, then worked. Such bugs are terrifying.
So, inspired by the crash reports in Redis, I decided to roll my own and it was surprisingly easy given the output (from a test of the code):
CRASH: pid=6103 signal='Aborted'
CRASH: reason='Unspecified/untranslated error'
CRASH: CS=0073 DS=C02E007B ES=007B FS=0000 GS=0033
CRASH: EIP=00B767A2 EFL=00000246 ESP=BFE162F8 EBP=BFE1630C ESI=000017D7 EDI=00CBBFF4
CRASH: EAX=00000000 EBX=000017D7 ECX=000017D7 EDX=00000006
CRASH: UESP=BFE162F8 TRAPNO=00000000 ERR=00000000
CRASH: STACK DUMP
CRASH: BFE162F8: C5 78 BB 00 F4 BF CB 00
CRASH: BFE16300: F4 BF CB 00 00 00 00 00 C0 C6 F0 B7 3C 64 E1 BF
CRASH: BFE16310: 29 93 BB 00 06 00 00 00 20 63 E1 BF 00 00 00 00
CRASH: BFE16320: 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16360: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16370: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE163A0: A7 C4 CB 00 A8 C4 CB 00 F4 BF CB 00 A7 C4 CB 00
CRASH: BFE163B0: 60 C4 CB 00 D8 63 E1 BF 73 D0 C4 00 F4 BF CB 00
CRASH: BFE163C0: 4F E7 BE 00 02 00 00 00 B8 C6 54 08 42 00 00 00
CRASH: BFE163D0: B8 C6 54 08 42 00 00 00 14 64 E1 BF 9C E9 BE 00
CRASH: BFE163E0: 60 C4 CB 00 B8 C6 54 08 42 00 00 00 32 30 31 34
CRASH: BFE163F0: A7 C4 CB 00 00 00 00 00
CRASH: STACK TRACE
CRASH: /home/spc/source/boston/build/boston[0x805d3c4]
CRASH: /home/spc/source/boston/build/boston[0x805daa2]
CRASH: /lib/tls/libc.so.6[0xbb79b0]
CRASH: /lib/tls/libc.so.6(abort+0xe9)[0xbb9329]
CRASH: /lib/tls/libc.so.6(__assert_fail+0x101)[0xbb0e41]
CRASH: /home/spc/source/boston/build/boston[0x8053f30]
CRASH: /home/spc/source/boston/build/boston[0x805d0b9]
CRASH: /home/spc/source/boston/build/boston(ChunkProcessStream+0xc8)[0x805d32e]
CRASH: /home/spc/source/boston/build/boston(ChunkProcess+0xcb)[0x805d240]
CRASH: /home/spc/source/boston/build/boston(generic_cb+0x74)[0x8053a68]
CRASH: /home/spc/source/boston/build/boston(pagegen_days+0x1c4)[0x8057f5c]
CRASH: /home/spc/source/boston/build/boston(generate_pages+0xd0)[0x8057bfc]
CRASH: /home/spc/source/boston/build/boston[0x80501bb]
CRASH: /home/spc/source/boston/build/boston(main_cli+0x1e5)[0x80500cd]
CRASH: /home/spc/source/boston/build/boston(main+0x133)[0x8051eb3]
CRASH: /lib/tls/libc.so.6(__libc_start_main+0xd3)[0xba4e93]
CRASH: /home/spc/source/boston/build/boston[0x804cdbd]
POSIX allows you to register a signal handler that receives additional contextual information (via sigaction() with the SA_SIGINFO flag). This contextual information includes the contents of the CPU registers, and it's just a matter of finding the structure definition in a header file. Once I have the registers, it's not hard to get the stack dump.
The stack trace was even easier than that. It seems that the Linux C library comes with a few functions to do just that—backtrace() collects the return addresses off the stack; backtrace_symbols() will resolve the addresses back to function names and backtrace_symbols_fd() does the same, only it writes the information to a file. I ended up using backtrace_symbols_fd() since I'm executing in the context of a signal handler, and I know that open(), write() and close() are marked as “async-signal-safe” (that is, safe to call in a signal handler) by POSIX, but malloc() and free() are not, and backtrace_symbols() is known to call malloc() (and thus, I would need to call free()) and I don't want to risk a crash in the crash report code. I'm taking a risk because backtrace() and backtrace_symbols_fd() could call non-async-signal-safe functions, but I'm doing as much as I can to get as much context as I can as safely as I can.
The utility of a crash report
Well, that was fast. I installed the new version of mod_blog, went out to dinner with Bunny, and came home to over 80 crash reports, reason: “Address not mapped for object.” Or in other words, “the program tried to access memory that didn't belong to it.”
Bad news—it wasn't every request, just certain ones. Even worse news—the logging server and the web server have a 3′50″ difference in time (that took me about fifteen minutes to realize). I also realized that even with stack dump and trace, there still wasn't enough contextual information to figure out what was going on.
I decided to dump the command line arguments (a hack—I have to store the arguments in a global variable and make crashreport(), a library function, depend upon these global variables, which is bad form) and the environment variables (less of a hack—the system already has these in a global variable). With that, I was able to track down the issue.
The issue—some script kiddies are trying to hit the webpage that creates webpages. This isn't normally an issue, since the POST method requires authentication, but these script kiddies were trying to create entries via GET, and the error that generated wasn't being properly checked (since that shouldn't happen—oops). And because of that, some uninitialized variables were dereferenced and boom! A crash report.
The surprising thing about all this is that Apache never once reported the CGI program crashing. To me, reporting a crashed CGI program seems like an obvious thing to do, but apparently not. So there's no telling how long this has been happening (yeah, you can tell I check the logs often, can't you?).
I do need to think about how to handle command line arguments in crashreport(). It's library code and thus shouldn't rely on global variables. I have to think about this …