Friday, January 11, 2013
Did I repeat myself? Did I repeat myself? Did I repeat myself? Let me fix that.
“Did you mean to post four copies of your post?” asked Bunny.
“No, why?”
“Because there are four copies of your post, that's why,” said Bunny.
“How do you link to the post in question like that when you speak?” I asked. She shook her head and walked away. “Bunny?”
So yes, there was a glitch last night. The only way for four copies to show up like that would be for four copies to be sent via email (which is my preferred interface). And in checking the logs, I did indeed see four copies show up:
Jan 11 05:03:58 brevard postfix/local: 084412EB72E4: to=<XXXXXXXXXXXXXXXXXXXXX>, relay=local, delay=3, status=deferred (Command died with signal 6: "/home/spc/web/sites/boston.conman.org/htdocs/boston.cgi --config /home/spc/web/sites/boston.conman.org/journal/boston.cnf --email --cmd new". Command output: *** glibc detected *** double free or corruption (fasttop): 0x09cf8d38 *** )
Jan 11 05:27:10 brevard postfix/local: 084412EB72E4: to=<XXXXXXXXXXXXXXXXXXXXX>, relay=local, delay=1395, status=deferred (Command died with signal 6: "/home/spc/web/sites/boston.conman.org/htdocs/boston.cgi --config /home/spc/web/sites/boston.conman.org/journal/boston.cnf --email --cmd new". Command output: *** glibc detected *** double free or corruption (fasttop): 0x08646d38 *** )
Jan 11 06:00:29 brevard postfix/local: 084412EB72E4: to=<XXXXXXXXXXXXXXXXXXXXX>, relay=local, delay=3394, status=deferred (Command died with signal 6: "/home/spc/web/sites/boston.conman.org/htdocs/boston.cgi --config /home/spc/web/sites/boston.conman.org/journal/boston.cnf --email --cmd new". Command output: *** glibc detected *** double free or corruption (fasttop): 0x0934bd38 *** )
Jan 11 07:07:09 brevard postfix/local: 084412EB72E4: to=<XXXXXXXXXXXXXXXXXXXXX>, relay=local, delay=7394, status=sent (delivered to command: /home/spc/web/sites/boston.conman.org/htdocs/boston.cgi --config /home/spc/web/sites/boston.conman.org/journal/boston.cnf --email --cmd new)
The fact that four copies showed up means that mod_blog crashed when exiting. I find it disturbing that I've been running the latest version since the 1st and this is the first time it crashed. Not only that, but it crashed three times in a row, then worked. Such bugs are terrifying.
So, inspired by the crash reports in Redis, I decided to roll my own and it was surprisingly easy given the output (from a test of the code):
CRASH: pid=6103 signal='Aborted'
CRASH: reason='Unspecified/untranslated error'
CRASH: CS=0073 DS=C02E007B ES=007B FS=0000 GS=0033
CRASH: EIP=00B767A2 EFL=00000246 ESP=BFE162F8 EBP=BFE1630C ESI=000017D7 EDI=00CBBFF4
CRASH: EAX=00000000 EBX=000017D7 ECX=000017D7 EDX=00000006
CRASH: UESP=BFE162F8 TRAPNO=00000000 ERR=00000000
CRASH: STACK DUMP
CRASH: BFE162F8: C5 78 BB 00 F4 BF CB 00
CRASH: BFE16300: F4 BF CB 00 00 00 00 00 C0 C6 F0 B7 3C 64 E1 BF
CRASH: BFE16310: 29 93 BB 00 06 00 00 00 20 63 E1 BF 00 00 00 00
CRASH: BFE16320: 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16360: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16370: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE16390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
CRASH: BFE163A0: A7 C4 CB 00 A8 C4 CB 00 F4 BF CB 00 A7 C4 CB 00
CRASH: BFE163B0: 60 C4 CB 00 D8 63 E1 BF 73 D0 C4 00 F4 BF CB 00
CRASH: BFE163C0: 4F E7 BE 00 02 00 00 00 B8 C6 54 08 42 00 00 00
CRASH: BFE163D0: B8 C6 54 08 42 00 00 00 14 64 E1 BF 9C E9 BE 00
CRASH: BFE163E0: 60 C4 CB 00 B8 C6 54 08 42 00 00 00 32 30 31 34
CRASH: BFE163F0: A7 C4 CB 00 00 00 00 00
CRASH: STACK TRACE
CRASH: /home/spc/source/boston/build/boston[0x805d3c4]
CRASH: /home/spc/source/boston/build/boston[0x805daa2]
CRASH: /lib/tls/libc.so.6[0xbb79b0]
CRASH: /lib/tls/libc.so.6(abort+0xe9)[0xbb9329]
CRASH: /lib/tls/libc.so.6(__assert_fail+0x101)[0xbb0e41]
CRASH: /home/spc/source/boston/build/boston[0x8053f30]
CRASH: /home/spc/source/boston/build/boston[0x805d0b9]
CRASH: /home/spc/source/boston/build/boston(ChunkProcessStream+0xc8)[0x805d32e]
CRASH: /home/spc/source/boston/build/boston(ChunkProcess+0xcb)[0x805d240]
CRASH: /home/spc/source/boston/build/boston(generic_cb+0x74)[0x8053a68]
CRASH: /home/spc/source/boston/build/boston(pagegen_days+0x1c4)[0x8057f5c]
CRASH: /home/spc/source/boston/build/boston(generate_pages+0xd0)[0x8057bfc]
CRASH: /home/spc/source/boston/build/boston[0x80501bb]
CRASH: /home/spc/source/boston/build/boston(main_cli+0x1e5)[0x80500cd]
CRASH: /home/spc/source/boston/build/boston(main+0x133)[0x8051eb3]
CRASH: /lib/tls/libc.so.6(__libc_start_main+0xd3)[0xba4e93]
CRASH: /home/spc/source/boston/build/boston[0x804cdbd]
POSIX allows you to register a signal handler that receives additional contextual information (via sigaction() with the SA_SIGINFO flag). This contextual information includes the contents of the CPU registers, and it's just a matter of finding the structure definition in a header file. Once I have the registers, it's not hard to get the stack dump.
The stack trace was even easier than that. It seems that the Linux C library comes with a few functions to do just that—backtrace() collects the return addresses off the stack; backtrace_symbols() will resolve the addresses back to function names and backtrace_symbols_fd() does the same, only it writes the information to a file. I ended up using backtrace_symbols_fd() since I'm executing in the context of a signal handler, and I know that open(), write() and close() are marked as “async-signal-safe” (that is, safe to call in a signal handler) by POSIX, but malloc() and free() are not, and backtrace_symbols() is known to call malloc() (and thus, I would need to call free()) and I don't want to risk a crash in the crash report code. I'm taking a risk because backtrace() and backtrace_symbols_fd() could call non-async-signal-safe functions, but I'm doing as much as I can to get as much context as I can as safely as I can.
The utility of a crash report
Well, that was fast. I installed the new version of mod_blog, went out to dinner with Bunny, and came home to over 80 crash reports, reason: “Address not mapped for object.” Or in other words, “the program tried to access memory that didn't belong to it.”
Bad news—it wasn't every request, just certain ones. Even worse news—the logging server and the web server have a 3′50″ difference in time (that took me about fifteen minutes to realize). I also realized that even with stack dump and trace, there still wasn't enough contextual information to figure out what was going on.
I decided to dump the command line arguments (a hack—I have to store the arguments in a global variable and make crashreport(), a library function, depend upon these global variables, which is bad form) and the environment variables (less of a hack—the system already has these in a global variable). With that, I was able to track down the issue.
The issue—some script kiddies are trying to hit the webpage that creates webpages. This isn't normally an issue, since the POST method requires authentication, but these script kiddies were trying to create entries via GET, and the error that generated wasn't being properly checked (since that shouldn't happen—oops). And because of that, some uninitialized variables were dereferenced and boom! A crash report.
The surprising thing about all this is that Apache never once reported the CGI program crashing. To me, reporting a crashed CGI program seems like an obvious thing to do, but apparently not. So there's no telling how long this has been happening (yeah, you can tell I check the logs often, can't you?).
I do need to think about how to handle command line arguments in crashreport(). It's library code and thus shouldn't rely on global variables. I have to think about this …