Friday, January 07, 2022
“You can't fix the bug until you have filled out form 27B/6.”
I swear, I just cannot adjust to the new development process. Last month, as I was starting work on “Project: Bradenburg,” I ended up using GCC 11, and just on a whim, I decided to try GCC 11 on “Project: Lumbergh” and that's when I found an issue—“Project: Lumbergh” crashed immediately. It took about ten minutes to track down the issue—undefined behavior! Effectively, “Project: Lumbergh” was taking a pointer to a variable in an inner scope and using it outside said scope (don't worry, there's an example of what I'm talking below). I duly recorded a bug in Jira, but did not check in the patch as the other developer, CZ, was doing a bunch of work on “Project: Lumbergh” and I wanted to run the bug by him first just so he knew about it. In retrospect, I think I should have just checked in the patch and asked forgiveness, because the ensuing “process” nearly killed me.
First off, the bug basically languished due to the holliday season and everybody taking vacation and what not. When I brought up the bug on Monday, CZ demanded a test case to show it failing. Problem—there was no reason to create a specific test case, because any test would have shown the crash. Second problem—it's undefined behavior—it only shows (as far as I know) when compiled with GCC 11. We don't use GCC 11 in production, and the GCC version we use in production produces code that runs, thus not finding the bug any ealier.
CZ tried using valgrind
to locate the issue and couldn't.
Of course CZ couldn't—the executable worked.
As I tried to state,
it only fails when compiled with GCC 11
(which we as a company don't use yet).
I spent the week trying to convince him to just accept the patch because it removed undefined behavior. CZ was adamant in trying to reproduce the issue for himself. Because with our Corporate Overlords, all bugs must be reproducable, must have a test case, and must pass said test case, before it will be patched.
Even in the case of blatant undefined behavior!
ARE YOU XXXXXXX KIDDING ME? IS OUR CODE SO XXXXXXX FRAGILE THAT MOVING THREE VARIABLES DECLARATIONS WILL XXXXXXX BREAK THE PROGRAM? IS THAT WHAT YOU ARE SO XXXXXXX AFRAID OF? XXXXXXXXXXXXXXXXXXXX!
[Calm down! Breath! In … out … in … out. —Editor]
Sigh.
Eventually, I was given the go-ahead to submit the patch. Here was my revision notes:
Bug fix XXXXXX—remove undefined behavior
This was found using GCC 11, and it's very hard to reproduce, since the code invoked undefined behavior by storing pointers to data in a scope that technically no longer exists when it's referenced.
The reason this is hard to find depends upon how the compiler lays out the stack frame for the given function. It can, for instance, collect all the variable created at all the scopes in the function and set aside space for all of them at once, which in that case the code will work without issue. Or it could collect all the variables from all the scopes, and create just enough space for all the variables, allowing the variables from one scope to reuse memory for another scope, in which case, it may work, or not, depending upon intervening scopes. Or, the compiler can create the scoped variables when the code enters the scope, in which case the code may lead to undeterministic behavior, depending upon the code following the scope.
An example:
struct bar { char b[MAX]; }; struct foo { char f[MAX]; struct bar *sub; }; static int foo(A a, B b, C c) { struct foo f; bool flag; flag = to_display(f.f,sizeof(f.f),a); if (flag) { struct bar b; b.b = to_other_display(b.b,sizeof(b.b),b,c); f.sub = &b; } else f.sub = NULL; if (check_this()) { int x = tointeger(b); char buf[MAX]; to_c_display(buf,sizeof(buf),c); syslog(LOG_DEBUG,"check=%d c=%s",x,buf); } do_something(&f); return 0; }In this sample function, compiler A may collect all the local variables into a single block at the start of the function:
struct foo f; bool flag; struct bar b; int x; char buf[MAX];and even though this provokes undefined behavior, it will still work, and valgrind won't find an issue. Compiler B might create space for all the variables, but could reuse the space for the two sub blocks:
struct foo f; bool flag; union { struct bar b; struct { int x ; char buf[MAX]; } };(not valid C syntax, but it should give the intent I'm going for). In this case, the space pointed to by b could be overwritten if check_this() returns true. Compiler C may only create the space as needed, so the resulting assembly code may look something like:
foo push rbp mov rbp,rsp sub rsp,sizeof(struct foo) + sizeof(flag) + padding … cmp al,[rbp+flag] jz foo_skip sub rsp,sizeof(struct bar) + padding … lea rbx,[rbp - f]; lea rdx,[rbp - b]; mov [rbx + sub],rdx … add rsp,sizeof(struct bar) + padding foo_skip: … call check_this tst al jz foo_skip2 sub rsp,sizeof(int) + sizeof(buf) + padding … add rsp,sizeof(int) + sizeof(buf) + padding foo_skip2: call do_somethingAnd in this case, if check_this() even returns false, data in struct bar could be overwritten just with the call to check_this() itself.
This is the very definition of “undefined behavior” and in this case, it's might have been hard to find, even with tools like valgrind, since it really depends upon the code generated by the compiler. I think we're lucky in that GCC 11 layed out the code such that it crashed and thus, I was able to isolate the issue long before it becomes important.