The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Friday, January 07, 2022

“You can't fix the bug until you have filled out form 27B/6.”

I swear, I just cannot adjust to the new development process. Last month, as I was starting work on “Project: Bradenburg,” I ended up using GCC 11, and just on a whim, I decided to try GCC 11 on “Project: Lumbergh” and that's when I found an issue—“Project: Lumbergh” crashed immediately. It took about ten minutes to track down the issue—undefined behavior! Effectively, “Project: Lumbergh” was taking a pointer to a variable in an inner scope and using it outside said scope (don't worry, there's an example of what I'm talking below). I duly recorded a bug in Jira, but did not check in the patch as the other developer, CZ, was doing a bunch of work on “Project: Lumbergh” and I wanted to run the bug by him first just so he knew about it. In retrospect, I think I should have just checked in the patch and asked forgiveness, because the ensuing “process” nearly killed me.

First off, the bug basically languished due to the holliday season and everybody taking vacation and what not. When I brought up the bug on Monday, CZ demanded a test case to show it failing. Problem—there was no reason to create a specific test case, because any test would have shown the crash. Second problem—it's undefined behavior—it only shows (as far as I know) when compiled with GCC 11. We don't use GCC 11 in production, and the GCC version we use in production produces code that runs, thus not finding the bug any ealier.

CZ tried using valgrind to locate the issue and couldn't. Of course CZ couldn't—the executable worked. As I tried to state, it only fails when compiled with GCC 11 (which we as a company don't use yet).

I spent the week trying to convince him to just accept the patch because it removed undefined behavior. CZ was adamant in trying to reproduce the issue for himself. Because with our Corporate Overlords, all bugs must be reproducable, must have a test case, and must pass said test case, before it will be patched.

Even in the case of blatant undefined behavior!

ARE YOU XXXXX­XX KIDDING ME? IS OUR CODE SO XXXXX­XX FRAGILE THAT MOVING THREE VARIABLES DECLARATIONS WILL XXXXX­XX BREAK THE PROGRAM? IS THAT WHAT YOU ARE SO XXXXX­XX AFRAID OF? XXXXX­XXXXX­XXXXX­XXXXX!

[Calm down! Breath! In … out … in … out. —Editor]

Sigh.

Eventually, I was given the go-ahead to submit the patch. Here was my revision notes:

Bug fix XXXXX­X—remove undefined behavior

This was found using GCC 11, and it's very hard to reproduce, since the code invoked undefined behavior by storing pointers to data in a scope that technically no longer exists when it's referenced.

The reason this is hard to find depends upon how the compiler lays out the stack frame for the given function. It can, for instance, collect all the variable created at all the scopes in the function and set aside space for all of them at once, which in that case the code will work without issue. Or it could collect all the variables from all the scopes, and create just enough space for all the variables, allowing the variables from one scope to reuse memory for another scope, in which case, it may work, or not, depending upon intervening scopes. Or, the compiler can create the scoped variables when the code enters the scope, in which case the code may lead to undeterministic behavior, depending upon the code following the scope.

An example:

struct bar
{
  char b[MAX];
};

struct foo
{
  char        f[MAX];
  struct bar *sub;
};

static int foo(A a, B b, C c)
{
  struct foo f;	  
  bool       flag;
  
  flag = to_display(f.f,sizeof(f.f),a);
  if (flag)
  {
    struct bar b;
    
    b.b = to_other_display(b.b,sizeof(b.b),b,c);
    f.sub = &b;
  }
  else
    f.sub = NULL;
    
  if (check_this())
  {
    int x = tointeger(b);
    char buf[MAX];
    
    to_c_display(buf,sizeof(buf),c); 
    syslog(LOG_DEBUG,"check=%d c=%s",x,buf);
  }
    
  do_something(&f);
  return 0;
}

In this sample function, compiler A may collect all the local variables into a single block at the start of the function:

struct foo f;
bool       flag;
struct bar b;
int        x;
char       buf[MAX];

and even though this provokes undefined behavior, it will still work, and valgrind won't find an issue. Compiler B might create space for all the variables, but could reuse the space for the two sub blocks:

struct foo f;
bool       flag;
union
{
  struct bar b;
  struct { int x ; char buf[MAX]; }
};

(not valid C syntax, but it should give the intent I'm going for). In this case, the space pointed to by b could be overwritten if check_this() returns true. Compiler C may only create the space as needed, so the resulting assembly code may look something like:

foo	
	push	rbp
	mov	rbp,rsp
	sub	rsp,sizeof(struct foo) + sizeof(flag) + padding
	…
	cmp	al,[rbp+flag]
	jz	foo_skip

	sub	rsp,sizeof(struct bar) + padding
	…
	lea	rbx,[rbp - f];
	lea	rdx,[rbp - b];
	mov	[rbx + sub],rdx
	…
	add	rsp,sizeof(struct bar) + padding

foo_skip:
	…
	call	check_this
	tst	al
	jz	foo_skip2

	sub	rsp,sizeof(int) + sizeof(buf) + padding
	…
	add	rsp,sizeof(int) + sizeof(buf) + padding

foo_skip2:

	call	do_something

And in this case, if check_this() even returns false, data in struct bar could be overwritten just with the call to check_this() itself.

This is the very definition of “undefined behavior” and in this case, it's might have been hard to find, even with tools like valgrind, since it really depends upon the code generated by the compiler. I think we're lucky in that GCC 11 layed out the code such that it crashed and thus, I was able to isolate the issue long before it becomes important.

Obligatory Picture

[The future's so bright, I gotta wear shades]

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.