Assert yourself

Tuesday, November 06, 2018

Defensive programming is often touted as a better coding style, but it hides bugs. Remember, the errors we're talking about should never happen, and by safely handling them, you make it harder to write bug-free code.

… you don't want to hide bugs by programming defensively …

… No matter where you employ the defensive style, ask youself, “Am I hiding bugs in this code by using defensive programming?” If you might be, add some assertions to alert you to those bugs.

Writing Solid Code

Writing Solid Code is the rare book that forced me to change how I go about programming. I feel I'm in the minority, but after reading that book, I hate defensive programming. Don't get me wrong—at the input/output boundary, you need to be absolutely paranoid about checking data, but among functions? Not so paranoid.

And now class, story time …

“Project: Cleese” was installed onto the QA system the other day, and by chance today, I noticed a core file produced by said program. This was odd, since both I and T (the QA engineer assigned to our team) had tested the program without incident.

I was able to isolate the crash to freeaddrinfo(), a function used to release memory used by getaddrinfo() when converting a domain name like “boston.conman.org” to an IP address. A summary of the code in question:

struct addrinfo  hints;
struct addrinfo *results;
const char      *hostname;
const char      *port;
int              rc;

memset(&hints,0,sizeof(hints));

results  = NULL;
hostname = ... ;
port     = ... ;

// code code ;

rc = getaddrinfo(hostname,port,&hints,&results);

// code code

for ( ; results != NULL ; results = results->ai_next)
{
  if (results->ai_protocol == protocol)
  {
    // code code
  }
}

freeaddrinfo(results);

It's a rookie mistake but hey, it happens. The issue is that results is linked list of results, which is traversed. By the time freeaddrinfo() is called, results is now NULL. Under Linux and Mac OS-X, it seems that freeaddrinfo() checks if it's given a NULL pointer and … just does nothing if it is. It doesn't crash, but it does leak memory (it's not much in this case, since this function is only called once upon startup, but a leak is still a leak). Linux and Mac OS-X use defensive programming, probably something along the lines of:

void freeaddrinfo(struct addrinfo *info)
{
  if (info == NULL)
    return;
  // code code code 
}

which hid a bug. Solaris (which we have to use for reasons) is not so forgiving and immedately crashed.

So on Linux and Mac OS-X, how would one even test for this type of issue? The code doesn't crash. It returns results. Yes, valgrind can easily find it:

[spc]lucy:/tmp>valgrind --leak-check=full --show-reachable=yes `which lua` x.lua
==31304== Memcheck, a memory error detector.
==31304== Copyright (C) 2002-2005, and GNU GPL'd, by Julian Seward et al.
==31304== Using LibVEX rev 1575, a library for dynamic binary translation.
==31304== Copyright (C) 2004-2005, and GNU GPL'd, by OpenWorks LLP.
==31304== Using valgrind-3.1.1, a dynamic binary instrumentation framework.
==31304== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
==31304== For more details, rerun with: -v
==31304== 
==31304== 
==31304== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 25 from 2)
==31304== malloc/free: in use at exit: 48 bytes in 1 blocks.
==31304== malloc/free: 521 allocs, 520 frees, 43,016 bytes allocated.
==31304== For counts of detected errors, rerun with: -v
==31304== searching for pointers to 1 not-freed blocks.
==31304== checked 117,892 bytes.
==31304== 
==31304== 48 bytes in 1 blocks are definitely lost in loss record 1 of 1
==31304==    at 0x4004405: malloc (vg_replace_malloc.c:149)
==31304==    by 0xC40B5D: gaih_inet (in /lib/tls/libc-2.3.4.so)
==31304==    by 0xC445DC: getaddrinfo (in /lib/tls/libc-2.3.4.so)
==31304==    by 0x400A5FA: ???
==31304==    by 0x804EF69: luaD_precall (in /usr/local/bin/lua)
==31304==    by 0x80589E0: luaV_execute (in /usr/local/bin/lua)
==31304==    by 0x804F27C: luaD_call (in /usr/local/bin/lua)
==31304==    by 0x804F2BD: luaD_callnoyield (in /usr/local/bin/lua)
==31304==    by 0x804D145: f_call (in /usr/local/bin/lua)
==31304==    by 0x804E8AB: luaD_rawrunprotected (in /usr/local/bin/lua)
==31304==    by 0x804F6EB: luaD_pcall (in /usr/local/bin/lua)
==31304==    by 0x804D1A7: lua_pcallk (in /usr/local/bin/lua)
==31304== 
==31304== LEAK SUMMARY:
==31304==    definitely lost: 48 bytes in 1 blocks.
==31304==      possibly lost: 0 bytes in 0 blocks.
==31304==    still reachable: 0 bytes in 0 blocks.
==31304==         suppressed: 0 bytes in 0 blocks.
[spc]lucy:/tmp>

but given that “Project: Cleese” is written in Lua, a garbage collected language, memory leaks weren't foremost in the mind when testing. Had freeaddrinfo() on Linux (and Mac OS-X) not been so forgiving (or defensive) then this bug would most likely have been found immediately, and not hidden in the codebase for over five years! (I checked the history of the code in question—it had been there a long time—way before “Project: Cleese” was even started)

It is because of bugs like this that I am not a fan of defensive programming. They can hide. They can fester. They can be a nightmare to debug at 3:00 am on a production system sans a debugger.

The Boston Diaries

Tuesday, November 06, 2018

Assert yourself

Obligatory Picture

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous