The Boston Diaries

Tuesday, Debtember 01, 2009

A new version of the greylist daemon, in time for the holidays

Just in time for the Holiday Season I released a new version of my greylist daemon—it's the “Miscellaneous and Arbitrary Code Changes For the Christmas Holiday Season” version, although the code changes aren't quite that arbitrary. In fact, it's just a bunch of very small, somewhat arbitrary changes that don't really all fit into a single theme, unlike the previous versions.

So, if you are using it, you might want to check over the changes and see if you think it's time to upgrade.

Programs are buggy because error checking is tedious and error prone. Ironic, don't you think?

“There are no easy answers when it comes to reporting and handling errors.”
P. J. Plauger

I release a new version of the greylist daemon and what happens? Mark finds a bug.

Or rather, the new version didn't fix his current problem.

I take a look into the issue and it's not terribly surprising that I botched handling a particular error.

Sigh.

Error handling was taught by osmosis at college. It was expected that we just magically pick up on handling errors but really, as long as the program didn't outright crash and produced something vaguely like the output, all was fine.

In fact, when I expressly asked one of my instructors what to do when handling a particularly thorny error, I was told, point blank: “if you don't know how to handle the error, then don't check for it.” The instructor, sadly, had a point—you could go mad from trying to handle every possible error condition.

It's not hard to test for errors. In C, just about every function you can think of can return an error (with a few exceptions, like getpid() under Unix—if a process can't get its own process ID, then you have other things to worry about). But checking the return of every function call gets tedious, fast. What was once a small concise function:

int create_socket(struct sockaddr *paddr,socklen_t saddr)
{
  ListenNode          listen;
  struct epoll_event  ev;
  int                 reuse = 1;
  int                 rc;
  
  assert(paddr != NULL);
  assert(saddr > 0);
  
  listen = malloc(sizeof(struct listen_node));
  
  memset(listen,0,sizeof(struct listen_node));
  memcpy(&listen->local,paddr,saddr);
  
  listen->fn   = event_read;
  listen->sock = socket(paddr->sa_family,SOCK_DGRAM,0);
  
  setsockopt(listen->sock,SOL_SOCKET,SO_REUSEADDR,&reuse,sizeof(reuse));
  rc = fcntl(listen->sock,F_GETFL,0);
  fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK);
  bind(listen->sock,paddr,saddr);
  memset(&ev,0,sizeof(ev));

  ev.events   = EPOLLIN;
  ev.data.ptr = listen;
  epoll_ctl(g_queue,EPOLL_CTL_ADD,listen->sock,&ev);

  return listen->sock;
}

becomes twice as big as each function call is wrapped up in an if statement:

int create_socket(struct sockaddr *paddr,socklen_t saddr)
{
  ListenNode          listen;
  struct epoll_event  ev;
  int                 reuse = 1;
  int                 rc;
  
  assert(paddr != NULL);
  assert(saddr > 0);
  
  listen = malloc(sizeof(struct listen_node));
  if (listen == NULL)
    return -1;
    
  memset(listen,0,sizeof(struct listen_node));
  memcpy(&listen->local,paddr,saddr);
  
  listen->fn   = event_read;
  listen->sock = socket(paddr->sa_family,SOCK_DGRAM,0);
  
  if (listen->sock == -1)
  {
    perror("socket()");
    return -1;
  }
  
  if (setsockopt(listen->sock,SOL_SOCKET,SO_REUSEADDR,&reuse,sizeof(reuse)) == -1)
  {
    perror("setsockopt()");
    return -1;
  }

  if ((rc = fcntl(listen->sock,F_GETFL,0)) == -1)
  {
    perror("fcntl(GETFL)");
    return -1;
  }
  
  if (fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK) == -1)
  {
    perror("fcntl(SETFL)");
    return -1;
  }

  if (bind(listen->sock,paddr,saddr) == -1)
  {
    perror("bind()");
    return -1;
  }
  
  memset(&ev,0,sizeof(ev));
  ev.events   = EPOLLIN;
  ev.data.ptr = listen;
  
  if (epoll_ctl(g_queue,EPOLL_CTL_ADD,listen->sock,&ev) == -1)
  {
    perror("epoll_ctl(ADD)");
    return -1;
  }

  return listen->sock;
}

And even this isn't all that great. The function prints what happened (via the perror() call) but that presupposes that stderr (the standard error reporting file) is open! If it's not open, well then … if you're lucky, perror() (or whatever code it eventually calls) checks to see if stderr is open and if not, just … fails … gracefully? I guess? Hopefully? Maybe?
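One way to hedge against a closed stderr is to ask the kernel whether the descriptor is open before writing to it, and fall back to syslog() if it isn't. This is only a sketch: fd_is_open() and report_error() are names made up for illustration, and the check can't tell whether descriptor 2 is still the original stderr or some file or socket that happened to reuse the number.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd refers to an open descriptor. */
static int fd_is_open(int fd)
{
  assert(fd >= 0);
  return fcntl(fd,F_GETFD) != -1 || errno != EBADF;
}

/* Hypothetical helper: report to stderr if it's open, else to syslog. */
static void report_error(const char *what)
{
  int err = errno;   /* save it, since the calls below may change errno */

  assert(what != NULL);

  if (fd_is_open(STDERR_FILENO))
    fprintf(stderr,"%s: %s\n",what,strerror(err));
  else
    syslog(LOG_ERR,"%s: %s",what,strerror(err));

  errno = err;       /* leave errno as the caller saw it */
}
```

Calling report_error("socket()") in place of perror("socket()") gives the same message when stderr is available, and at least attempts to leave a trace when it isn't.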

But even if we could return what failed to the caller, then that means the implementation details get pushed out of the create_socket() function (which is typically considered a “bad idea”). Even setting that aside, what can the caller do? Well, not much really.

The socket() call could fail because there's not enough kernel memory, there are too many files already open, we don't have permission to create the socket, or the protocol isn't supported. Don't have the right privileges? If the process isn't root, there's not much that can be done (and if the process was root, there wouldn't be an issue). Not enough kernel memory? I wouldn't even know what to do in that case except kill off a few processes or reboot the box. But in each case, there isn't much the program can do except give up (and maybe attempt to log the error somewhere).

And all the other calls are pretty much the same—no memory, no privileges, can't do it, etc. We can break the errors down into a few categories:

programming errors, like EBADF (not an open file, or the operation couldn't be done given how the file was opened originally) or EINVAL (invalid parameter) that need to be fixed, but once fixed, never happen again;
errors that can be fixed, like EACCES (bad privileges) or ELOOP (too many symbolic links when trying to resolve a filename), but where the fix has to happen outside the scope of the program; once fixed, they tend not to happen again unless someone made a mistake;
better exit the program as quickly and cleanly as possible because something bad, like ENOMEM (insufficient kernel memory) just happened and things are going bad quickly. Depending upon the circumstances, a fast, hard crash might be the best thing to do;
and finally, the small category of errors that a program might be able to handle, like ENOENT (file doesn't exist) depending upon the context (it could then create the file, or ask the user for a different file, etc.).
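The four categories above could be written down as a small lookup. This is a rough sketch rather than anything from the greylist daemon: the errclass type and classify_errno() are invented for illustration, the mapping covers only the example codes named above, and treating every unlisted code as fatal is my assumption. The right category for a given code really depends on which call failed and why.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical categories, in the order listed above. */
typedef enum
{
  ERR_BUG,          /* programming error; fix the code             */
  ERR_EXTERNAL,     /* fixable, but only from outside the program  */
  ERR_FATAL,        /* exit as quickly and cleanly as possible     */
  ERR_RECOVERABLE   /* the program might be able to handle it      */
} errclass;

static errclass classify_errno(int err)
{
  assert(err != 0);   /* 0 means "no error"; asking about it is a bug */

  switch(err)
  {
    case EBADF:
    case EINVAL:
         return ERR_BUG;
    case EACCES:
    case ELOOP:
         return ERR_EXTERNAL;
    case ENOENT:
         return ERR_RECOVERABLE;
    case ENOMEM:
    default:          /* assumption: anything unlisted is treated as fatal */
         return ERR_FATAL;
  }
}
```

A caller would still need to know which function failed to act on the answer; ENOENT from open() is recoverable in a way that ENOENT from some deeper library call may not be.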

The problem being: a program can run for ages before you see an error (Mark ran my greylist daemon for a few years before the error manifested itself, and only then because the server operating system was upgraded—the bug he hit was of the last category—something it could have handled, but I didn't handle it properly) so it's not uncommon for error paths in programs to have, well, errors (ironically enough).

In fact, there are really only three ways to handle errors:

every subroutine returns an indication of success or failure (C uses this—the “calls filter downward, errors bubble upwards” model) and every call site needs a check;
subroutines can cause an exception, which immediately transfers program flow control to some earlier caller up the call stack; which caller gets the exception depends upon which caller is expecting which exception (C++ uses this—the “dynamic spaghettiesque come-from” model);
ignore errors entirely and assume everything will always work (you can do this in any programming language—it's from the Alfred E. Neuman “What? Me, worry?” school of programming).

Each method has its pros and cons and nobody is really happy with any of the methods, but really, that's all there is when you get down to it. The first is tedious, but doesn't require any special language features; the second requires support in the language, and even so, is really spaghetti code in hiding; and the third … well, again, no special language support is needed, it isn't tedious and it tends to make the code fast but, well … let's just say that the error recovery can make a programmer go postal.
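To make the first model concrete, here's a small sketch of “calls filter downward, errors bubble upwards.” Everything in it is hypothetical (the function names, the convention of returning 0 for success and an errno value for failure, and reporting read failures as EIO); the point is only that every level has to check and pass the result along, which is exactly the tedious part.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Lowest level: the only place that sees the actual failure. */
static int read_first_line(const char *path,char *buf,int size)
{
  FILE *fp;
  int   rc = 0;

  assert(path != NULL);
  assert(buf  != NULL);
  assert(size >  0);

  fp = fopen(path,"r");
  if (fp == NULL)
    return errno;          /* e.g. ENOENT or EACCES */

  if (fgets(buf,size,fp) == NULL)
  {
    if (ferror(fp))
      rc = EIO;            /* assumption: report read failures as EIO */
    else
      buf[0] = '\0';       /* empty file; not treated as an error here */
  }

  fclose(fp);
  return rc;
}

/* Middle level: nothing useful to do with the error, so pass it up. */
static int load_config(const char *path,char *line,int size)
{
  int rc = read_first_line(path,line,size);

  if (rc != 0)
    return rc;

  /* ... parse the line here ... */
  return 0;
}

/* Top level: finally somewhere that can decide what to do about it. */
static int startup(const char *path)
{
  char line[256];
  int  rc = load_config(path,line,sizeof(line));

  if (rc != 0)
  {
    fprintf(stderr,"%s: cannot load configuration (error %d)\n",path,rc);
    return rc;
  }

  return 0;
}
```

Forget one of those checks in the middle and the error silently vanishes, which is how the third model tends to sneak into programs written in the first.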

I just wish there was a better way …

Update on Wednesday, Debtember 2nd, 2009

I could claim I left finding the errors in the code above as an exercise for the reader, but really, I blew it.

Wednesday, Debtember 02, 2009

I told you handling errors was error prone

From: Mark Grosberg <XXXXXXXXXXXXXXXXX>
To: sean@conman.org
Subject: Blog post.
Date: Wed, 2 Dec 2009 15:59:55 -0500 (EST)

I find it even more amusing that you didn't get the error handling right in the create_socket() on your current blog post.
Notice that you leak the socket and/or memory in the error cases. I guess it really is hard to handle errors. ;-)
Sorry, I just had to take this cheap shot!
-MYG

Heh.

Yup, I blew it again for demonstration purposes.

The code I posted yesterday was actually pulled from a current project where the create_socket() is only called during initialization and if it fails, the program exits. Since I'm on a Unix system, the “lost” resources like memory and sockets are automatically reclaimed. Not all operating systems are nice like this.

There are a few ways to fix this. One is to use a language that handles such details automatically with garbage collection, but I'm using C so that's not an option. The second one is to add cleanup code at each point we exit, but using that we end up with code that looks like:

  /* ... */

  if ((rc = fcntl(listen->sock,F_GETFL,0)) == -1)
  {
    perror("fcntl(GETFL)");
    close(listen->sock);
    free(listen);
    return -1;
  }

  if (fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK) == -1)
  {
    perror("fcntl(SETFL)");
    close(listen->sock);
    free(listen);
    return -1;
  }

  if (bind(listen->sock,paddr,saddr) == -1)
  {
    perror("bind()");
    close(listen->sock);
    free(listen);
    return -1;
  }

  /* ... */

Lots of duplicated code and the more complex the routine, the more complex the cleanup and potential to leak memory (or other resources like files and network connections). The other option looks like:

  /* ... */

  if ((rc = fcntl(listen->sock,F_GETFL,0)) == -1)
  {
    perror("fcntl(GETFL)");
    goto create_socket_cleanup;
  }
  
  if (fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK) == -1)
  {
    perror("fcntl(SETFL)");
    goto create_socket_cleanup;
  }

  if (bind(listen->sock,paddr,saddr) == -1)
  {
    perror("bind()");
    goto create_socket_cleanup;
  }

  /* rest of code */

  return listen->sock; /* everything is okay */

create_socket_cleanup:
  close(listen->sock);
create_socket_cleanup_mem:
  free(listen);
  return -1;
}

This uses the dreaded goto construct, but is one of the few places that it's deemed “okay” to use goto, for cleaning up errors. No code duplication, but you need to make sure you clean up (or unwind, or whatever) in reverse order.
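For the curious, here's what a complete routine in that style might look like. It's a sketch, not the code from the project: struct listen_node is a stand-in definition (the real one isn't shown here), create_listener() is a made-up name, and I've left out the epoll registration so it only uses portable socket calls. It also returns the node rather than the socket, since without epoll holding the pointer, returning just the descriptor would leak the memory on success.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the real structure. */
struct listen_node
{
  struct sockaddr_storage local;
  int                     sock;
};

struct listen_node *create_listener(struct sockaddr *paddr,socklen_t saddr)
{
  struct listen_node *node;
  int                 reuse = 1;
  int                 rc;

  assert(paddr != NULL);
  assert(saddr >  0);
  assert(saddr <= sizeof(node->local));

  node = malloc(sizeof(struct listen_node));
  if (node == NULL)
    return NULL;               /* nothing acquired yet, nothing to undo */

  memset(node,0,sizeof(struct listen_node));
  memcpy(&node->local,paddr,saddr);

  node->sock = socket(paddr->sa_family,SOCK_DGRAM,0);
  if (node->sock == -1)
  {
    perror("socket()");
    goto cleanup_mem;          /* only the memory exists so far */
  }

  if (setsockopt(node->sock,SOL_SOCKET,SO_REUSEADDR,&reuse,sizeof(reuse)) == -1)
  {
    perror("setsockopt()");
    goto cleanup_sock;
  }

  if ((rc = fcntl(node->sock,F_GETFL,0)) == -1)
  {
    perror("fcntl(GETFL)");
    goto cleanup_sock;
  }

  if (fcntl(node->sock,F_SETFL,rc | O_NONBLOCK) == -1)
  {
    perror("fcntl(SETFL)");
    goto cleanup_sock;
  }

  if (bind(node->sock,paddr,saddr) == -1)
  {
    perror("bind()");
    goto cleanup_sock;
  }

  return node;                 /* everything is okay */

  /* undo in the reverse order of acquisition */
cleanup_sock:
  close(node->sock);
cleanup_mem:
  free(node);
  return NULL;
}
```

Each label undoes one acquisition and falls through to the next, so a failure jumps to the label matching the last thing that succeeded. Adding a new resource means adding one label, not editing every exit point.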

So yeah, error handling … maddening.

I still wish there was a better way …

This actually doesn't sound half bad …

It's no secret that we've been openly critical of the prices charged by automakers for built-in GPS navigation systems. Frankly, paying $2,000 or more for an in-dash system when you can buy stand-alone navigation units for as little as $100 is ridiculous. Even the newer, larger seven-inch screen units are now down to as little as $250, and even though they aren't tied in to a vehicles' wheel sensors, they tend to be plenty accurate. Now, however, there is a new option that is even cheaper – as in (sort of) free.

It's only "sort of" free because the Google maps turn-by-turn navigation app is built into the new Motorola Droid smartphone (see sister-site Engadget's full review of the Droid here) that recently became available from Verizon Wireless. In this case, you have to sign up for two years of mobile phone service, which includes a data plan. I've been a Verizon customer for a decade and just happened to be up for a biennial discounted phone upgrade. When the Droid appeared a few weeks ago, the plan to wait until the new year for a Palm Pre was discarded. We've now had the chance to play with the Droid and its new navigation software, so follow the jump to find out if it lives up to expectations.

Via Instapundit, Review: Google Maps turn-by-turn navigation on Android 2.0 — Autoblog

For Corsair, who loves his gadgets, and because he hates AT&T.

Personally, I wouldn't mind this. I think the combination of Google Maps and a GPS is wonderful, although I wouldn't use the turn-by-turn navigation (map view is fine by me, with spot checks with the street view to avoid potentially bad areas). I would also like a larger screen, but hey, you can't have everything.

Shop for a home, from your home

And speaking of Google Maps—I did not know you can now see real estate listings, along with photos, videos, Wikipedia articles and webcams.

Wow.

I'm both impressed, and a bit scared at what Google knows …

Update a few minutes later …

… and foreclosures …

Friday, Debtember 04, 2009

Has it really been that long already?

[Would you like some cake with that sugar?]

Today marks the 10th year of this blog. Not many bloggers have reached this milestone (how many bloggers do you know who started in 1999? I thought so).

I may not have posted consistently, but I'm still chugging away here 10 years later.

Tuesday, Debtember 08, 2009

The woodpeckers are coming

If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.

Sad to say that's the first thing that came to mind at the end of tonight's (or rather, this morning's) adventures.

Around midnight the Data Center In Boca Raton fell off the face of the Internet. I caught it just as it happened (checking things out on the new router I installed at a customer site some six hours earlier) and by the time I left a voice mail message to our upstream and talked to Smirk (he called as I was leaving the voice mail message), the Data Center In Boca Raton was back on the Internet.

Shortly after that, while scanning the logs from snmptrapd (I have all our routers sending SNMP traps to a central server), I got fed up with seeing stuff like:

2009-12-06 06:28:08 XXXXXXXXXXXXXXXXXXXXXXXXX [XXXXXXXXXXXXXX]:
SNMPv2-MIB::sysUpTime.0 = Timeticks: (125022780) 14 days, 11:17:07.80
SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::mib-2.14.16.2.10
SNMPv2-SMI::mib-2.14.1.1 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.7.1.1 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.7.1.2 = INTEGER: 0 
SNMPv2-SMI::mib-2.14.10.1.3 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.16.1.3 = INTEGER: 4
SNMPv2-SMI::mib-2.14.4.1.2 = INTEGER: 1 
SNMPv2-SMI::mib-2.14.4.1.3 = IpAddress: XXXXXXXXXXXXXX
SNMPv2-SMI::mib-2.14.4.1.4 = IpAddress: XXXXXXXXXXXXXX

(only all on one line). It makes it hard to figure out what the heck the router is complaining about and I wanted to change the format of the MIBs to make them easier to read. I changed the command line options to snmptrapd only to get:

/usr/sbin/snmptrapd: symbol lookup error: /usr/lib/libnetsnmpmibs.so.5:
undefined symbol: netsnmp_TCPIPv6Domain

Mind you, it took a good ten minutes of scratching my head over why /etc/init.d/snmptrapd start wasn't working before I tried running it at the command line.

All I know—it was running fine a few minutes before, but not now. I guess something changed in the 130 days since the server rebooted (my guess: a new version of snmptrapd without a corresponding new version of some library—did I mention I hate package managers?). No problem, as I had a locally installed copy in /usr/local/sbin/snmptrapd I could use.

I rebooted the server (it's a virtual server—takes less than a minute) when I noticed some odd issues with syslogd.

Okay, I'm not running the default syslog that comes with the distribution—no, I've been testing a homegrown syslog (which I will get around to talking about—it's quite cool) and it was basically hanging when starting up (enough that some program called minilogd was starting up, even though I have no XXXXXXX clue as to what is starting it—I can't find any reference to it in the startup scripts).

Eventually, I figure out it's blocking on a DNS lookup (I'm relaying syslog traffic to a centralized server, but that, as Alton Brown says, is another show), which is odd, because DNS hasn't been an issue.

I check, and I see I'm only using one of the two DNS resolvers we have.

I can't resolve.

I can ping the DNS server from the server I'm on.

I can ssh to the DNS server from the server I'm on.

I just can't resolve DNS queries.

Now, the DNS resolver and the server I'm on are both virtual servers.

On the same physical computer.

The other resolver?

That's a virtual server on another physical computer and yes, I can resolve fine using that (so I set the default DNS resolver to be the one that is working while I try to troubleshoot the current issue that shouldn't be happening).

We used to have an issue with some virtual servers using that virtual DNS resolver, but I thought we had that licked months ago.

Maybe it's back?

I check iptables everywhere and no … should be fine.

A couple of hours go by.

I've finally isolated the issue—the resolver itself can't resolve.

But the other one can.

It was then I noticed some odd messages being logged to syslog and coming from our monitoring system:

HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;14;
	(No Information Returned From Host Check)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;15;CRITICAL 
	- Host Unreachable (XXXXXXXXXXXXX)
HOST ALERT: XXXXXXXXXXXXXXX;DOWN;SOFT;16;CRITICAL 
	- Host Unreachable (XXXXXXXXXXXXX)

Hmm … our monitoring system in Charlotte can't reach our resolver … okay, let's do a traceroute from Charlotte to the resolver and—

OH XXXXX XXXXXXX XXXXX ON A XXXXXXX XXXX XXXXX! No wonder I'm having DNS issues—the netblock the resolvers are on isn't being announced! WXXX TXX FXXX‽

That little outage around midnight? Apparently our upstream's upstream had a slightly larger issue and couldn't route (what turned out to be) a few of our netblocks. We do have multiple connections to the Internet, but … well … it's a long story, but basically, just running BGP isn't enough—no, we have to send authorization emails to have the other provider announce our routes that normally go through the one that had (and was still having) issues.

Okay, so the problem(s) at hand. The fact that the netblock our DNS resolvers were on wasn't being announced would explain why the one resolver couldn't even resolve using itself; the other resolver probably had a larger working DNS cache and never had to send a query.

I swear, the number of “moving parts” a modern networked computer has to deal with is amazing, and it's amazing it works at all as well as it does, when it does. But man, when it breaks, it breaks and it's a bitch to troubleshoot (especially when you're doing it remotely—why even suspect the network in such a case?).

Tuesday, Debtember 22, 2009

Angels & Demons

Bunny and I saw “Angels and Demons” tonight, and I must say, it was way more enjoyable than the book, mainly because the movie excised some of the sillier aspects of the book, like the harrowing helicopter ride (and if you've read the book, you know of which I speak) and the death of the evil henchmen is handled better.

Oh, and no X-33 airplane (thankfully).

The backstory of the recently deceased Pope was removed, which was one aspect of the book I found interesting, but I can see why it was cut for the movie—simply no time to delve into it.

Overall, I enjoyed the movie way more than the book.

Then again, the book felt like a rough draft of a movie anyway.

Friday, Debtember 25, 2009

Festive Christian feast, good fellowship commemorating the rededication of the Temple of Jerusalem and a high-spirited African-American cultural festival

May your holiday be bright and merry.

Also, please enjoy David Sedaris talking about Christmas traditions in Europe (with some hunting traditions in Texas and Michigan—good stuff).
