The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Sunday, September 10, 2006

Playing a SAX

There's a project that might start up at The Company involving lots of XML and C programming, so I've been poking around libxml. I'm thinking I might even want to use this for mod_blog to validate HTML (since libxml has an HTML parser, and about a quarter of the time I blow the coding on an entry and have to fix it).

One problem that crops up is the difficulty of getting at errors while libxml is reading the document into memory. Sure, I can suck the HTML in with one call:

htmlDocPtr doc = htmlParseFile(filename,NULL);

(Yes, it is that simple.) But not seeing how to change the underlying error reporting mechanism (not that I looked all that hard), I decided to switch to the SAX interface for parsing. The SAX interface lets you register functions to be called at various points during the HTML (or even XML) parsing. Yes, I can grab the errors as they happen, but now I have to build the document in memory myself (more or less). But that's okay since, in theory, this will allow me not only to capture the errors but also to filter the HTML as I see fit.

Two things popped right out at me.

First, the callback when a tag is found:

void startElement(void           *user_data,
                  const xmlChar  *name,
                  const xmlChar **attrs);

void endElement(void          *user_data,
                const xmlChar *name);

In these callbacks, the name parameter is the name of the element. The attrs parameter contains the attributes for the start tag. The even indices in the array will be attribute names, the odd indices are the values, and the final index will contain a NULL.

Using the SAX Interface of LibXML (a tutorial)

Okay, seems simple enough. I write some code:

static void start_tag(void *data,const xmlChar *name,const xmlChar **attr)
{
  int i;

  /*--------------------------------------
  ; similar to printf() but functionally
  ; a bit better.
  ;
  ; And yes, this is how I format comments
  ; in C.
  ;--------------------------------------*/

  LineSFormat(StdoutStream,"$","<%a",name);

  for (i = 0 ; attr[i] != NULL ; i+= 2)
  {
    LineSFormat(StdoutStream,"$ $"," %a=\"%b\"",attr[i],attr[i+1]);
  }
}

And the first time this code runs it crashes.

It seems that the documentation is a bit misleading: attr is only valid if there are attributes. Otherwise a NULL is passed in, which means you have to explicitly check attr for NULL!

Aaaaah!

Would it have been that difficult for the authors of libxml to always pass in a valid attr, even if it's just two elements long with both containing NULL? (I suppose most programmers would check anyway just because, and the bloat continues.)
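
For what it's worth, a guarded version of the attribute loop might look something like the sketch below. It uses plain stdio instead of LineSFormat() so it stands on its own, and print_start_tag() is a made-up name, not anything from libxml. The xmlChar typedef is repeated here only so the snippet compiles without the libxml headers. The value is checked for NULL as well, on the assumption (not verified here) that an attribute without a value could show up that way.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumption: stand-in for libxml's typedef so this compiles alone. */
typedef unsigned char xmlChar;

/*--------------------------------------
; Hypothetical helper: print a start tag
; and return the number of attributes
; seen.  A NULL attr is treated the same
; as an empty attribute list.
;--------------------------------------*/

size_t print_start_tag(FILE *out,const xmlChar *name,const xmlChar **attr)
{
  size_t count = 0;
  size_t i;

  fprintf(out,"<%s",(const char *)name);

  if (attr != NULL)
  {
    /* even slots are names, odd slots are values */
    for (i = 0 ; attr[i] != NULL ; i += 2)
    {
      if (attr[i+1] != NULL)
        fprintf(out," %s=\"%s\"",(const char *)attr[i],(const char *)attr[i+1]);
      else
        fprintf(out," %s",(const char *)attr[i]);
      count++;
    }
  }

  fprintf(out,">");
  return count;
}
```

Called with a NULL attr it just prints the bare tag and returns 0, which is all the extra check really buys.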

The second thing. Catching the errors. Yeah. The callbacks for those?

void sax_error(void *data,const char *msg, ... );

The errors (and warnings, and fatal errors) are passed back as a printf() style message.

So forget about intelligently handling the errors unless you want to parse the actual error messages.

Aaaaaaaarg!
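
About the best a callback with that signature can do is expand the message into a buffer and hang on to it. Here is a minimal sketch of that, using only the standard C library. The struct error_log is hypothetical (nothing in libxml defines it), and the sketch assumes the first argument is whatever user data pointer was handed to the parser.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical user-data structure, not part of libxml. */
struct error_log
{
  int  count;      /* number of messages received so far */
  char last[256];  /* text of the most recent message    */
};

/*--------------------------------------
; Expand the printf() style arguments
; into the log and bump the counter.
; vsnprintf() truncates anything that
; won't fit in the buffer.
;--------------------------------------*/

void sax_error(void *data,const char *msg, ... )
{
  struct error_log *errs = data;
  va_list           args;

  va_start(args,msg);
  vsnprintf(errs->last,sizeof(errs->last),msg,args);
  va_end(args);
  errs->count++;
}
```

That gets the text and a count of how many errors turned up, but anything smarter still means picking apart the message string.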

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.