Sunday, September 10, 2006
Playing a SAX
There's a project that might start up at The Company involving lots of XML and C programming, so I've been poking around libxml. I'm thinking I might even want to use this for mod_blog to validate HTML (since libxml has an HTML parser, and about a quarter of the time I blow the coding on an entry and have to fix it).
One problem that crops up is the difficulty of getting errors as libxml is reading the document into memory. Sure, I can suck the HTML in with one call:
htmlDocPtr doc = htmlParseFile(filename,NULL);
(yes, it is that simple). But not seeing how to change the underlying error reporting mechanism (not that I looked all that hard), I decided to switch to the SAX interface for parsing. The SAX interface allows you to register functions to be called during portions of the HTML (or even XML) parsing. Yes, I can grab the errors as they happen, but now I have to resort to building the document in memory myself (more or less). But that's okay, since in theory, this will allow me to not only capture the errors, but filter the HTML as I see fit.
Two things popped right out at me.
First, the callback when a tag is found:
void startElement(void *user_data, const xmlChar *name, const xmlChar **attrs);
void endElement(void *user_data, const xmlChar *name);
In these callbacks, the name parameter is the name of the element. The attrs parameter contains the attributes for the start tag. The even indices in the array will be attribute names, the odd indices are the values, and the final index will contain a NULL.
Using the SAX Interface of LibXML (a tutorial)
Okay, seems simple enough. I write some code:
static void start_tag(void *data,const xmlChar *name,const xmlChar **attr)
{
  int i;

  /*--------------------------------------
  ; similar to printf() but functionally
  ; a bit better.
  ;
  ; And yes, this is how I format comments
  ; in C.
  ;--------------------------------------*/

  LineSFormat(StdoutStream,"$","<%a",name);
  for (i = 0 ; attr[i] != NULL ; i += 2)
  {
    LineSFormat(StdoutStream,"$ $"," %a=\"%b\"",attr[i],attr[i+1]);
  }
}
And the first time this code runs it crashes.
It seems that the documentation is a bit misleading: attr is only valid if there are attributes. Otherwise a NULL is passed in, which means you have to explicitly check attr for NULL!
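Here's a minimal sketch of what the guarded loop might look like, using plain fprintf() rather than LineSFormat() so it stands on its own. The xmlChar typedef is a stand-in so the sketch compiles without the libxml headers, print_start_tag() is a name made up for illustration, and treating a NULL value as a valueless attribute (think "disabled") is an assumption on my part rather than something the tutorial spells out.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for libxml's xmlChar so this compiles without libxml headers. */
typedef unsigned char xmlChar;

/*
 * Hypothetical helper: prints a start tag and returns how many
 * attributes it printed.  attrs may be NULL (no attributes at all);
 * otherwise it is name, value, name, value, ..., NULL.
 */
static size_t print_start_tag(FILE *out, const xmlChar *name, const xmlChar **attrs)
{
  size_t count = 0;
  size_t i;

  fprintf(out, "<%s", (const char *)name);

  /* The check the tutorial never mentions: no attributes means NULL. */
  if (attrs != NULL)
  {
    for (i = 0; attrs[i] != NULL; i += 2)
    {
      if (attrs[i + 1] != NULL)
        fprintf(out, " %s=\"%s\"", (const char *)attrs[i], (const char *)attrs[i + 1]);
      else
        fprintf(out, " %s", (const char *)attrs[i]);  /* assumed: valueless attribute */
      count++;
    }
  }

  fprintf(out, ">\n");
  return count;
}
```

Calling print_start_tag(stdout, name, NULL) just prints the bare tag and returns 0 instead of falling over.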
Aaaaah!
Would it have been that difficult for the authors of libxml to always pass in a valid attr, even if it's two elements long that both contain NULL? (I suppose most programmers would check anyway just because, and the bloat continues.)
The second thing. Catching the errors. Yeah. The callbacks for those?
void sax_error(void *data,const char *msg, ... );
The errors (and warnings, and fatal errors) are passed back as a printf()-style message.
So forget about intelligently handling the errors unless you want to parse the actual error messages.
Aaaaaaaarg!
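About the best one can do with a callback shaped like that is expand the message into a buffer and hang on to it. The following is only a sketch: struct error_log is a hypothetical holder of my own invention (not anything from libxml), and it assumes the data pointer handed to the callback is whatever you arranged to have passed in as user data.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical holder for the most recent message; not part of libxml. */
struct error_log
{
  char   text[256];  /* last message, truncated if it doesn't fit */
  size_t count;      /* how many times the callback has fired     */
};

/*
 * Same shape as the error callback above.  Assumes data points to a
 * struct error_log that the caller supplied as user data.
 */
static void sax_error(void *data, const char *msg, ...)
{
  struct error_log *errs = data;
  va_list           args;

  va_start(args, msg);
  vsnprintf(errs->text, sizeof(errs->text), msg, args);
  va_end(args);

  errs->count++;
}
```

This still only yields text, so anything smarter than counting or logging the errors means picking the message apart by hand.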