Thursday, January 31, 2013
Parsers vs. regular expressions? No contest
I'm finding that where once I turned to Lua's regular expressions for
parsing, I am now turning to LPeg, or rather, the re module,
as I find it easier to understand the code once written.
For instance, the regression test program I wrote for work outputs the results of each test:
1.a.8 0-0 16-17 scp: ASREQ (1) ASRESP (1) LNPHIT (1) SS7COMP (1) SS7XACT (1) tps: ack-cmpl (1) cache-searches (1) cache-updates (1) termreq (1)
Briefly, the first three fields are the test case ID and indications of
whether certain data files changed. The scp field indicates which
variables of the SCP (you can think of this as a service on a phone
switch) were modified (these just happen to be in uppercase), and then
the tps field indicates which TPS (our lead developer does have a sense
of humor) were modified. But if a variable is added (or removed; it
happens), the order can change, and that makes checking the results
against the expected results a bit of a challenge.
The result is some code to parse the output and check it against the
expected results. And for that, I find using the re module for
parsing:
local re = require "re"

G = [[
line    <- entry -> {}
entry   <- {:id: id :}          %s
           {:seriala: serial :} %s
           {:serialb: serial :} %s
           'scp:' {:scp: items* -> {} :} %s
           'tps:' {:tps: items* -> {} :}
id      <- %d+ '.' [a-z] '.' %d+
serial  <- %d+ '-' %d+
items   <- %s* { ([0-9A-Za-z] / '-')+ %s '(' %d+ ')' }
]]

parser = re.compile(G)
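As a sketch of how the compiled grammar gets used (assuming LPeg, which provides the re module, is installed): parsing a line is a single call to match, and the named captures come back as table fields.

```lua
-- A minimal usage sketch, assuming LPeg (which provides the re module)
-- is installed.  The grammar is the one shown above.
local re = require "re"

local parser = re.compile [[
line    <- entry -> {}
entry   <- {:id: id :} %s {:seriala: serial :} %s {:serialb: serial :} %s
           'scp:' {:scp: items* -> {} :} %s
           'tps:' {:tps: items* -> {} :}
id      <- %d+ '.' [a-z] '.' %d+
serial  <- %d+ '-' %d+
items   <- %s* { ([0-9A-Za-z] / '-')+ %s '(' %d+ ')' }
]]

-- a shortened version of the test output shown earlier
local res = parser:match("1.a.8 0-0 16-17 scp: ASREQ (1) ASRESP (1) tps: termreq (1)")

print(res.id)        -- 1.a.8
print(res.scp[2])    -- ASRESP (1)
print(res.tps[1])    -- termreq (1)
```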
to be more understandable than using Lua-based regular expressions:
function parse(line)
  local res = {}
  local id,seriala,serialb,tscp,ttps = line:match("^(%S+)%s+(%S+)%s+(%S+)%s+scp%:%s+(.*)tps%:%s+(.*)")

  res.id      = id
  res.seriala = seriala
  res.serialb = serialb
  res.scp     = {}
  res.tps     = {}

  for item in tscp:gmatch("%s*(%S+%s%(%d+%))%s*") do
    res.scp[#res.scp + 1] = item
  end

  for item in ttps:gmatch("%s*(%S+%s%(%d+%))%s*") do
    res.tps[#res.tps + 1] = item
  end

  return res
end
with both returning the same results:
{
  scp     =
  {
    [1] = "ASREQ (1)",
    [2] = "ASRESP (1)",
    [3] = "LNPHIT (1)",
    [4] = "SS7COMP (1)",
    [5] = "SS7XACT (1)",
  },
  id      = "1.a.8",
  tps     =
  {
    [1] = "ack-cmpl (1)",
    [2] = "cache-searches (1)",
    [3] = "cache-updates (1)",
    [4] = "termreq (1)",
  },
  serialb = "16-17",
  seriala = "0-0",
}
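Since the order of the variables can change between runs, the parsed lists are best compared as unordered collections rather than sequences. A sketch of such a comparison (the helper name same_items is mine, not from the actual test program):

```lua
-- Hypothetical helper: checks that two item lists contain the same
-- entries (with the same multiplicities), ignoring order.  Not from
-- the original test program.
local function same_items(a, b)
  if #a ~= #b then return false end

  -- count each entry in a, then consume the counts with b
  local count = {}
  for _,item in ipairs(a) do
    count[item] = (count[item] or 0) + 1
  end

  for _,item in ipairs(b) do
    if not count[item] then return false end
    count[item] = count[item] - 1
    if count[item] == 0 then count[item] = nil end
  end

  return true
end

print(same_items({ "ASREQ (1)" , "ASRESP (1)" },
                 { "ASRESP (1)" , "ASREQ (1)" }))  -- true
print(same_items({ "ASREQ (1)" }, { "ASRESP (1)" }))  -- false
```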
Personally, I find regular expressions to be an incomprehensible mess of
random punctuation and letters, whereas the re module at least lets me
label the parts of the text I'm parsing. I also find it easier to see
what is happening six months later if I have to revisit the code.
Even more importantly, this is a real parser. Would you rather debug a regular expression that just validates an email address, or a grammar that validates all defined email headers (email address validation starts at line 464)?