The Boston Diaries

Thursday, January 31, 2013

Parsers vs. regular expressions? No contest

I'm finding that where I once turned to Lua's regular expressions for parsing, I am now turning to LPeg, or rather, the re module, as I find the code easier to understand once written.

For instance, the regression test program I wrote for work outputs the results of each test:

1.a.8 0-0 16-17 scp:  ASREQ (1) ASRESP (1) LNPHIT (1) SS7COMP (1) SS7XACT (1) tps:  ack-cmpl (1) cache-searches (1) cache-updates (1) termreq (1)

Briefly, the first three fields are the test case ID and two indicators of whether certain data files changed. The scp field indicates which variables of the SCP (you can think of this as a service on a phone switch) were modified (these just happen to be in uppercase), and the tps field indicates which TPS (our lead developer does have a sense of humor) were modified. But if a variable is added (or removed; it happens), the order can change, which makes checking the results against the expected results a bit of a challenge.

The result is some code to parse the output and check against the expected results. And for that, I find using the re module for parsing:

local re   = require "re"

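-- Grammar for one line of test output; each {:name: ... :} becomes a field in the result table.
local G = [[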
line		<- entry -> {}
entry		<- {:id: id :} 			 %s
		   {:seriala: serial :}	         %s
		   {:serialb: serial :}	         %s
		   'scp:' {:scp: items* -> {} :} %s
		   'tps:' {:tps: items* -> {} :}
id		<- %d+ '.' [a-z] '.' %d+
serial		<- %d+ '-' %d+
items		<- %s* { ([0-9A-Za-z] / '-')+ %s '(' %d+ ')' }

]]

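-- Compile once; parser:match(line) returns the result table, or nil if the line does not match.
local parser = re.compile(G)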

to be more understandable than using Lua-based regular expressions:

function parse(line)
  local res = {}
  
  local id,seriala,serialb,tscp,ttps = line:match("^(%S+)%s+(%S+)%s+(%S+)%s+scp%:%s+(.*)tps%:%s+(.*)")
  
  res.id      = id
  res.seriala = seriala
  res.serialb = serialb
  
  res.scp = {}
  res.tps = {}
  
  for item in tscp:gmatch("%s*(%S+%s%(%d+%))%s*") do
    res.scp[#res.scp + 1] = item
  end
  
  for item in ttps:gmatch("%s*(%S+%s%(%d+%))%s*") do
    res.tps[#res.tps + 1] = item
  end
  return res
end

with both returning the same results:

{
  scp =
  {
    [1] = "ASREQ (1)",
    [2] = "ASRESP (1)",
    [3] = "LNPHIT (1)",
    [4] = "SS7COMP (1)",
    [5] = "SS7XACT (1)",
  },
  id = "1.a.8",
  tps =
  {
    [1] = "ack-cmpl (1)",
    [2] = "cache-searches (1)",
    [3] = "cache-updates (1)",
    [4] = "termreq (1)",
  },
  serialb = "16-17",
  seriala = "0-0",
}
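Once a line is in this form, the ordering problem mentioned earlier can be handled by comparing the scp and tps lists as unordered collections. The following is only a rough sketch of that idea, not the code from the actual regression test; the helper names `same_items` and `same_result` are made up for illustration, and the sample tables are shortened versions of the result above.

```lua
-- Hypothetical helper: true if two arrays hold the same strings,
-- regardless of order (duplicates are counted).
local function same_items(a, b)
  if #a ~= #b then return false end
  local seen = {}
  for _, item in ipairs(a) do
    seen[item] = (seen[item] or 0) + 1
  end
  for _, item in ipairs(b) do
    local n = seen[item]
    if not n then return false end
    seen[item] = n > 1 and n - 1 or nil
  end
  return next(seen) == nil
end

-- Hypothetical helper: compare two parsed results field by field.
local function same_result(got, expected)
  return got.id      == expected.id
     and got.seriala == expected.seriala
     and got.serialb == expected.serialb
     and same_items(got.scp, expected.scp)
     and same_items(got.tps, expected.tps)
end

local expected = {
  id = "1.a.8", seriala = "0-0", serialb = "16-17",
  scp = { "ASREQ (1)", "ASRESP (1)" },
  tps = { "ack-cmpl (1)", "termreq (1)" },
}
local got = {
  id = "1.a.8", seriala = "0-0", serialb = "16-17",
  scp = { "ASRESP (1)", "ASREQ (1)" },   -- same entries, different order
  tps = { "ack-cmpl (1)", "termreq (1)" },
}

print(same_result(got, expected))   --> true
```

Counting occurrences instead of sorting leaves the parsed tables untouched, which may matter if they are printed later for a failure report.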

Personally, I find regular expressions to be an incomprehensible mess of random punctuation and letters, whereas the re module at least lets me label the parts of the text I'm parsing. I also find it easier to see what is happening six months later if I have to revisit the code.
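To make the labeling point concrete, here is a small, self-contained sketch that splits a single item such as "cache-searches (1)" into labeled parts. It assumes LPeg (which ships the re module) is installed; the rule and field names `item`, `name`, and `count` are invented for this example and are not part of the grammar shown earlier.

```lua
local re = require "re"   -- the re module ships with LPeg

-- Each {:label: ... :} names a piece of the match; -> {} gathers
-- the labeled pieces into a table.
local item = re.compile[[
  item  <- ({:name: name :} %s '(' {:count: %d+ :} ')') -> {}
  name  <- ([0-9A-Za-z] / '-')+
]]

local t = item:match("cache-searches (1)")
print(t.name, t.count)    -- count is captured as a string, not a number
```

The equivalent Lua pattern, "(%S+)%s%((%d+)%)", returns the same two pieces, but only positionally; nothing in the pattern says which capture is which.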

Even more importantly, this is a real parser. Would you rather debug a regular expression for just validating an email address or a grammar that validates all defined email headers (email address validation starts at line 464)?
