The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Thursday, January 31, 2013

Parsers vs. regular expressions? No contest

I'm finding that where once in Lua to its regular expressions for parsing, I am now turning to LPeg—or rather, the re module, as I find it easier to understand the code once written.

For instance, the regression test program I wrote for work outputs the results of each test:

1.a.8 0-0 16-17 scp:  ASREQ (1) ASRESP (1) LNPHIT (1) SS7COMP (1) SS7XACT (1) tps:  ack-cmpl (1) cache-searches (1) cache-updates (1) termreq (1)

Briefly, the first three fields are the test case ID, and indications if certain data files changed. The scp field indicates which variables of the SCP (you can think of this as a service on a phone switch) were modified (these just happen to be in uppercase) and then the tps field indicates which TPS (our lead developer does have a sense of humor) were modified. But if a variable is added (or removed—it happens), the order can change and it makes checking the results against the expected results a bit of a challenge.

The result is some code to parse the output and check against the expected results. And for that, I find using the re module for parsing:

local re   = require "re"

G = [[
line		<- entry -> {}
entry		<- {:id: id :} 			 %s
		   {:seriala: serial :}	         %s
		   {:serialb: serial :}	         %s
		   'scp:' {:scp: items* -> {} :} %s
		   'tps:' {:tps: items* -> {} :}
id		<- %d+ '.' [a-z] '.' %d+
serial		<- %d+ '-' %d+
items		<- %s* { ([0-9A-Za-z] / '-')+ %s '(' %d+ ')' }

]]

parser = re.compile(G)

to be more understandable than using Lua-based regular expressions:

function parse(line)
  local res = {}
  
  local id,seriala,serialb,tscp,ttps = line:match("^(%S+)%s+(%S+)%s+(%S+)%s+scp%:%s+(.*)tps%:%s+(.*)")
  
  res.id      = id
  res.seriala = seriala
  res.serialb = serialb
  
  res.scp = {}
  res.tps = {}
  
  for item in tscp:gmatch("%s*(%S+%s%(%d+%))%s*") do
    res.scp[#res.scp + 1] = item
  end
  
  for item in ttps:gmatch("%s*(%S+%s%(%d+%))%s*") do
    res.tps[#res.tps + 1] = item
  end
  return res
end

with both returning the same results:

{
  scp =
  {
    [1] = "ASREQ (1)",
    [2] = "ASRESP (1)",
    [3] = "LNPHIT (1)",
    [4] = "SS7COMP (1)",
    [5] = "SS7XACT (1)",
  },
  id = "1.a.8",
  tps =
  {
    [1] = "ack-cmpl (1)",
    [2] = "cache-searches (1)",
    [3] = "cache-updates (1)",
    [4] = "termreq (1)",
  },
  serialb = "16-17",
  seriala = "0-0",
}

Personally, I find regular expressions to be an incomprehensible mess of random punctuation and letters, whereas the re module at least lets me label the parts of the text I'm parsing. I also find it easier to see what is happening six months later if I have to revisit the code.

Even more importantly, this is a real parser. Would you ranther debug a regular expression for just validating an email address or a grammar that validates all defined email headers (email address validation starts at line 464)?

Obligatory Picture

[It's the most wonderful time of the year!]

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2019 by Sean Conner. All Rights Reserved.