Tuesday, February 03, 2015
And I still haven't found what I'm looking for
If I have any text processing to do, I pretty much gravitate towards using LPeg. Sure, it might take a bit longer to generate code to parse some text, but it tends to be less “write only” than regular expressions.
Besides, you can do some pretty cool things with it. I have some LPeg code that will parse the following strftime() format string:
%A, %d %B %Y @ %H:%M:%S
and generate LPeg code that will parse:
Tuesday, 03 February 2015 @ 20:59:51
into:
date = { min = 57.000000, wday = 4.000000, day = 4.000000, month = 2.000000, sec = 16.000000, hour = 20.000000, year = 2015.000000, }
Or, if I set my locale correctly, I can turn this:
maŋŋebarga, 03 guovvamánu 2015 @ 21:00:21
into:
date = { min = 0,000000, wday = 3,000000, day = 3,000000, month = 2,000000, sec = 21,000000, hour = 21,000000, year = 2015,000000, }
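To give a rough idea of how a format string can be turned into a parser, here is a minimal sketch. It is not the code mentioned above; it only assumes that each directive maps to a sub-pattern. The names `num`, `directives`, and `compile` are made up for this example, only a few numeric directives are handled, and locale-dependent ones such as %A and %B are left out entirely.

```lua
local lpeg = require "lpeg"
local P, R, C, Cc, Cg, Ct, Cf =
  lpeg.P, lpeg.R, lpeg.C, lpeg.Cc, lpeg.Cg, lpeg.Ct, lpeg.Cf

local digit = R"09"

-- Hypothetical helper: exactly n digits, converted to a number and
-- stored under `field` when matched inside a table capture.
local function num(n, field)
  local p = P(true)
  for _ = 1, n do p = p * digit end
  return Cg(p / tonumber, field)
end

-- Hypothetical directive table; only a few numeric directives are covered.
local directives = {
  Y = num(4, "year"), m = num(2, "month"), d = num(2, "day"),
  H = num(2, "hour"), M = num(2, "min"),   S = num(2, "sec"),
}

-- Parse the format string itself with LPeg: each piece is turned into a
-- pattern, and the fold sequences the pieces together with `*`.
local directive = P"%" * (C(1) / function(c)
  return directives[c] or error("unsupported directive %" .. c)
end)
local literal = C(1) / function(c) return P(c) end
local format  = Cf(Cc(P(true)) * (directive + literal)^0,
                   function(acc, piece) return acc * piece end)

-- Build a parser that collects the named fields into a table and
-- insists on consuming the whole input.
local function compile(fmt)
  return Ct(format:match(fmt)) * -1
end

local parser = compile("%Y-%m-%d @ %H:%M:%S")
local date   = parser:match("2015-02-03 @ 20:59:51")
print(date.year, date.month, date.day, date.hour, date.min, date.sec)
```

Supporting %A or %B would presumably mean building alternatives from the locale's day and month names, which is where the locale handling shown above comes in.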
But one annoyance that hits from time to time—named captures require a constant name. For instance, this pattern:
pattern = lpeg.Ct( lpeg.Cg(lpeg.P "A"^1,"class_a") * lpeg.P":" * lpeg.Cg(lpeg.P "B"^1,"class_b") )
(translated: when matching a string like AAAA:BBB, return a Lua table (lpeg.Ct()) with the As (lpeg.P()) in field class_a (lpeg.Cg()) and the Bs in field class_b)
applied to this string:
AAAA:BBB
returns this table:
{ class_a = "AAAA", class_b = "BBB" }
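For anyone who wants to try this, a small self-contained version follows. It assumes LPeg is installed and loadable as `lpeg`. The point to notice is that the second argument to lpeg.Cg() is an ordinary Lua value supplied when the pattern is built, before any input has been seen.

```lua
local lpeg = require "lpeg"

-- The group names are fixed here, at pattern-construction time.
local pattern = lpeg.Ct(
    lpeg.Cg(lpeg.P"A"^1, "class_a")
  * lpeg.P":"
  * lpeg.Cg(lpeg.P"B"^1, "class_b")
)

local t = pattern:match("AAAA:BBB")
print(t.class_a, t.class_b)

-- A string without the colon and the Bs does not match at all.
local miss = pattern:match("AAAA")
print(miss)
```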
The field names are constant: class_a and class_b. I'd like a field name based on the input. Now, there is a function lpeg.Cb() that is described as:
Creates a back capture. This pattern matches the empty string and produces the values produced by the most recent group capture named name.
Most recent means the last complete outermost group capture with the given name. A Complete capture means that the entire pattern corresponding to the capture has matched. An Outermost capture means that the capture is not inside another complete capture.
LPeg - Parsing Expression Grammars For Lua
A quick reading (and I'm guilty of this) leads me to think this:
pattern = lpeg.Cg(lpeg.P"A"^1,"name") * lpeg.P":" * lpeg.Ct(lpeg.P "B"^1,lpeg.Cb("name"))
applied to the string:
AAAA:BBB
returns
{ AAAA = "BBB" }
But sadly, no. The only example of lpeg.Cb(), used to parse Lua long strings (which start with a “[”, zero or more “=”, another “[”, then text, ended with a “]”, zero or more “=” (but the number of “=” must equal the number of “=” between the two “[”) and a final “]”):
equals = lpeg.P"="^0
open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
close = "]" * lpeg.C(equals) * "]"
closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1
shows that lpeg.Cb() was designed with this particular use case in mind: matching one pattern with the same pattern later on, and not what I want.
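A runnable version of the long-string matcher makes the behaviour easier to poke at. The sketch below uses local names (`longstring` instead of `string`, so the standard library isn't shadowed) and adds a second, made-up pattern named `back`. That one shows that lpeg.Cb() hands back the captured text as a value rather than turning it into a key.

```lua
local lpeg = require "lpeg"

local equals  = lpeg.P"="^0
local open    = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
local close   = "]" * lpeg.C(equals) * "]"
-- Succeeds only when the closing run of "=" equals the opening run.
local closeeq = lpeg.Cmt(close * lpeg.Cb("init"),
                         function (s, i, a, b) return a == b end)
local longstring = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

local body = longstring:match("[==[hello]]world]==]")
print(body)                        -- the "]]" inside does not end the string
local bad = longstring:match("[=[x]]")
print(bad)                         -- mismatched "=" counts: no match

-- Illustrative only: the back capture yields the text of the group
-- as an extra value after the capture of the Bs.
local back = lpeg.Cg(lpeg.P"A"^1, "name") * lpeg.P":"
           * lpeg.C(lpeg.P"B"^1) * lpeg.Cb("name")
local bs, as = back:match("AAAA:BBB")
print(bs, as)
```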
I can do what I want (a field name based upon the input) but the way to go about it is very klunky (in my opinion):
pattern = lpeg.Cf(
  lpeg.Ct("") * lpeg.Cg( lpeg.C(lpeg.P"A"^1) * lpeg.P":" * lpeg.C(lpeg.P"B"^1) ),
  function(acc,name,value)
    acc[name] = value
    return acc
  end
)
This is a “folding capture” (lpeg.Cf()) where we are accumulating our results (even though it's only one result, we have to do it this way) in a table (lpeg.Ct()), where each “value” is a group (lpeg.Cg(), where the name is optional) consisting of a capture (lpeg.C()) of As (lpeg.P()), followed by a colon (ignored), followed by a capture of Bs. All of these (except for the colon, which is ignored) are passed to a function that assigns the string of Bs to a field named after the string of As.
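The fold can be run as-is. The sketch below also adds a variant, `list`, that accepts several comma-separated pairs; the comma separator and the names `pair`, `set`, and `list` are assumptions made for this example only. With more than one pair, the fold function gets called once per group.

```lua
local lpeg = require "lpeg"

-- One anonymous group producing two values: the As and the Bs.
local pair = lpeg.Cg(lpeg.C(lpeg.P"A"^1) * lpeg.P":" * lpeg.C(lpeg.P"B"^1))

-- Fold function: receives the accumulator plus the values of one group.
local function set(acc, name, value)
  acc[name] = value
  return acc
end

-- Same shape as the pattern above: an empty table, then one pair.
local pattern = lpeg.Cf(lpeg.Ct("") * pair, set)
local t = pattern:match("AAAA:BBB")
print(t.AAAA)

-- Hypothetical extension: any number of pairs separated by commas.
local list = lpeg.Cf(lpeg.Ct("") * pair * (lpeg.P"," * pair)^0, set)
local u = list:match("A:B,AA:BBB")
print(u.A, u.AA)
```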
It gets even messier when you mix fixed field names with ones based upon the input. If all the field names are defined, it's easy to do something like:
local P, Cg = lpeg.P, lpeg.Cg -- shorthand for the constructors used below

eoln = P"\n" -- match end of line
text = (P(1) - eoln) -- match any single character except an end of line
pattern = lpeg.Ct(
    P"field_one: " * Cg(text^0,"field_one") * eoln
  * P"field_two: " * Cg(text^0,"field_two") * eoln
  * P"field_three: " * Cg(text^0,"field_three") * eoln
)
against data like this:
field_one: Lorem ipsum dolor sit amet
field_two: consectetur adipiscing elit
field_three: Cras aliquet enim elit
to get this:
{ field_one = "Lorem ipsum dolor sit amet", field_two = "consectetur adipiscing elit", field_three = "Cras aliquet enim elit" }
But if we have some defined fields, but want to accept non-defined field names, then … well … yeah … I haven't found a nice way of doing it. And I find it annoying that I haven't found what I'm looking for.
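To show what the mixed case ends up looking like, here is one possible sketch. It is a workaround, not a nice answer. Because a named lpeg.Cg() produces no value for a fold, the defined fields have to supply their own names through lpeg.Cc(), and everything goes through the folding capture. The `count` field and its numeric rule are invented for this example, as are the names `known`, `other`, and `record`.

```lua
local lpeg = require "lpeg"
local P, R, S, C, Cc, Cg, Ct, Cf =
  lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Cc, lpeg.Cg, lpeg.Ct, lpeg.Cf

local eoln = P"\n"
local text = C((P(1) - eoln)^0)       -- rest of the line, captured
local name = C((P(1) - S":\n")^1)     -- any field name up to the colon

-- Defined fields: the name comes from a constant capture, and each
-- field can have its own rule for the value.
local known = Cg(Cc"field_one" * P"field_one: " * text) * eoln
            + Cg(Cc"count" * P"count: " * (R"09"^1 / tonumber)) * eoln

-- Anything else: the name is taken from the input.
local other = Cg(name * P": " * text) * eoln

local record = Cf(Ct("") * (known + other)^1,
                  function(acc, key, value)
                    acc[key] = value
                    return acc
                  end) * -1

local data = "field_one: Lorem ipsum dolor sit amet\n"
          .. "count: 3\n"
          .. "field_x: Cras aliquet enim elit\n"
local t = record:match(data)
print(t.field_one, t.count, t.field_x)
```

Every defined field now repeats its name twice, once for the constant capture and once for the literal text, which is exactly the kind of clutter complained about above.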