The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, February 03, 2015

And I still haven't found what I'm looking for

If I have any text processing to do, I pretty much gravitate towards using LPeg. Sure, it might take a bit longer to generate code to parse some text, but it tends to be less “write only” than regular expressions.

Besides, you can do some pretty cool things with it. I have some LPeg code that will parse the following strftime() format string:

%A, %d %B %Y @ %H:%M:%S

and generate LPeg code that will parse:

Tuesday, 03 February 2015 @ 20:59:51

into:

date =
{
  min = 59.000000,
  wday = 3.000000,
  day = 3.000000,
  month = 2.000000,
  sec = 51.000000,
  hour = 20.000000,
  year = 2015.000000,
}

Or, if I set my locale correctly, I can turn this:

maŋŋebarga, 03 guovvamánu 2015 @ 21:00:21

into:

date =
{
  min = 0,000000,
  wday = 3,000000,
  day = 3,000000,
  month = 2,000000,
  sec = 21,000000,
  hour = 21,000000,
  year = 2015,000000,
}
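
The generator itself isn't shown here, but the idea can be sketched in a few dozen lines. What follows is not the actual code, just a minimal illustration under some assumptions: it handles only the seven directives used in the format above, it asks os.date() for the weekday and month names of whatever locale is active, and the names `names`, `directive`, `pieces` and `compile` are made up for this example.

```lua
local lpeg = require "lpeg"
local P, R, C, Cc, Cg, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Cc, lpeg.Cg, lpeg.Ct

local digit = R"09"
local two   = digit * digit
local four  = two * two

-- Ordered choice of the active locale's names for directive `dir`
-- ("%A" or "%B"). Each alternative yields its index, not the text.
-- 2015-02-01 was a Sunday, so days 1..7 of that month map to wday 1..7.
local function names(dir, field, count)
  local patt = P(false)
  for i = 1, count do
    local when = { year = 2015, month = 2, day = 1, hour = 12 }
    when[field] = i
    patt = patt + P(os.date(dir, os.time(when))) * Cc(i)
  end
  return patt
end

-- One LPeg pattern per supported strftime directive (a small subset).
local directive = {
  A = Cg(names("%A", "day",    7), "wday"),
  B = Cg(names("%B", "month", 12), "month"),
  d = Cg(two  / tonumber, "day"),
  Y = Cg(four / tonumber, "year"),
  H = Cg(two  / tonumber, "hour"),
  M = Cg(two  / tonumber, "min"),
  S = Cg(two  / tonumber, "sec"),
}

-- Parse the format string itself: each directive and each run of
-- literal text becomes an LPeg pattern, collected in order in a table.
local spec = P"%" * C(1) / function(c)
  return directive[c] or error("unsupported directive %" .. c)
end
local literal = C((P(1) - "%")^1) / function(s) return P(s) end
local pieces  = Ct((spec + literal)^0) * -1

-- Chain the pieces into a single pattern that returns a table.
local function compile(format)
  local patt = P(true)
  for _, piece in ipairs(assert(pieces:match(format), "bad format")) do
    patt = patt * piece
  end
  return Ct(patt) * -1
end

local parser = compile("%A, %d %B %Y @ %H:%M:%S")
local date   = parser:match("Tuesday, 03 February 2015 @ 20:59:51")
```

With the default "C" locale, `date` should hold the same fields as the first table above. Calling os.setlocale() before compile() would make `names` pick up that locale's weekday and month names instead.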

But one annoyance that hits from time to time—named captures require a constant name. For instance, this pattern:

pattern = lpeg.Ct(
               lpeg.Cg(lpeg.P "A"^1,"class_a")
             * lpeg.P":" 
             * lpeg.Cg(lpeg.P "B"^1,"class_b") 
           )

(translated: when matching a string like AAAA:BBB, return a Lua table (lpeg.Ct()) with the As (lpeg.P()) in field class_a (lpeg.Cg()) and the Bs in field class_b)

applied to this string:

AAAA:BBB

returns this table:

{
  class_a = "AAAA",
  class_b = "BBB"
}

The field names are constant—class_a and class_b. I'd like a field name based on the input. Now, there is a function lpeg.Cb() that is described as:

Creates a back capture. This pattern matches the empty string and produces the values produced by the most recent group capture named name.

Most recent means the last complete outermost group capture with the given name. A Complete capture means that the entire pattern corresponding to the capture has matched. An Outermost capture means that the capture is not inside another complete capture.

LPeg - Parsing Expression Grammars For Lua

A quick reading (and I'm guilty of this) leads me to think this:

pattern = lpeg.Cg(lpeg.P"A"^1,"name")
        * lpeg.P":"
        * lpeg.Ct(lpeg.P "B"^1,lpeg.Cb("name"))

applied to the string:

AAAA:BBB

returns

{
  AAAA = "BBB"
}

But sadly, no. The only example of lpeg.Cb() in the documentation parses Lua long strings (which start with a “[”, zero or more “=”, and another “[”, followed by the text, and end with a “]”, the same number of “=” that appeared between the two opening “[”, and a final “]”):

equals = lpeg.P"="^0
open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
close = "]" * lpeg.C(equals) * "]"
closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

shows that lpeg.Cb() was designed with this particular use case in mind: matching the text of one pattern against the same text later on. That is not what I want.
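
To see what lpeg.Cb() does on the AAAA:BBB example, here is a small sketch (an illustration only, and it assumes the back capture is placed inside the table capture next to a capture of the Bs). The back capture hands back the As as an ordinary value, so they land in the array part of the table rather than becoming a field name.

```lua
local lpeg = require "lpeg"

-- The named group remembers the run of As under "name" and produces
-- nothing by itself; lpeg.Cb("name") later reproduces that value.
local pattern = lpeg.Cg(lpeg.P"A"^1, "name")
              * lpeg.P":"
              * lpeg.Ct(lpeg.C(lpeg.P"B"^1) * lpeg.Cb("name"))

local result = pattern:match("AAAA:BBB")

-- Both captures are positional values: the Bs first, then the As.
print(result[1], result[2])
```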

I can do what I want (a field name based upon the input) but the way to go about it is very klunky (in my opinion):

pattern = lpeg.Cf(              
                  lpeg.Ct("")    
                * lpeg.Cg(
                       lpeg.C(lpeg.P"A"^1)
                     * lpeg.P":"
                     * lpeg.C(lpeg.P"B"^1)
                   )
                ,function(acc,name,value)
                   acc[name] = value
                   return acc
                 end
        )

This is a “folding capture” (lpeg.Cf()) that accumulates its results in a table (lpeg.Ct()); it has to be done this way even though there is only one result. Each “value” is a group (lpeg.Cg(), where the name is optional) consisting of a capture (lpeg.C()) of As (lpeg.P()), a colon (matched but not captured), and a capture of Bs. The two captures are passed, along with the table, to a function that assigns the string of Bs to a field named by the string of As.
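
The fold makes more sense once there is more than one pair to accumulate. As a sketch, assume the pairs are separated by commas (that separator, and the name `pair`, are made up for this example):

```lua
local lpeg = require "lpeg"

-- One pair: the As become the field name, the Bs become the value.
local pair = lpeg.Cg(
                 lpeg.C(lpeg.P"A"^1)
               * lpeg.P":"
               * lpeg.C(lpeg.P"B"^1)
             )

-- lpeg.Ct("") supplies the empty table the fold starts with; every
-- following group calls the function with (table, name, value).
local pattern = lpeg.Cf(
                    lpeg.Ct("") * pair * (lpeg.P"," * pair)^0,
                    function(acc, name, value)
                      acc[name] = value
                      return acc
                    end
                )

local result = pattern:match("AAAA:BBB,AA:B")
print(result.AAAA, result.AA)
```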

It gets even messier when you mix fixed field names with ones based upon the input. If all the field names are defined, it's easy to do something like:

local lpeg = require "lpeg"
local P, Cg = lpeg.P, lpeg.Cg

eoln = P"\n"		-- match end of line
text = (P(1) - eoln)^0	-- match everything up to the end of line

pattern = lpeg.Ct(
		  P"field_one: "   * Cg(text,"field_one")   * eoln
		* P"field_two: "   * Cg(text,"field_two")   * eoln
		* P"field_three: " * Cg(text,"field_three") * eoln
)

against data like this:

field_one: Lorem ipsum dolor sit amet
field_two: consectetur adipiscing elit
field_three: Cras aliquet enim elit

to get this:

{
  field_one = "Lorem ipsum dolor sit amet",
  field_two = "consectetur adipiscing elit",
  field_three = "Cras aliquet enim elit"
}

But if we have some defined fields, but want to accept non-defined field names, then … well … yeah … I haven't found a nice way of doing it. And I find it annoying that I haven't found what I'm looking for.
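
For what it's worth, one possible workaround is to push everything through the fold, with lpeg.Cc() supplying the constant names for the defined fields. This is only a sketch and arguably no nicer than the version above: the helper `fixed`, the pattern `other`, and the `extra_field` line in the sample data are all invented for illustration, and every line is assumed to end with a newline.

```lua
local lpeg = require "lpeg"
local P, C, Cc, Cf, Cg, Ct = lpeg.P, lpeg.C, lpeg.Cc, lpeg.Cf, lpeg.Cg, lpeg.Ct

local eoln = P"\n"              -- match end of line
local text = (P(1) - eoln)^0    -- match everything up to the end of line

-- A defined field: Cc() produces the constant name, C() the value.
-- A real parser could give each defined field its own value pattern.
local function fixed(name)
  return Cg(Cc(name) * P(name) * P": " * C(text)) * eoln
end

-- Any other "name: value" line; here the name comes from the input.
local other = Cg(C((P(1) - P":" - eoln)^1) * P": " * C(text)) * eoln

local line = fixed "field_one"
           + fixed "field_two"
           + fixed "field_three"
           + other

-- Fold every (name, value) group into one table.
local pattern = Cf(Ct("") * line^0, function(acc, name, value)
  acc[name] = value
  return acc
end)

local data = "field_one: Lorem ipsum dolor sit amet\n"
          .. "extra_field: Sed do eiusmod\n"
          .. "field_two: consectetur adipiscing elit\n"

local result = pattern:match(data)
print(result.field_one, result.extra_field, result.field_two)
```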

Copyright © 1999-2019 by Sean Conner. All Rights Reserved.