The Boston Diaries

Wednesday, Debtember 14, 2022

An annotated example of using LPeg to parse a string to generate LPeg to parse other strings

A message on the Lua email list was asking about the best way to parse MQTT topics, specifically, how to handle the multilevel wildcard character. I answered that LPeg would be good for this, and gave annotated source code to show how it works. I thought I might also post about it for better visibility.

So, here's the code:

local lpeg = require "lpeg"
local Cc   = lpeg.Cc
local Cf   = lpeg.Cf
local P    = lpeg.P
local R    = lpeg.R

local filter do
  local separator = P'/'
  local topic     = R("AZ","az","09")^1 * (#separator + P(-1))
  local single    = P'+'                * (#separator + P(-1))
  local multi     = (P"/#" + P'#')      * P(-1)
  

  local csep    = separator / function()  return separator end
  local ctopic  = topic     / function(c) return P(c) end
  local csingle = single    / function()  return topic^-1 end
  local cmulti  = multi
                / function() return (separator * topic)^0 * P(-1) end

  filter = (P"#" * P(-1))
         / function() return (separator^-1 * topic)^0 * P(-1) end
         + Cf(  
               (-P"/#" * (ctopic + csingle + csep))^0 * cmulti^-1 * Cc(P(-1)),
               function(a,r)
                 return a * r
               end
             ) * P(-1)
end

And now the annotations—code fragment first, then annnotation.

local lpeg = require "lpeg"
local Cc   = lpeg.Cc
local Cf   = lpeg.Cf
local P    = lpeg.P
local R    = lpeg.R

This loads the LPeg module into Lua. I also grab the functions I'll be using from the module into locals. I do this not for speed purposes (although it will be slightly faster) but to reduce code clutter—there will be less lpeg. littered about the code, and I find that easier to read personally. It's not required that this be done.

local filter do
  -- ...
end

filter will contain the resulting LPeg expression. I create a new scope since the variables I'll be declaring won't be used outside of the definition for filter and it seems cleaner to me to reduce variable visibility as much as possible. It will also mean that over time (if this is intended for code that runs for a long time) the local variables created in this scope will be reclaimed as garbage. It's just a stylistic choice I do for Lua.

local separator = P'/'

This defines an LPeg expression that matches a literal slash character. The P() function can do a bit more than match literal strings, but we'll be mostly using it for literal string matches, as well as matching the end of the input string.

local topic = R("AZ","az","09")^1 * (#separator + P(-1))

This expression will match a “topic.” I'm using R() to match a range of characters (in this case, letters and digits). The multiplication sign (okay, an asterisk, but it's used to designate multiplication in Lua) here is used as an “AND” clause—a topic is a range of characters “AND” something else. That “something else” is either a separator (and the “#” mark is used to look ahead in the input without consuming it) or (the plus sign is read as “OR”) end of the input string.

local single = P'+' * (#separator + P(-1))

This expression will match a plus sign, which is used to indicate a single topic wildcard character. And again, we're expecting this to be followed by a separator character or the end of the string.

local multi = (P"/#" + P'#') * P(-1)

The “#” charcter is a multiple topic wildcard character and it must appear at the end of the string. I check for both “/#” and “#” because of a way I process the input later on. I might have found a better way to deal with this, but for a “proof-of-concept” this is good enough for now.

Now we get to the mind bending bit of this—I'm writing LPeg to parse a “topic filter” and generate an LPeg expression that will see if a “topic name” matches the “topic filter.”

local csep    = separator / function() return separator end
local ctopic  = topic     / function(c) return P(c) end
local csingle = single    / function()  return topic^-1 end
local cmulti  = multi
              / function() return (separator * topic)^0 * P(-1) end

These four expressions all do similar things—they match an existing pattern and pass the matching text to a function which returns an LPeg expression. csep returns an expression that matches the separator; ctopic returns an expression that matches the literal topic just parsed; csingle returns an expression that matches an alphanumeric string that represents a topic; and finally cmulti returns an expression that matches the remaining input.

And the final bit of code:

filter = (P"#" * P(-1))
       / function() return (separator^-1 * topic)^0 * P(-1) end
       + Cf(  
             (-P"/#" * (ctopic + csingle + csep))^0 * cmulti^-1 * Cc(P(-1)),
             function(a,r)
               return a * r
             end
           ) * P(-1)

The first line just matches a single multiple topic character and returns a pattern that will match the input. If that doesn't match (remember, “+” is read as an “OR”) we do a folding capture (Cf())—the code parses through the “topic filter” and builds an LPeg expression using a folding capture that will parse “topic names“ per the filter. Each piece that does match and return a capture will be “accumulated” into a single expression, the “folding” being done by the anonymous function passed in. The -P("/#") bit there looks ahead in the input to make sure it isn't a multiple topic wildcard character at the end of the string, and if that is the case, then it will compile a match for a literal topic, a non-specified topic (which fulfills the “single wildcard character match”) or a separator (but as long as the separator isn't itself followed by a multiple topic wildcard character, which is why we peek forward into the input). If we get to a point in the input where we either hit the end of input, or a multiple topic wildcard character, we handle that and cap the LPeg expression we're building with checking for end of the input (all on lines 4–7).

The point of all this is to turn a string like “+/tennis/+/#” into the following LPeg expression:

topic * separator * P"tennis" * separator * topic * (separator * topic)^0 * P(-1)

which can then be used to match “topic names:”

local topics =
{
  "news/tennis",                    -- won't match
  "news/tennis/mcenroe",            -- will match
  "news/football/dolphins",         -- won't match
  "news/baseball/marlins",          -- won't match
  "sports/tennis/williams/ranking", -- will match
}

local the_topic = filter:match("+/tennis/+/#")
for _,topic in ipairs(topics) do
  if the_topic:match(topic) then
    report_on_it(topic)
  end
end

Yes, there is a learning curve (okay, maybe a cliff) to LPeg. But once you get used to it, it is quite powerful and allows you to transform data in ways that you can't with regular expressions. In fact, instead of returning the default value (which is one past the position of the match in the string, or nil if it failed to parse) I could have instead returned an array of topics (or nil if it failed to parse)—but I will leave such changes as an exercise to the reader.

There's also a bit about dollar signs further down in the MQTT document I linked to, but again, handling that is left as an exercise for the reader.