The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Wednesday, April 03, 2013

A good example of building parsing expressions on the fly

As part of the regression test, one of the ways I check the results is to scan through the log files, checking the output matches what we expect to happen. The components (and there a quite a few) log KPIs in the format:

stats|F0021: fooreq=123 fooabrts=0 fooerrs=1 foo-qs=2/14 foo-latency=1.2/4.2/15.2ms foo-bar-xlats=3

(Each log line from each component contains a unique identifier which makes it easy to grep for particular lines of interest in the logs.)

In the past week, a major change was made in how the KPI are reported (used to be cumulative over the run of the program—now they're reset as they're logged) required a change in how my testing program processed the logs and I decided this was a good time as any to make a dramatic change in how that's done.

Prior, I used the built in Lua string patterns (think regular expressions) for parsing, but as happens, such code tends to be hard to read, and additional processing is required to massage the resulting values from strings to numbers. So I thought, why not use LPeg for my parsing needs?

-- The rest of the code assumes the following has been declared
local lpeg = require "lpeg"

local Ct = lpeg.Ct -- capture results into a table
local Cc = lpeg.Cc -- matches "", return given value as capture
local Cg = lpeg.Cg -- capture results and assign to given name
local R  = lpeg.R  -- range capture
local P  = lpeg.P  -- literal capture
local S  = lpeg.S  -- character set capture

To parse a number, we look for at least one character in the range of “0” to “9” (inclusive), and pass the match to a function that will convert the string to a numeric value, which is returned as the capture:

local num = R"09"^1 / function(c) return tonumber(c) end

For non-integers, we can use the following bit of code—yes, this won't parse all floating point numbers, but we won't need much more than this for what's logged—an optional minus sign, followed by digits, and an optional decimal point plus more digits, which is also translated t a numeric value:

local fnum = (P"-"^-1 * num * (P"." * num)^-1) / function(c) return tonumber(c) end

We also have a few multi-value KPIs, some with two values, and some with three values. These are parsed with the values returned in a table with the values in conveniently named fields:

local lat  = Ct(                        -- latency
                  P"no-data"      *
                  Cg(Cc(0),'min') *
                  Cg(Cc(0),'avg') *
                  Cg(Cc(0),'max') *
                  Cg(Cc(false),'valid')
               )
             +
             Ct(
                 Cg(fnum,'min') * P'/'  *
                 Cg(fnum,'avg') * P'/'  *
                 Cg(fnum,'max') * P'ms' *
                 Cg(Cc(true),'valid')
               )

local cnt  = Ct(                        -- counts
                 P"no-data" *
                 Cg(Cc(0),'avg') *
                 Cg(Cc(0),'max') *
                 Cg(Cc(false),'valid')
               )
             +
             Ct(
                 Cg(fnum,'avg') * P'/' *
                 Cg(fnum,'max') *
                 Cg(Cc(true),'valid')
               )

The only wrinkle here (and it's not much of a wrinkle) is that

foo-qs=no-data foo-latency=no-data

can be logged and we need to take that into account. If no-data is seen, the code returns the following table (if we're parsing the three value KPI; the two value KPI returns something similar):

{
  min = 0.0,
  avg = 0.0,
  max = 0.0,
  valid = false
}

otherwise, the following will be returned:

{
  min = 1.2,
  avg = 4.2,
  max = 15.2,
  valid = true
}

To parse the actual KPIs, I could have done something like:

F0021 = Ct(
            P": " -- I have a reason for starting the parse at the ':'
          * P"fooreq="        * Cg(num,'foofeq')      * P" "
          * P"fooabrts="      * Cg(num,'fooabrts')    * P" "
          * P"fooerrs="       * Cg(num,'fooerrs')     * P" "
          * P"foo-qs="        * Cg(cnt,'foo-qs')      * P" "
          * P"foo-latency="   * Cg(lat,'foo-latency') * P" "
          * P"foo-bar-xlats=" * Cg(num,'foo-bar-xlats')
	)

But I didn't. Not because it's a lot of typing, but because the probability of errors is high because of the repetitious nature of it—I'm repeating the field names twice (look closely and you can see where I messed up). I'd rather not repeat myself, especially since I have multiple lines of KPIs to parse. Ideally, I'd like to specify the name of the field once, something like:

{
  { "fooreq"		, num } , -- might as well include
  { "fooabrts"		, num } , -- the type of the KPI
  { "fooerrs"		, num } , -- while I'm at it
  { "foo-qs"		, cnt } ,
  { "foo-latency"	, lat } ,
  { "foo-bar-xlats"	, num } 
}

And, because LPeg is composable, I can get away with that:

local function mkmatch(list)
  local pattern = P": "

  for i = 1 , #list do
    pattern = pattern
            * P(list[i][1])		-- look for name of field
            * P"="
            * Cg(list[i][2],list[i][1]) -- capture value as name
            * P" "^0			-- optional space at the end
  end

  return Ct(pattern) -- return all captures in a table
end

F0021 = mkmatch {
  { "fooreq"		, num } ,
  { "fooabrts"		, num } ,
  { "fooerrs"		, num } ,
  { "foo-qs"		, cnt } ,
  { "foo-latency"	, lat } ,
  { "foo-bar-xlats"	, num }
}

And there goes the repetition.

Now, why do I start parsing at the colon? Simple, I use the Lua string patterns to quickly check the log lines—that's fairly easy to see what's going on, and from there, I can call the appropriate LPeg pattern to further parse the KPIs.

Obligatory Picture

Trying to get into the festive mood this year

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer

No AI was used in the making of this site, unless otherwise noted.

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.