Wednesday, April 03, 2013
A good example of building parsing expressions on the fly
As part of the regression test, one of the ways I check the results is to scan through the log files, checking the output matches what we expect to happen. The components (and there a quite a few) log KPIs in the format:
stats|F0021: fooreq=123 fooabrts=0 fooerrs=1 foo-qs=2/14 foo-latency=1.2/4.2/15.2ms foo-bar-xlats=3
(Each log line from each component contains a unique identifier which
makes it easy to grep
for particular lines of interest in the
logs.)
In the past week, a major change was made in how the KPI are reported (used to be cumulative over the run of the program—now they're reset as they're logged) required a change in how my testing program processed the logs and I decided this was a good time as any to make a dramatic change in how that's done.
Prior, I used the built in Lua string patterns (think regular expressions) for parsing, but as happens, such code tends to be hard to read, and additional processing is required to massage the resulting values from strings to numbers. So I thought, why not use LPeg for my parsing needs?
-- The rest of the code assumes the following has been declared local lpeg = require "lpeg" local Ct = lpeg.Ct -- capture results into a table local Cc = lpeg.Cc -- matches "", return given value as capture local Cg = lpeg.Cg -- capture results and assign to given name local R = lpeg.R -- range capture local P = lpeg.P -- literal capture local S = lpeg.S -- character set capture
To parse a number, we look for at least one character in the range of “0” to “9” (inclusive), and pass the match to a function that will convert the string to a numeric value, which is returned as the capture:
local num = R"09"^1 / function(c) return tonumber(c) end
For non-integers, we can use the following bit of code—yes, this won't parse all floating point numbers, but we won't need much more than this for what's logged—an optional minus sign, followed by digits, and an optional decimal point plus more digits, which is also translated t a numeric value:
local fnum = (P"-"^-1 * num * (P"." * num)^-1) / function(c) return tonumber(c) end
We also have a few multi-value KPIs, some with two values, and some with three values. These are parsed with the values returned in a table with the values in conveniently named fields:
local lat = Ct( -- latency P"no-data" * Cg(Cc(0),'min') * Cg(Cc(0),'avg') * Cg(Cc(0),'max') * Cg(Cc(false),'valid') ) + Ct( Cg(fnum,'min') * P'/' * Cg(fnum,'avg') * P'/' * Cg(fnum,'max') * P'ms' * Cg(Cc(true),'valid') ) local cnt = Ct( -- counts P"no-data" * Cg(Cc(0),'avg') * Cg(Cc(0),'max') * Cg(Cc(false),'valid') ) + Ct( Cg(fnum,'avg') * P'/' * Cg(fnum,'max') * Cg(Cc(true),'valid') )
The only wrinkle here (and it's not much of a wrinkle) is that
foo-qs=no-data foo-latency=no-data
can be logged and we need to take that into account. If
no-data
is seen, the code returns the following table (if we're
parsing the three value KPI; the two value KPI returns something similar):
{ min = 0.0, avg = 0.0, max = 0.0, valid = false }
otherwise, the following will be returned:
{ min = 1.2, avg = 4.2, max = 15.2, valid = true }
To parse the actual KPIs, I could have done something like:
F0021 = Ct( P": " -- I have a reason for starting the parse at the ':' * P"fooreq=" * Cg(num,'foofeq') * P" " * P"fooabrts=" * Cg(num,'fooabrts') * P" " * P"fooerrs=" * Cg(num,'fooerrs') * P" " * P"foo-qs=" * Cg(cnt,'foo-qs') * P" " * P"foo-latency=" * Cg(lat,'foo-latency') * P" " * P"foo-bar-xlats=" * Cg(num,'foo-bar-xlats') )
But I didn't. Not because it's a lot of typing, but because the probability of errors is high because of the repetitious nature of it—I'm repeating the field names twice (look closely and you can see where I messed up). I'd rather not repeat myself, especially since I have multiple lines of KPIs to parse. Ideally, I'd like to specify the name of the field once, something like:
{ { "fooreq" , num } , -- might as well include { "fooabrts" , num } , -- the type of the KPI { "fooerrs" , num } , -- while I'm at it { "foo-qs" , cnt } , { "foo-latency" , lat } , { "foo-bar-xlats" , num } }
And, because LPeg is composable, I can get away with that:
local function mkmatch(list) local pattern = P": " for i = 1 , #list do pattern = pattern * P(list[i][1]) -- look for name of field * P"=" * Cg(list[i][2],list[i][1]) -- capture value as name * P" "^0 -- optional space at the end end return Ct(pattern) -- return all captures in a table end F0021 = mkmatch { { "fooreq" , num } , { "fooabrts" , num } , { "fooerrs" , num } , { "foo-qs" , cnt } , { "foo-latency" , lat } , { "foo-bar-xlats" , num } }
And there goes the repetition.
Now, why do I start parsing at the colon? Simple, I use the Lua string patterns to quickly check the log lines—that's fairly easy to see what's going on, and from there, I can call the appropriate LPeg pattern to further parse the KPIs.