The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, Debtember 19, 2020

Details, details! It always comes down to the details

Back in July, I wrote an HTML parser using LPEG. I was a bit surprised to find the memory consumption higher than expected, but decided to let it slide for the moment. Then in October (which I did not blog about, sigh) I decided to try using a C version of PEG. It was a rather straightforward port of the code and an almost drop-in replacement for the LPEG version (switching to it required changing one line of code). And much to my delight, not only did it use less memory (about ⅛ of the memory) but it was also way faster (it ran in about 1/10 the time).

It's not small though. The PEG code itself is 50K in size, the resulting C code is 764K in size (yes, that's nearly ¾ of a megabyte of source code), and the resulting compiled code is 607K in size. And with all that, it still runs with less memory than the LPEG version.

And all was fine.

Until today.

I've upgraded from Lua 5.3 (5.3.6 to be precise) to Lua 5.4 (5.4.2 to be precise). Lua 5.4 was released earlier this year, and I held off for a few months to let things settle before upgrading (and potentially updating all my code). Earlier this week I did the upgrade and proceeded to check that my code compiled and ran under the new version. All of it did, except for my new HTML parser, which caused Lua 5.4 to segfault.

With some help from the mailing list, I found the issue. I basically ignored this bit from the Lua manual:

So, while using a buffer, you cannot assume that you know where the top of the stack is. You can use the stack between successive calls to buffer operations as long as that use is balanced; that is, when you call a buffer operation, the stack is at the same level it was immediately after the previous buffer operation. (The only exception to this rule is luaL_addvalue.)

Lua 5.4 Reference Manual

Oops. The original code was:

  lua_getfield(yy->L,lua_upvalueindex(UPV_ENTITY),label);
  entity = lua_tolstring(yy->L,-1,&len);
  luaL_addlstring(&yy->buf,entity,len);
  lua_pop(yy->L,1);

Even though it violated the manual, it worked fine through Lua 5.3. To fix it:

  lua_getfield(yy->L,lua_upvalueindex(UPV_ENTITY),label);
  luaL_addvalue(&yy->buf);

That works.

(The code itself takes a string like “CounterClockwiseContourIntegral” and converts it to the UTF-8 character “∳” using an existing conversion table.)
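For the curious, the lookup half of that conversion can be sketched in plain C. This is not the parser's actual code (which pulls the table from a Lua upvalue); the struct, the function name entity_lookup and the five-entry table are all made up for illustration. Only the “CounterClockwiseContourIntegral” to “∳” (U+2233) mapping comes from the example above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the real conversion table:
   entity name -> UTF-8 bytes. Entries must stay sorted by
   name in strcmp() order, because bsearch() depends on it. */
struct entity
{
  const char *name;
  const char *utf8;
};

static const struct entity entities[] =
{
  { "CounterClockwiseContourIntegral" , "\xE2\x88\xB3" }, /* U+2233 */
  { "amp"                             , "&"            },
  { "copy"                            , "\xC2\xA9"     }, /* U+00A9 */
  { "gt"                              , ">"            },
  { "lt"                              , "<"            },
};

/* bsearch() callback: compare the key (a C string) to an entry's name */
static int entity_cmp(const void *key,const void *elem)
{
  const struct entity *e = elem;
  return strcmp(key,e->name);
}

/* Return the UTF-8 text for an entity name, or NULL if it is unknown */
const char *entity_lookup(const char *name)
{
  const struct entity *e = bsearch(
          name,
          entities,
          sizeof(entities) / sizeof(entities[0]),
          sizeof(entities[0]),
          entity_cmp
  );
  return e != NULL ? e->utf8 : NULL;
}
```

Calling entity_lookup("CounterClockwiseContourIntegral") hands back the three bytes E2 88 B3, and an unknown name hands back NULL, leaving the caller to decide what to do with it.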

What I find funny is that I participated in a very similar thread three years ago!

Anyway, the code now works, and I'm continuing on the conversion process.


LPEG vs. PEG—they both have their strengths and weaknesses

While the C PEG library is faster and uses less memory than LPEG, I still prefer using LPEG, because it's so much easier to use than the C PEG library. Yes, there's a learning curve to using LPEG, but its re module uses a similar syntax to the C PEG library, and it's easier to read and write when starting out. Another difference is that LPEG practically requires all the input to parse as a single string. The C PEG can do that, but it can also read data from a file. (You can stream data to LPEG, but it involves more work. Check out the difference between a JSON parser that takes the entire input as a string versus a JSON parser that can stream data; the latter is nearly twice the size of the former.)

The code isn't that much different. Here's a simple LPEG parser that will parse text like “34.12.1.444” (a silly but simple example):

local re   = require "re"

return re.compile(
  [[
    tumbler <- number ( '.' number)*
    number  <- [0-9]+ -> pnum
  ]],
  {
    pnum = function(c) print(">>> " .. c) end,
  }
)

Not bad. And here's the C PEG version:

tumbler <- number ('.' number)*
number	<- < [0-9]+ > { printf(">>> %*s\n",yyleng,yytext); }

Again, not terrible and similar to the LPEG version.
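To make it concrete what both grammars accept, here's a hand-written C sketch of the same two rules. It is not the output of either tool, and the names match_number and match_tumbler are mine. One difference worth noting: this sketch prints each number as soon as it matches, which is fine for a toy but is not necessarily when a generated parser would run its actions.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* number <- [0-9]+
   Returns how many bytes matched, or 0 if there is no digit. */
static size_t match_number(const char *s)
{
  size_t n = 0;
  while(isdigit((unsigned char)s[n]))
    n++;
  return n;
}

/* tumbler <- number ('.' number)*
   Prints each number and returns how many bytes of input were
   consumed, or 0 if the input does not start with a tumbler. */
size_t match_tumbler(const char *s)
{
  size_t pos = match_number(s);

  if (pos == 0)
    return 0;
  printf(">>> %.*s\n",(int)pos,s);

  while(s[pos] == '.')
  {
    size_t len = match_number(s + pos + 1);

    /* a '.' with no number after it is not part of the tumbler,
       so stop without consuming it */
    if (len == 0)
      break;
    printf(">>> %.*s\n",(int)len,s + pos + 1);
    pos += 1 + len;
  }

  return pos;
}
```

Given “34.12.1.444” it prints four lines (34, 12, 1 and 444) and reports 11 bytes consumed. Returning the count instead of a plain yes or no lets a caller carry on parsing whatever follows the tumbler.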

The major difference between the two, however, is in their use. In the LPEG version, tumbler can be used in other LPEG expressions. If I needed to parse something like “34.12.1.444:text/plain; charset=utf-8”, I can do that:

local re = require "re"

return re.compile(
  [[
    example <- %tumbler SP* ':' SP* %mimetype
    SP      <- ' ' / '\t'
  ]],
  {
    tumbler  = require "tumbler",
    mimetype = require "org.conman.parsers.mimetype",
  }
)

The same cannot be said for the C PEG version. It's just not written to support such use. If I need to parse text like “34.12.1.444” and mimetypes, then I have to modify the parser to support it all; there's no easy way to combine different parsers.

That said, I would still use the C PEG library, but only when memory or performance is an issue. It certainly won't be because of convenience.

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.