The Boston Diaries

Saturday, July 04, 2020

Adventures in Formatting

If you are reading this via Gopher and it looks a bit different, that's because I spent the past few hours (months?) working on a new method to render HTML into plain text. When I first set this up I used Lynx because it was easy and I didn't feel like writing the code to do so at the time. But I've never been fully satisfied with the results [Yeah, I was never a fan of that either –Editor]. So I finally took the time to tackle the issue (and it is one of the reasons I was timing LPEG expressions ~~the other day~~ [Nope. –Editor] ~~… um … the other week~~ [Still nope. –Editor] … um … a few years ago? [Last month. –Editor] [Last month? –Sean] [Last month. –Editor] [XXXX this timeless time of COVID-19 –Sean] last month).

The first attempt sank in the swamp. I wrote some code to parse the next bit of HTML (it would return either a string, or a Lua table containing the tag information). And that was fine for recent posts where I bother to close all the tags (taking into account only the tags that can appear in the body of the document, <P>, <DT>, <DD>, <LI>, <THEAD>, <TFOOT>, <TBODY>, <TR>, <TH>, and <TD> do not require a closing tag), but earlier posts, say, 1999 through 2002, don't follow that convention. So I was faced with two choices: fix the code to recognize when an optional closing tag was missing, or fix over a thousand posts.
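To make the "string or table" idea concrete, here is a minimal sketch of that kind of tokenizer. It is written in Python rather than the Lua the real code uses, and the names `next_token` and `tokens`, the regular expressions, and the shape of the returned dictionary are all assumptions made for illustration, not the actual code.

```python
import re

# Hypothetical patterns: a start or end tag, and a double-quoted attribute.
TAG = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)((?:\s+[^<>]*)?)>')
ATTR = re.compile(r'([A-Za-z-]+)\s*=\s*"([^"]*)"')

def next_token(html, pos=0):
    """Return (token, new_pos); token is a str of text or a dict describing a tag."""
    if pos >= len(html):
        return None, pos
    m = TAG.match(html, pos)
    if m:
        closing, name, raw = m.groups()
        token = {
            "tag": name.lower(),
            "closing": closing == "/",
            "attributes": {k.lower(): v for k, v in ATTR.findall(raw)},
        }
        return token, m.end()
    # Not a tag: consume text up to the next '<' (or the end of input).
    end = html.find("<", pos + 1)
    if end == -1:
        end = len(html)
    return html[pos:end], end

def tokens(html):
    """Collect every token in the input, in order."""
    pos, out = 0, []
    while True:
        tok, pos = next_token(html, pos)
        if tok is None:
            return out
        out.append(tok)
```

A tokenizer like this has no idea that a second <P> ends the first one, which is exactly the problem with the older posts.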

It says something about the code that I started fixing the posts first …

I then decided to change my approach and rewrite the HTML parser from scratch. Starting from the DTD for HTML 4.01 strict, I used LPEG's re module to write the parser, but I'm guessing I hit some form of internal limit, because that one burned down, fell over, and then sank into the swamp.

I decided to go back to straight LPEG, again following the DTD to write the parser, and this time, it stayed up.

It ended up being a bit under 500 lines of LPEG code, but it does a wonderful job of being correct (for the most part; there are three posts I've made that aren't HTML 4.01 strict, so I made some allowances for those). It not only handles optional ending tags, but also the one optional opening tag I have to deal with, <TBODY> (yup, both the opening and closing tags are optional). It also ensures that <PRE> tags do not contain <IMG> tags, preserves whitespace inside <PRE> (it isn't preserved in other tags), and checks for the proper attributes for each tag.
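The optional ending tags are the interesting part, so here is a rough sketch of one way to infer them while building a tree. It is in Python, not LPEG, and it is not how the grammar-based parser works: the `IMPLIED_END` table is a hypothetical, much-reduced stand-in for what the DTD specifies, and `build_tree` and its token format are made up for the example.

```python
# Hypothetical rule table: while the key element is open, a start tag
# from its set implies that the key element has ended.
IMPLIED_END = {
    "p":  {"p", "ul", "ol", "dl", "li", "dt", "dd", "blockquote", "pre", "table"},
    "li": {"li"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
}

def build_tree(tokens):
    """tokens: a list of str (text) or (kind, name) tuples, kind being 'open' or 'close'."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for tok in tokens:
        if isinstance(tok, str):
            stack[-1]["children"].append(tok)
            continue
        kind, name = tok
        if kind == "open":
            # Close every open element whose end this start tag implies.
            while name in IMPLIED_END.get(stack[-1]["tag"], ()):
                stack.pop()
            node = {"tag": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif any(n["tag"] == name for n in stack[1:]):
            # An explicit end tag also ends any still-open descendants;
            # a stray end tag with no matching open element is ignored.
            while stack.pop()["tag"] != name:
                pass
    return root
```

Fed the two-paragraph sample from this post, the second <P> closes the first, so the root ends up with two sibling paragraphs instead of one nested inside the other. It does not cover the optional <TBODY> start tag or any attribute checking.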

Great! I can now parse something like this:

<p>This is my <a href="http://boston.conman.org/">blog</a>.
Is this not <em>nifty?</em>

<p>Yeah, I thought so.

into this:

tag =
{
  [1] =
  {
    tag = "p",
    attributes =
    {
    },
    block = true,
    [1] = "This is my ",
    [2] =
    {
      tag = "a",
      attributes =
      {
        href = "http://boston.conman.org/",
      },
      inline = true,
      [1] = "blog",
    },
    [3] = ". Is this not ",
    [4] =
    {
      tag = "em",
      attributes =
      {
      },
      inline = true,
      [1] = "nifty?",
    },
  },

  [2] =
  {
    tag = "p",
    attributes =
    {
    },
    block = true,
    [1] = "Yeah, I thought so.",
  },
}

I then began the process of writing the code to render the resulting data into plain text. I took the classifications that the HTML 4.01 strict DTD uses for each tag (you can see the <P> tag above is of type block and the <EM> and <A> tags are type inline) and used those to write functions to handle the appropriate type of content: <P> can only have inline tags, <BLOCKQUOTE> only allows block type tags, and <LI> can have both. The rendering for inline and block types is a bit different, and handling both types is a bit more complex yet.
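The split between inline and block rendering can be sketched roughly as follows. This is Python, not the actual Lua renderer. The node layout (a `tag` plus a `children` list), the function names, the 40-column width, and the `*` markers for emphasis are all assumptions for illustration.

```python
import textwrap

# Assumed markers for inline emphasis; the real renderer may do something else.
INLINE_MARK = {"em": "*", "strong": "**"}

def render_inline(children):
    """Flatten inline content (text and inline nodes) into a single string."""
    parts = []
    for child in children:
        if isinstance(child, str):
            parts.append(child)
        else:
            mark = INLINE_MARK.get(child["tag"], "")
            parts.append(mark + render_inline(child["children"]) + mark)
    return "".join(parts)

def render_block(node, width=40):
    """Render a block node as a list of output lines."""
    if node["tag"] == "p":
        # Inline-only content: collapse whitespace, then wrap to the width.
        text = " ".join(render_inline(node["children"]).split())
        return textwrap.wrap(text, width)
    # Block-only containers such as <blockquote>: render each child block,
    # separated by a blank line.
    lines = []
    for child in node["children"]:
        if isinstance(child, str):
            continue  # whitespace between blocks carries no content
        if lines:
            lines.append("")
        lines.extend(render_block(child, width))
    return lines
```

The mixed case, <LI> holding both inline and block content, is left out here; it would need to group runs of inline children into implicit paragraphs before handing them to the block renderer.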

The hard part here is ensuring that the leading characters of <BLOCKQUOTE> (wherein each line of the rendered text starts with a “| ”) and of the various types of lists (definition, unordered and ordered lists) are handled correctly. I think there are still a few spots where it isn't quite correct.
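One way to keep the leading characters straight is to wrap text with a separate prefix for the first line and for continuation lines, and to add the blockquote marker only after the inner block has been rendered. The Python sketch below shows that idea; the helper names, the 30-column width, and the "N. " list markers are hypothetical, and only the “| ” prefix comes from the actual rendering.

```python
import textwrap

def prefixed(text, first, rest, width=30):
    """Wrap text so the first line starts with `first` and later lines with `rest`."""
    return textwrap.wrap(text, width, initial_indent=first, subsequent_indent=rest)

def quote(lines):
    """Prefix every already-rendered line with the blockquote marker."""
    # rstrip() avoids trailing whitespace on blank lines (an assumed choice).
    return [("| " + line).rstrip() for line in lines]

def ordered_list(items, width=30):
    """Render items as '1. ', '2. ', ... with continuation lines aligned under the text."""
    lines = []
    pad = len(str(len(items))) + 2  # room for the widest "N." marker plus a space
    for n, item in enumerate(items, 1):
        marker = f"{n}.".ljust(pad)
        lines.extend(prefixed(item, marker, " " * pad, width))
    return lines
```

A list inside a blockquote is then `quote(ordered_list(items, width - 2))`, where the inner width shrinks by the two characters the “| ” prefix takes up. Forgetting that adjustment at some nesting level is an easy way to end up with lines that are slightly too long.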

Overall, I'm happy with the text rendering I did, but I was left with one big surprise …

Spending cache like it's going out of style

I wrote an HTML parser. It works (for me—I tested it on all 5,082 posts I've made so far). But it came with one large surprise—it bloated up my gopher server something fierce—something like eight times larger than it used to be.

Yikes!

At first I thought it might be due to the huge list of HTML entities (required to convert them to UTF-8).
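For a sense of what that entity list does, here is a small Python illustration using the standard library's table of HTML 4 entity names. The gopher server uses its own Lua table, so this only shows the kind of mapping involved, and `entity_to_utf8` is a hypothetical helper.

```python
import html
from html.entities import name2codepoint

def entity_to_utf8(name):
    """Map an HTML 4 entity name (without '&' and ';') to its UTF-8 bytes."""
    return chr(name2codepoint[name]).encode("utf-8")

# html.unescape() handles named and numeric character references alike.
text = html.unescape("caf&eacute; &#169; 2020")
utf8 = text.encode("utf-8")
```

The table is long, but each entry is just a short name and a code point, which fits with it not being the source of the bloat.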

A quick test revealed that not to be the case.

The rest of the code didn't seem all that outrageous, so fearing the worst, I commented out the HTML parser.

It was the HTML parser that bloated the code.

Sigh.

Now, there is a lot of code there to do case-insensitive matching of tags and attributes, so thinking that was the culprit, I converted the code to not do that (instead of looking for <P> and <p>, just check for <p>). And while that did show a measurable decrease, it wasn't enough to justify losing the case-insensitive nature of the parser.

I didn't check to see if doing a generic parse of the attributes (accept anything) would help, because checking the attributes did find some typos in some older posts (mostly TILTE instead of TITLE).

I did try loading the parsing module only when required, instead of upfront, but:

it caused a massive spike in memory utilization when a post was requested;
it also caused a noticeable delay in generating the output as the HTML parser had to be compiled per request.

So, the question came down to this: increase latency to lower overall memory usage, or consume memory to decrease a noticeable latency?

Wait? I could just pre-render all entries as text and skip the rendering phase entirely, at the cost of yet more disk space …

So, increase time, increase memory usage, or increase disk usage?

As much as it pains me, I think I'll take the memory hit. I'm not even close to hitting swap on the server and what I have now works. If I fix the rendering (and there are still some corner cases I want to look into) I would have to remember to re-render all the entries if I do the pre-render strategy.

A relatively quiet Fourth of July

This has been one of the quietest Fourths of July I've experienced. It's been completely overcast with the roll of thunder off in the distance, the city of Boca Raton cancelled their fireworks show, and our neighbor across the street decided to celebrate with his backyard neighbor on the next street over.

Yes, there's the occasional burst of fireworks here and there, but it sounds nowhere near the levels of a war zone as it has in the past.

Happy Fourth of July! And keep safe out there!
