Adventures in Formatting

Saturday, July 04, 2020

If you are reading this via Gopher and it looks a bit different, that's because I spent the past few hours (months?) working on a new method to render HTML into plain text. When I first set this up I used Lynx because it was easy and I didn't feel like writing the code to do so at the time. But I've never been fully satisfied at the results [Yeah, I was never a fan of that either –Editor]. So I finally took the time to tackle the issue (and is one of the reasons I was timing LPEG expressions ~~the other day~~ [Nope. –Editor] ~~… um … the other week~~ [Still nope. –Editor] … um … a few years ago? [Last month. –Editor] [Last month? –Sean] [Last month. –Editor] [XXXX this timeless time of COVID-19 –Sean] last month).

The first attempt sank in the swamp. I wrote some code to parse the next bit of HTML (it would return either a string, or a Lua table containing the tag information). And that was fine for recent posts where I bother to close all the tags (taking into account only the tags that can appear in the body of the document, <P>, <DT>, <DD>, <LI>, <THEAD>, <TFOOT>, <TBODY>, <TR>. <TH>, and <TD> do not require a closing tag), but in earlier posts, say, 1999 through 2002, don't follow that convention. So I was faced with two choices—fix the code to recognize when an optional closing tag was missing, or fixing over a thousand posts.

It says something about the code that I started fixing the posts first …

I then decided to change my approach and try rewriting the HTML parser over. Starting from the DTD for HTML 4.01 strict I used the re module to write the parser, but I hit some form of internal limit I'm guessing, because that one burned down, fell over, and then sank into the swamp.

I decided to go back to straight LPEG, again following the DTD to write the parser, and this time, it stayed up.

It ended up being a bit under 500 lines of LPEG code, but it does a wonderful job of being correct (for the most part—there are three posts I've made that aren't HTML 4.01 strict, so I made some allowances for those). It not only handles optional ending tags, but the one optional opening tag I have to deal with—<TBODY> (yup—both the opening and closing tag are optional). And <PRE> tags cannot contain <IMG> tags while preserving whitespace (it's not in other tags). And check for the proper attributes for each tag.

Great! I can now parse something like this:

<p>This is my <a href="http://boston.conman.org/">blog</a>.
Is this not <em>nifty?</em>

<p>Yeah, I thought so.

into this:

tag =
{
  [1] =
  {
    tag = "p",
    attributes =
    {
    },
    block = true,
    [1] = "This is my ",
    [2] =
    {
      tag = "a",
      attributes =
      {
        href = "http://boston.conman.org/",
      },
      inline = true,
      [1] = "blog",
    },
    [3] = ". Is it not ",
    [4] =
    {
      tag = "em",
      attributes =
      {
      },
      inline = true,
      [1] = "nifty?",
    },
  },

  [2] =
  {
    tag = "p",
    attributes =
    {
    },
    block = true,
    [1] = "Yeah, I thought so.",
  },
}

I then began the process of writing the code to render the resulting data into plain text. I took the classifications that the HTML 4.01 strict DTD uses for each tag (you can see the <P> tag above is of type block and the <EM> and <A> tags are type inline) and used those to write functions to handle the approriate type of content—<P> can only have inline tags, <BLOCKQUOTE> only allows block type tags, and <LI> can have both; the rendering for inline and block types are a bit different, and handling both types is a bit more complex yet.

The hard part here is ensuring that the leading characters of <BLOCKQUOTE> (wherein the rendered text each line starts with a “| ”) and of the various types of lists (dictionary, unordered and ordered lists) are handled correctly—I think there are still a few spots where it isn't quite correct.

But overall, I'm happy with the text rendering I did, but I was left with one big surprise …

The Boston Diaries

Saturday, July 04, 2020

Adventures in Formatting

Obligatory Picture

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer