The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, July 04, 2020

Adventures in Formatting

If you are reading this via Gopher and it looks a bit different, that's because I spent the past few hours (months?) working on a new method to render HTML into plain text. When I first set this up I used Lynx because it was easy and I didn't feel like writing the code to do so at the time. But I've never been fully satisfied with the results [Yeah, I was never a fan of that either –Editor]. So I finally took the time to tackle the issue (and it is one of the reasons I was timing LPEG expressions the other day [Nope. –Editor] … um … the other week [Still nope. –Editor] … um … a few years ago? [Last month. –Editor] [Last month? –Sean] [Last month. –Editor] [XXXX this timeless time of COVID-19 –Sean] last month).

The first attempt sank in the swamp. I wrote some code to parse the next bit of HTML (it would return either a string, or a Lua table containing the tag information). And that was fine for recent posts, where I bother to close all the tags (taking into account only the tags that can appear in the body of the document, <P>, <DT>, <DD>, <LI>, <THEAD>, <TFOOT>, <TBODY>, <TR>, <TH>, and <TD> do not require a closing tag), but earlier posts, say, 1999 through 2002, don't follow that convention. So I was faced with two choices: fix the code to recognize when an optional closing tag was missing, or fix over a thousand posts.
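To give a rough idea of what "parse the next bit of HTML" means, here is a minimal sketch in plain Lua string patterns. It is not the code from the blog: the name next_token, the close field, and the restriction to double-quoted attribute values without a ">" in them are all assumptions made for this example.

```lua
-- Hypothetical sketch: return the next piece of HTML starting at `pos`.
-- Text comes back as a string, a tag comes back as a table.
-- The second return value is the position to continue from.
local function next_token(html, pos)
  if pos > #html then return nil end

  -- closing tag, e.g. </em>
  local name, stop = html:match("^</(%a%w*)%s*>()", pos)
  if name then
    return { tag = name:lower(), close = true }, stop
  end

  -- opening tag with optional double-quoted attributes (assumed format)
  local attrs
  name, attrs, stop = html:match("^<(%a%w*)([^>]*)>()", pos)
  if name then
    local token = { tag = name:lower(), attributes = {} }
    for key, value in attrs:gmatch('([%w%-]+)%s*=%s*"([^"]*)"') do
      token.attributes[key:lower()] = value
    end
    return token, stop
  end

  -- otherwise, plain text up to the next '<'
  local text
  text, stop = html:match("^([^<]+)()", pos)
  if text then return text, stop end

  -- a stray '<' that starts no tag is passed through as text
  return "<", pos + 1
end
```

Calling it in a loop, feeding the returned position back in, walks the document one string or tag at a time. Note that nothing here knows about missing closing tags, which is the problem described above.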

It says something about the code that I started fixing the posts first …

I then decided to change my approach and rewrite the HTML parser from scratch. Starting from the DTD for HTML 4.01 strict, I used the re module to write the parser, but I'm guessing I hit some form of internal limit, because that one burned down, fell over, and then sank into the swamp.

I decided to go back to straight LPEG, again following the DTD to write the parser, and this time, it stayed up.

It ended up being a bit under 500 lines of LPEG code, but it does a wonderful job of being correct (for the most part; there are three posts I've made that aren't HTML 4.01 strict, so I made some allowances for those). It not only handles optional ending tags, but also the one optional opening tag I have to deal with, <TBODY> (yup, both the opening and closing tags are optional). It also knows that <PRE> tags cannot contain <IMG> tags and that whitespace is preserved inside <PRE> (it's not in other tags). And it checks for the proper attributes for each tag.
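The idea behind optional ending tags can be sketched without LPEG at all. What follows is a simplified stand-in, not the DTD-driven grammar described above: it takes a list of tokens (strings and tag tables, an assumed shape) and keeps a stack of open elements. The closes table, the void table, and the name build_tree are hypothetical, and the sketch ignores the optional <TBODY> opening tag, attribute checking, and the rule that any block tag also ends an open <P>.

```lua
-- Hypothetical table: which open elements a start tag implicitly closes.
local closes = {
  p  = { p = true },
  li = { li = true },
  dt = { dt = true, dd = true },
  dd = { dt = true, dd = true },
  tr = { tr = true, td = true, th = true },
  td = { td = true, th = true },
  th = { td = true, th = true },
}

-- Elements that never have content, so they are never left open.
local void = { br = true, hr = true, img = true }

local function build_tree(tokens)
  local root  = { tag = "#root" }
  local stack = { root }          -- currently open elements

  for _, token in ipairs(tokens) do
    if type(token) == "string" then
      local top = stack[#stack]
      top[#top + 1] = token
    elseif token.close then
      -- find the matching open element; anything above it is closed too
      for i = #stack, 2, -1 do
        if stack[i].tag == token.tag then
          for _ = #stack, i, -1 do table.remove(stack) end
          break
        end
      end
    else
      -- a start tag first closes any open element it cannot sit inside
      local implied = closes[token.tag]
      while implied and implied[stack[#stack].tag] do
        table.remove(stack)
      end
      local node   = { tag = token.tag, attributes = token.attributes or {} }
      local parent = stack[#stack]
      parent[#parent + 1] = node
      if not void[token.tag] then stack[#stack + 1] = node end
    end
  end

  return root
end
```

Because only the innermost open element is checked, an <LI> inside a nested <UL> does not close the <LI> of the outer list, while a second <P> directly after an unclosed <P> does close it.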

Great! I can now parse something like this:

<p>This is my <a href="http://boston.conman.org/">blog</a>.
Is this not <em>nifty?</em>

<p>Yeah, I thought so.

into this:

tag =
{
  [1] =
  {
    tag = "p",
    attributes =
    {
    },
    block = true,
    [1] = "This is my ",
    [2] =
    {
      tag = "a",
      attributes =
      {
        href = "http://boston.conman.org/",
      },
      inline = true,
      [1] = "blog",
    },
    [3] = ". Is this not ",
    [4] =
    {
      tag = "em",
      attributes =
      {
      },
      inline = true,
      [1] = "nifty?",
    },
  },

  [2] =
  {
    tag = "p",
    attributes =
    {
    },
    block = true,
    [1] = "Yeah, I thought so.",
  },
}

I then began the process of writing the code to render the resulting data into plain text. I took the classifications that the HTML 4.01 strict DTD uses for each tag (you can see the <P> tag above is of type block and the <EM> and <A> tags are type inline) and used those to write functions to handle the appropriate type of content: <P> can only have inline tags, <BLOCKQUOTE> only allows block type tags, and <LI> can have both. The rendering for inline and block types is a bit different, and handling both types is a bit more complex yet.
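A stripped-down illustration of that split, using the block and inline flags from the parsed table above. The function names render_inline and render_block are made up for this example, and the real rendering does far more (links, emphasis markers, line wrapping); this only shows how a node with mixed children, such as <LI>, can gather runs of inline content between its block children.

```lua
local render_inline, render_block

-- Inline content is flattened into a single run of text.
function render_inline(node)
  if type(node) == "string" then return node end
  local parts = {}
  for _, child in ipairs(node) do
    parts[#parts + 1] = render_inline(child)
  end
  return table.concat(parts)
end

-- Block content is appended to `out`, one string per paragraph.
function render_block(node, out)
  local run = {}                 -- pending inline children

  local function flush()
    if #run > 0 then
      out[#out + 1] = table.concat(run)
      run = {}
    end
  end

  for _, child in ipairs(node) do
    if type(child) == "table" and child.block then
      flush()                    -- a block child ends the current run
      render_block(child, out)
    else
      run[#run + 1] = render_inline(child)
    end
  end

  flush()
  return out
end
```

Passing the parsed document and an empty table to render_block returns the list of paragraphs, ready for wrapping and prefixing.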

The hard part here is ensuring that the leading characters of <BLOCKQUOTE> (wherein each line of the rendered text starts with a “| ”) and of the various types of lists (definition, unordered and ordered lists) are handled correctly. I think there are still a few spots where it isn't quite correct.
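One way to think about those leading characters is to carry two prefixes around: one for the first line of a block and one for every line after it. The sketch below is an assumption about how this could be done, not the code used here; the name wrap and its parameters are hypothetical, and it counts bytes rather than displayed characters, so non-ASCII text would wrap a little early.

```lua
-- Word-wrap `text` to `width` columns. The first output line starts with
-- `first`, every following line with `rest`.
local function wrap(text, first, rest, width)
  local lines, line = {}, ""
  local prefix = first

  for word in text:gmatch("%S+") do
    if line == "" then
      line = word
    elseif #prefix + #line + 1 + #word > width then
      lines[#lines + 1] = prefix .. line   -- line is full, emit it
      prefix = rest
      line   = word
    else
      line = line .. " " .. word
    end
  end

  if line ~= "" then lines[#lines + 1] = prefix .. line end
  return lines
end
```

For a blockquote both prefixes are "| ". For an unordered list item the first is "* " and the rest is two spaces, and nesting is a matter of concatenating the prefixes of the enclosing blocks in front of those.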

Overall, I'm happy with the text rendering I did, but I was left with one big surprise …

Copyright © 1999-2020 by Sean Conner. All Rights Reserved.