The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, July 04, 2020

Spending cache like it's going out of style

I wrote an HTML parser. It works (for me—I tested it on all 5,082 posts I've made so far). But it came with one large surprise—it bloated up my gopher server something fierce—something like eight times larger than it used to be.


At first I thought it might be due to the huge list of HTML entities (required to convert them to UTF-8).

A quick test revealed that not to be the case.

The rest of the code didn't seem all that outrageous, so fearing the worst, I commented out the HTML parser.

It was the HTML parser that bloated the code.


Now, there is a lot of code there to do case-insensitive matching of tags and attributes, so thinking that was the culprit, I converted the code to not do that (instead of looking for <P> and <p>, just check for <p>). And while that did show a measurable decrease, it wasn't enough to justify losing the case-insensitive nature of the parser.
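For illustration, here is a minimal Python sketch of one way to keep case-insensitive matching without a separate pattern for each case: fold the name once, then do a single lookup. This is not the parser's actual code, and the post does not test whether this approach would shrink it; `KNOWN_TAGS`, `normalize_tag`, and `is_known_tag` are hypothetical names.

```python
# Hypothetical set of tags the parser recognises (illustrative only).
KNOWN_TAGS = frozenset({"p", "a", "blockquote", "em", "strong"})


def normalize_tag(name: str) -> str:
    """Fold a tag name to lower case so <P> and <p> compare equal."""
    return name.lower()


def is_known_tag(name: str) -> bool:
    """Case-insensitive membership test using a single folded lookup."""
    return normalize_tag(name) in KNOWN_TAGS
```

With this shape, `is_known_tag("P")` and `is_known_tag("p")` give the same answer, and the case handling lives in one place.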

I didn't check to see if doing a generic parse of the attributes (accept anything) would help, because the strict attribute checking did find some typos in some older posts (mostly TILTE instead of TITLE).
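To show why strict attribute checking catches that kind of typo, here is a small Python sketch that compares each attribute against an allowed list per tag. The table contents and the names `ALLOWED_ATTRIBUTES` and `unknown_attributes` are assumptions made for the example, not the parser's real data.

```python
# Hypothetical table of attributes allowed per tag (illustrative only).
ALLOWED_ATTRIBUTES = {
    "a": {"href", "title", "class"},
    "img": {"src", "alt", "title"},
}


def unknown_attributes(tag, attributes):
    """Return the attribute names not allowed on this tag, sorted.

    Both the tag and the attribute names are compared case-insensitively.
    """
    allowed = ALLOWED_ATTRIBUTES.get(tag.lower(), set())
    return sorted(name for name in attributes if name.lower() not in allowed)
```

A parser that accepts anything would pass TILTE through silently; this check reports it.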

I did try loading the parsing module only when required, instead of upfront, but:

  1. it caused a massive spike in memory utilization when a post was requested;
  2. it also caused a noticeable delay in generating the output as the HTML parser had to be compiled per request.
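A variation on the on-demand approach is to load the module on the first request only and keep it afterwards. The Python sketch below shows that pattern; it is not what the server described here does, and `PARSER_MODULE` and `get_parser` are hypothetical names, with the standard library's `html.parser` as a stand-in for the real parsing module.

```python
import importlib

# Stand-in for the real parsing module; "html.parser" is only an example.
PARSER_MODULE = "html.parser"

_parser = None  # filled in on first use


def get_parser():
    """Import the parser module on first use and reuse it afterwards."""
    global _parser
    if _parser is None:
        _parser = importlib.import_module(PARSER_MODULE)
    return _parser
```

Note the trade-off: only the first request pays the load cost, but once loaded the module stays resident, so the long-term memory use ends up the same as loading it upfront.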

So, the question came down to this: increase latency to lower overall memory usage, or consume memory to avoid a noticeable latency?

Wait! I could just pre-render all entries as text and skip the rendering phase entirely, at the cost of yet more disk space …

So, increase time, increase memory usage, or increase disk usage?

As much as it pains me, I think I'll take the memory hit. I'm not even close to hitting swap on the server and what I have now works. And if I went with the pre-render strategy, I would have to remember to re-render all the entries every time I fixed the rendering (and there are still some corner cases I want to look into).
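For what it's worth, the "remember to re-render" problem can be handled mechanically by putting a renderer version in the cache file name, so that changing the renderer makes every old cached file stale. The Python sketch below illustrates the idea under stated assumptions: `RENDERER_VERSION`, `render`, and `cached_render` are hypothetical, and `render` is only a placeholder for the real HTML-to-text step.

```python
import os

RENDERER_VERSION = "1"  # hypothetical; bump whenever the rendering changes


def render(html_text):
    """Placeholder renderer; the real HTML-to-text step would go here."""
    return html_text.strip()


def cached_render(source_path, cache_dir):
    """Return rendered text, re-rendering if the source or renderer changed."""
    name = os.path.basename(source_path)
    cache_path = os.path.join(cache_dir, f"{name}.v{RENDERER_VERSION}.txt")

    # Use the cached copy only if it is at least as new as the source.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()

    with open(source_path, encoding="utf-8") as f:
        text = render(f.read())
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

This still costs the extra disk space, and files from older renderer versions would need to be cleaned up separately. The sketch also keys the cache on the file's base name alone, so entries with the same name in different directories would collide.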

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: Start with the base link for this site:, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.