Saturday, July 04, 2020
Spending cache like it's going out of style
I wrote an HTML parser. It works (for me, at least; I tested it on all 5,082 posts I've made so far). But it came with one large surprise: it bloated up my gopher server something fierce, to something like eight times larger than it used to be.
Yikes!
At first I thought it might be due to the huge list of HTML entities (required to convert them to UTF-8).
A quick test revealed that not to be the case.
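For a sense of what such an entity table involves, here is a minimal Python sketch. Python and its standard `html` module are assumptions made for the example, standing in for the server's own table, and `entity_to_utf8` is a hypothetical helper name.

```python
import html
from html.entities import html5  # stdlib table: entity name -> replacement text


def entity_to_utf8(text: str) -> bytes:
    """Replace named and numeric character references, then encode as UTF-8."""
    return html.unescape(text).encode("utf-8")


if __name__ == "__main__":
    # The table is large, but it is plain data: names mapped to short strings.
    print(len(html5), "named entities in the stdlib table")
    print(entity_to_utf8("caf&eacute; &amp; cr&egrave;me"))
```

A table like this is a few thousand short strings, which fits with it not being the source of the bloat.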
The rest of the code didn't seem all that outrageous, so fearing the worst, I commented out the HTML parser.
It was the HTML parser that bloated the code.
Sigh.
Now, there is a lot of code there to do case-insensitive matching of tags and attributes, so thinking that was the culprit, I converted the code to not do that (instead of looking for <P> and <p>, just check for <p>). And while that did show a measurable decrease, it wasn't enough to justify losing the case-insensitive nature of the parser. I didn't check to see if doing a generic parse of the attributes (accept anything) would help, because again, I didn't want to lose the strict checking: it did find some typos in some older posts (mostly TILTE instead of TITLE).
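To make the trade-off concrete, here is a minimal Python sketch of case-insensitive tag matching combined with strict attribute checking. It is an illustration under assumptions, not the parser described here: the `KNOWN_ATTRIBUTES` whitelist, the regular expressions, and `check_tag` are all hypothetical, and it only handles double-quoted attribute values.

```python
import re

# Hypothetical whitelist; a real parser would carry one per supported tag.
KNOWN_ATTRIBUTES = {
    "a": {"href", "title", "class"},
    "p": {"class"},
}

# One opening tag with zero or more name="value" attributes.
TAG = re.compile(r'<([A-Za-z][A-Za-z0-9]*)((?:\s+[A-Za-z-]+="[^"]*")*)\s*>')
ATTR = re.compile(r'([A-Za-z-]+)="([^"]*)"')


def check_tag(source: str):
    """Return (tag, unknown_attributes) for one opening tag, names lowercased."""
    match = TAG.fullmatch(source)
    if match is None:
        raise ValueError(f"not an opening tag: {source!r}")
    tag = match.group(1).lower()  # <P> and <p> are treated as the same tag
    allowed = KNOWN_ATTRIBUTES.get(tag, set())
    unknown = [
        name.lower()
        for name, _value in ATTR.findall(match.group(2))
        if name.lower() not in allowed
    ]
    return tag, unknown


if __name__ == "__main__":
    print(check_tag("<P>"))
    print(check_tag('<A HREF="/about" TILTE="About me">'))
```

A generic "accept anything" attribute parse would drop the whitelist lookup, and with it the ability to flag a misspelling such as TILTE.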
I did try loading the parsing module only when required, instead of upfront, but:
- it caused a massive spike in memory utilization when a post was requested;
- it also caused a noticeable delay in generating the output as the HTML parser had to be compiled per request.
So, the question came down to this: increase latency to lower overall memory usage, or consume memory to avoid a noticeable latency?
Wait. I could just pre-render all entries as text and skip the rendering phase entirely, at the cost of yet more disk space …
So, increase time, increase memory usage, or increase disk usage?
As much as it pains me, I think I'll take the memory hit. I'm not even close to hitting swap on the server and what I have now works. And if I went with the pre-render strategy, I would have to remember to re-render all the entries every time I fix the rendering (and there are still some corner cases I want to look into).
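For illustration, here is a minimal Python sketch of how a pre-render cache could avoid the "remember to re-render" step by keying each cached file on a renderer version as well as the entry's content. Everything in it is an assumption made for the example: `RENDERER_VERSION`, `render`, `cached_render`, and the cache layout are hypothetical.

```python
import hashlib
from pathlib import Path

RENDERER_VERSION = "1"  # hypothetical: bump whenever the rendering code changes


def render(html_text: str) -> str:
    """Stand-in for the real HTML-to-text renderer."""
    return html_text.replace("<p>", "").replace("</p>", "\n")


def cached_render(source: Path, cache_dir: Path,
                  version: str = RENDERER_VERSION) -> str:
    """Return rendered text, reusing a cached copy when one matches."""
    html_text = source.read_text(encoding="utf-8")
    # The key covers both the renderer version and the entry's content,
    # so a change to either one misses the cache and forces a re-render.
    digest = hashlib.sha256(
        (version + "\0" + html_text).encode("utf-8")
    ).hexdigest()
    cached = cache_dir / f"{digest}.txt"
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    text = render(html_text)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(text, encoding="utf-8")
    return text
```

The cost is the one already named: more disk space, and in this sketch stale cache files pile up until something removes them. The first request after a renderer fix also pays the full rendering cost again.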