Saturday, July 04, 2020
Adventures in Formatting
If you are reading this via Gopher and it looks a bit different,
that's because I spent the past few hours (months?) working on a new method to render HTML into plain text.
When I first set this up I used Lynx because it was easy and I didn't feel like writing the code to do so at the time.
But I've never been fully satisfied with the results [Yeah, I was never a fan of that either –Editor].
So I finally took the time to tackle the issue
(and this is one of the reasons I was timing LPEG expressions the other day
[Nope. –Editor] … um … the other week
[Still nope. –Editor] … um … a few years ago?
[Last month. –Editor]
[Last month? –Sean]
[Last month. –Editor]
[XXXX this timeless time of COVID-19 –Sean]
last month).
The first attempt sank in the swamp.
I wrote some code to parse the next bit of HTML
(it would return either a string,
or a Lua table containing the tag information).
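To give a rough idea of that first approach, here is a fresh sketch of the "next chunk" idea (illustrative only, not the code that sank):

local lpeg = require "lpeg"
local C, Cg, Ct, P, R = lpeg.C, lpeg.Cg, lpeg.Ct, lpeg.P, lpeg.R

-- return the next chunk of input: either plain text up to the next tag,
-- or a tag, returned as a table with the tag name in it
local text  = C((P(1) - P"<")^1)
local name  = C(R("AZ","az")^1)
local tag   = P"<" * P"/"^-1 * Ct(Cg(name,"tag")) * (P(1) - P">")^0 * P">"
local token = text + tag

print(token:match("Hello <em>world</em>"))   -- "Hello " (a string)
print(token:match("<em>world</em>").tag)     -- "em"     (a table)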
And that was fine for recent posts where I bother to close all the tags (taking into account only the tags that can appear in the body of the document, <P>, <DT>, <DD>, <LI>, <THEAD>, <TFOOT>, <TBODY>, <TR>, <TH>, and <TD> do not require a closing tag), but earlier posts, say from 1999 through 2002, don't follow that convention.
So I was faced with two choices—fix the code to recognize when an optional closing tag was missing,
or fix over a thousand posts.
It says something about the code that I started fixing the posts first …
I then decided to change my approach and rewrite the HTML parser from scratch.
Starting from the DTD for HTML 4.01 strict, I used the re module to write the parser, but I'm guessing I hit some form of internal limit, because that one burned down, fell over, and then sank into the swamp.
I decided to go back to straight LPEG, again following the DTD to write the parser, and this time, it stayed up.
It ended up being a bit under 500 lines of LPEG code,
but it does a wonderful job of being correct
(for the most part—there are three posts I've made that aren't HTML 4.01 strict,
so I made some allowances for those).
It not only handles optional closing tags, but also the one optional opening tag I have to deal with—<TBODY> (yup—both the opening and the closing tag are optional). It knows that <PRE> tags cannot contain <IMG> tags, and it preserves whitespace inside <PRE> (whitespace isn't preserved in other tags). It also checks for the proper attributes for each tag.
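To give a flavor of how optional closing tags can be expressed in LPEG, here's a minimal sketch of the idea (nowhere near the real grammar, and only handling <P>):

local lpeg = require "lpeg"
local C, Cc, Cg, Ct, P, V = lpeg.C, lpeg.Cc, lpeg.Cg, lpeg.Ct, lpeg.P, lpeg.V

-- a <p> ends at an explicit </p>, at the start of the next <p>, or at the
-- end of the input, so the closing tag is effectively optional
local html = P{
  "body",
  body = Ct(V"p"^0) * -1,
  p    = P"<p>"
       * Ct(Cg(Cc"p","tag") * V"text")
       * (P"</p>" + #P"<p>" + -P(1)),
  text = C((P(1) - P"<")^1)^0,
}

local doc = html:match("<p>This closes properly.</p><p>This one does not.")
print(doc[2].tag,doc[2][1])   -- p    This one does not.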
Great! I can now parse something like this:
<p>This is my <a href="http://boston.conman.org/">blog</a>. Is this not <em>nifty?</em> <p>Yeah, I thought so.
into this:
tag = {
  [1] = {
    tag = "p",
    attributes = { },
    block = true,
    [1] = "This is my ",
    [2] = {
      tag = "a",
      attributes = { href = "http://boston.conman.org/", },
      inline = true,
      [1] = "blog",
    },
    [3] = ". Is it not ",
    [4] = {
      tag = "em",
      attributes = { },
      inline = true,
      [1] = "nifty?",
    },
  },
  [2] = {
    tag = "p",
    attributes = { },
    block = true,
    [1] = "Yeah, I thought so.",
  },
}
I then began the process of writing the code to render the resulting data into plain text.
I took the classifications that the HTML 4.01 strict DTD uses for each tag (you can see the <P> tag above is of type block and the <EM> and <A> tags are of type inline) and used those to write functions to handle the appropriate type of content—<P> can only have inline tags, <BLOCKQUOTE> only allows block type tags, and <LI> can have both; the rendering for inline and block types is a bit different, and handling both types is a bit more complex yet.
The hard part here is ensuring that the leading characters of <BLOCKQUOTE>
(wherein each line of the rendered text starts with a “| ”)
and of the various types of lists (dictionary, unordered and ordered lists) are handled correctly—I think there are still a few spots where it isn't quite correct.
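The blockquote handling boils down to something like this (a simplified sketch of the idea, not the actual rendering code):

-- prefix every already-wrapped line of a blockquote with "| "
local function render_blockquote(lines)
  local out = {}
  for _,line in ipairs(lines) do
    out[#out+1] = "| " .. line
  end
  return table.concat(out,"\n")
end

print(render_blockquote { "It's just a flesh wound.", "I've had worse." })
-- | It's just a flesh wound.
-- | I've had worse.

The tricky part is when lists nest inside blockquotes (or the other way around), since each level contributes its own leading characters.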
But overall, I'm happy with the text rendering I did, but I was left with one big surprise …
Spending cache like it's going out of style
I wrote an HTML parser. It works (for me—I tested it on all 5,082 posts I've made so far). But it came with one large surprise—it bloated up my gopher server something fierce—something like eight times larger than it used to be.
Yikes!
At first I thought it might be due to the huge list of HTML entities (required to convert them to UTF-8).
A quick test revealed that not to be the case.
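For reference, the entity conversion is conceptually just a big lookup table along these lines (a tiny, made-up excerpt; the real list covers all the named entities):

-- a few named entities mapped to their UTF-8 byte sequences
local entities = {
  amp    = "&",
  lt     = "<",
  gt     = ">",
  eacute = "\195\169",       -- U+00E9, e with acute accent
  mdash  = "\226\128\148",   -- U+2014, em dash
}

local function entity_to_utf8(name)
  return entities[name] or ("&" .. name .. ";")
end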
The rest of the code didn't seem all that outrageous, so fearing the worst, I commented out the HTML parser.
It was the HTML parser that bloated the code.
Sigh.
Now, there is a lot of code there to do case-insensitive matching of tags and attributes, so thinking that was the culprit, I converted the code to not do that (instead of looking for <P> and <p>, just check for <p>). And while that did show a measurable decrease, it wasn't enough to justify losing the case-insensitive nature of the parser.
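The usual LPEG trick for case-insensitive literals builds the pattern character by character, which adds up across every tag and attribute name (a sketch of the technique, not my actual code):

local lpeg = require "lpeg"
local P, S = lpeg.P, lpeg.S

-- build a pattern that matches the given text regardless of case
local function ci(text)
  local pattern = P(true)
  for c in text:gmatch(".") do
    pattern = pattern * S(c:lower() .. c:upper())
  end
  return pattern
end

local open_p = P"<" * ci"p" * P">"
print(open_p:match("<P>"))   -- 4
print(open_p:match("<p>"))   -- 4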
I didn't check to see if doing a generic parse of the attributes (accept anything) would help, because, again, the strict parsing did find some typos in some older posts (mostly TILTE instead of TITLE).
I did try loading the parsing module only when required, instead of upfront, but:
- it caused a massive spike in memory utilization when a post was requested;
- it also caused a noticeable delay in generating the output, as the HTML parser had to be compiled per request.
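The on-demand loading amounted to something like this (a sketch; the module name is hypothetical):

local html   -- parser module, deliberately not loaded at startup

local function get_parser()
  if not html then
    html = require "html"   -- the first request pays for compiling the grammar
  end
  return html
end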
So, the question came down to—increase latency to lower overall memory usage, or consume memory to decrease a noticeable latency?
Wait? I could just pre-render all entries as text and skip the rendering phase entirely, at the cost of yet more disk space …
So, increase time, increase memory usage, or increase disk usage?
As much as it pains me, I think I'll take the memory hit. I'm not even close to hitting swap on the server, and what I have now works. And if I went with the pre-render strategy, I would have to remember to re-render all the entries whenever I fix the rendering (and there are still some corner cases I want to look into).
A relatively quiet Fourth of July
This has been one of the quietest Fourths of July I've experienced. It's been completely overcast with the roll of thunder off in the distance, the city of Boca Raton cancelled their fireworks show, and our neighbor across the street decided to celebrate with his backyard neighbor on the next street over.
Yes, there's the occasional burst of fireworks here and there, but it sounds nowhere near the levels of a war zone as it has in the past.
Happy Fourth of July! And keep safe out there!
Tuesday, July 07, 2020
The magic tricks may be simple, but that's not to say they're easy to perform
Bunny and I watch “Fool Us,” a show where Penn & Teller showcase a number of magicians and try to figure out how the trick was performed. It's a cool show and we will often rewind bits (and again, and again) to see if we can spot the trick.
On the recent show we saw Wes Iseli whose trick with a single 50¢ piece fooled Penn & Teller. He walked out on stage, handed Alyson Hannigan (the hostess) a “prediction” for her to hold on to. He then had the entire audience stand up and had them call heads or tails as he flipped the 50¢ piece. After about ten flips, there was a single audience member left standing. Wes then asked Alyson to read the “prediction” he made, and it described the audience member left standing.
I think I know how it was done, and the only reason it fooled Penn & Teller was due to a bad guess on Penn's part (if they know of several ways a trick can be done, and they guess the wrong one, they're fooled). The thing about magic is that oftentimes, the "trick" is so simple that once you know how it's done, it's like "that's it? That's how it was done?"
For instance, way back in the second season, Wes Barker did a trick where he speared a page from a phone book with a sword which fooled Penn & Teller. He recently revealed how he did the trick (because, as he stated, phone books don't exist anymore). The trick was stupidly simple and by overthinking the trick, Penn was fooled. Oh, and it was interesting to learn yet another method of tearing up a phone book (as if we'll ever have a phone book to rip up).
Another instance of a very simple method tricking Penn & Teller is a recent trick by Eric Leclerc. He did the "needle in a haystack" trick, but this time finding a packing peanut marked by Teller in a huge box of packing peanuts. And again, it was a very simple trick. It's amazing how simple these tricks really are. It's almost like they're cheating. And in a way, I guess they are.
Monday, July 13, 2020
A twisty maze of little redirects, all alike
I have The Electric King James Bible. I then ported it to gopher. Sometime after that, I ported it again, this time to Gemini. I then received an email from Natalie Pendragon, who runs GUS, about the infinite redirection that happens when you try to read the Book of Job via Gemini.
Sure enough, when I visited The Book of Job on Gemini, I ended up in a maze of twisty little redirects, all alike.
So there's this file I have that lists not only the books of the Bible, but also the abbreviations for each book, so instead of having to type http://bible.conman.org/kj/Genesis.1:1 you can type http://bible.conman.org/kj/Ge.1:1 and it'll do The Right Thing™. Only for Job there is no abbreviation—instead, I have “Job” listed as the abbreviation (and the same issue goes for Joel). I guess I handled that case in the web version (don't let the timestamps fool you—I imported it into git ten years ago, but wrote the code over twenty years ago). The gopher version doesn't do redirections, so it doesn't matter there, but the Gemini version does do redirections, and I didn't check that condition.
Oops.
The issue has now been fixed.
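The fix is essentially a guard against redirecting a name to itself, something along these lines (an illustrative sketch; the table and the return convention are made up):

-- abbreviation (or full name) to canonical book name
local books = {
  Genesis = "Genesis",
  Ge      = "Genesis",
  Job     = "Job",    -- the "abbreviation" is the full name
  Joel    = "Joel",   -- same here
}

local function resolve(book)
  local canonical = books[book]
  if canonical and canonical ~= book then
    return "redirect",canonical     -- e.g. Ge.1:1 redirects to Genesis.1:1
  end
  return "serve",canonical or book  -- already canonical, so serve it directly
end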
Wednesday, July 15, 2020
There's a disconnect somewhere
I received the following email from The Corporate Overlords:
All,
Microsoft is currently reporting service issue with the Outlook email client globally. They are rolling out fixes, however it may take several hours to complete.
Outlook on the web and mobile clients are unaffected. If you are currently experiencing Outlook crashes or problems accessing email via Outlook on your computer, please login with your [Corporate Overlord assigned] credentials to
https://outlook.office.com
or use the Outlook mail app on your mobile device.
There's an issue with using email, and they notify the users of this issue, by using email. I mean, I got it because I use the web version of Lookout. I have to wonder how many other people at the Corporation know of the issue …
Thursday, July 16, 2020
Adventures in Formatting II: Gemini Boogaloo
If you are reading this via Gemini, then welcome to my blog! Emboldened by converting HTML to text for gopher, I decided to try my hand at converting HTML to the native Gemini text format (section 5 of the specification). I'm less than thrilled with the results, but given the constraints, I don't think I could do a better job than I have.
The format has similarities to Markdown but is simpler, and you can't embed HTML (which is what you fall back to in Markdown when it has no syntax to do what you want). I mean, given that Gemini supports serving up any type of content, I could have just served up HTML, but well, I have the webserver for that. And I could have just used the plain text format I use for gopher, but the Gemini text format does allow for links, and I like my links (even if external links have a half-life of about a year).
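For those who haven't seen it, the Gemini text format looks roughly like this (a made-up snippet; the URLs are placeholders):

# A heading
Normal paragraphs are just lines of text, wrapped by the client.

=> gemini://example.com/an-entry.gmi A link, which must sit on a line of its own
=> https://example.com/ Links to other protocols work the same way

> A quoted line
* An unordered list item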
Most of the entries will look okay; it's only the occasional entry with deeply nested HTML that might look weird.
And yes, the size of the server bloated quite a bit since I reused the HTML parser, but it's something I'll just have to live with for now.
Monday, July 20, 2020
Adventures in Atom
I offer up several feeds for my blog, one of the formats being Atom, which is much nicer (and better specified) than RSS (which I also offer, by the way). There exists an Atom feed aggregator for Gemini. The Atom specification allows zero or more links per entry. I discussed this with the author of CAPCOM (the Atom feed aggregator for Gemini) and it can deal with multiple links. I don't know about any other Atom aggregators out there, but I'm going to find out, because each entry now has three links—one to the web version, one to the Gemini version and one to the gopher version.
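So each entry in the Atom feed now carries something like this (an illustrative fragment; the hrefs are placeholders, not my real URLs):

<entry>
  <title>Adventures in Atom</title>
  <link rel="alternate" type="text/html"   href="http://www.example.com/2020/07/20.1"/>
  <link rel="alternate" type="text/gemini" href="gemini://gemini.example.com/2020/07/20.1"/>
  <link rel="alternate"                    href="gopher://gopher.example.com/0/2020/07/20.1"/>
  ...
</entry>

Atom explicitly allows more than one alternate link per entry as long as they don't share the same combination of type and hreflang, which is exactly the situation here.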
I have a feeling this is going to be fun, and by “fun,” I mean “it will probably break things somewhere because who is stupid enough to use more than one link?”
Update a few moments later …
Of course, because of the way I translate links for the gopher version, those using gopher won't see the web version of the link. Sigh.
We're a long way from the halcyon days of ASCII-only text
Rendering text, how hard could it be? As it turns out, incredibly hard! To my knowledge, literally no system renders text "perfectly". It's all best-effort, although some efforts are more important than others.
I don't know if I'm glad or sad I didn't read this article before rendering HTML into different formats. On the one hand, it might have completely discouraged me from even starting. On the other hand, I don't use the full span of Unicode characters in my blog posts (which I'm very thankful for, after doing all this work). But on the gripping hand, Lynx was doing a horrible job at formatting HTML into text, and what I have now is much better looking.
So let's see … we now have three hard things in Computer Science:
- cache invalidation;
- naming things;
- text rendering;
- and “off-by-one” errors.
Yup, that seems about right.
Friday, July 24, 2020
Why not open the office back up on Halloween? It'd be thematically cool to do so
I received a work email today, notifying everybody at the Ft. Lauderdale Office of the Corporation, that the office won't reopen until Monday, October 5TH. We were supposed to open in late June, maybe early July, but given that it is now late July, the Powers That Be decided it might be better to just wait several more months than to continuously plan to open the office only to have to push the date back.
I can't say I'm upset at the decision.
And here I thought web bots were bad
I suppose it was only a matter of time, but the bad web robot behavior has finally reached Gemini. There's a bot out there that made 42,766 requests in the past 27 hours (so not quite one every two seconds) until I got fed up with it and blocked it at the firewall. And according to my firewall, it's still trying to make requests. That tells me that whatever it is, it's running unattended. And several other people running Gemini servers have reported seeing the same client hammering their systems as well.
Now, while the requests average out to about one every two seconds, they actually come in bursts—a metric buttload pops in, a bunch fail to connect (probably because of some kernel limit) and all goes quiet for maybe half a minute before it starts up again. Had it actually limited the requests to one every two seconds (or even one per second) I probably wouldn't mind as much.
As it was though, quite a large number of the requests were malformed—it wasn't handling relative links properly, so I can only conclude it was written by the same set of geniuses that wrote the MJ12Bot.
Sigh.
On the plus side, it did reveal a small bug in the codebase, allowing some of the malformed requests to be successful when they shouldn't have been.
Well, the bugs start coming and they don't stop coming
As soon as I find one bug, I find another one.
In an unrelated program.
Sigh.
In this case, the bug was in my gopher server, or rather, the custom module for the gopher server that serves up my blog entries. Earlier this month, I rewrote parts of that module to convert HTML to text and I munged the part that serves up ancillary files like images. I found the bug as I was pulling links for the previous entry when I came across this entry from last year about the horrible job Lynx was doing converting HTML to text. In that post, I wrote what I would like to see and I decided to check how good of a job I did.
It's pretty much spot on, but I for some reason decided to view the image on that entry (via gopher), and that's when I found the bug.
The image never downloaded because the coroutine handling the request crashed, which triggered a call to assert(), causing the server to stop running. Oops.
The root cause was I forgot to prepend the storage path to the ancillary files. And with that out of the way …
No, seriously, the bugs start coming and they don't stop coming
One of the links on this entry won't work on gopher or Gemini.
That's because I haven't implemented a very important feature from the web version of my blog—linking to an arbitrary portion of time!
I don't even think that link will work on gopher or Gemini either,
because of the way I “translate” links when generating the text from HTML.
HTML-to-text translation is hard, let's go shopping—oh, wait … I can't because of COVID-19!
Sigh.
Update a few moments later …
Not all links break on Gemini.
Saturday, July 25, 2020
Please make the bugs stop
Let's see … there was the bug in my Gemini server, two bugs in a custom module for my gopher server, so it's little surprise to me to find a bug in a third server I wrote—this time my “Quote of the Day” server (no public repository for that one, but the service is easy to implement).
This bug was a simple "off-by-one" error—when I got to the last quote, the program should have wrapped around to restart with the first quote, only that code was "off-by-one" (thus the name). It took a while to hit only because it required serving up 4,187 quotes before restarting, and it's not like the service gets hit all that often.
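The wrap-around is the classic place for this kind of bug in 1-based Lua. A correct version looks something like this (a sketch, not the actual server code):

local quotes = { "first quote", "second quote", "last quote" }
local index  = 0

local function next_quote()
  index = index % #quotes + 1   -- wraps from the last quote back to the first
  return quotes[index]
end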
I'd say “I think that's it for now,” but I'm worried that would jinx me for a slew of new bugs to show up, so I won't say it.
Bad bots, bad bots, whatcha gonna do? Whatcha gonna do when they contact you?
I finally found a way to contact the party responsible for slamming my Gemini server with requests. Hopefully I'll hear a response back, or at the very least, the behavior of the bad bot will be silently fixed.
I can only hope.
Tuesday, July 28, 2020
All is silent on the bad Gemini bots
On Saturday, I sent a message to the party responsible for slamming my Gemini server (one among several) and I've yet to receive any response. I removed the block from the firewall, and I haven't seen any requests from said bot. It looks to have been a one-off thing at this time.
Weird.
But then again, this is the Intarwebs, where weird things happen all the time.
At this point, I'm hoping it was fixed silently and it won't be an issue again.
Wednesday, July 29, 2020
I can't believe I didn't think of that sooner
Last week I was tasked with running the regression test for “Project: Sippy-Cup” and figuring out any issues. Trying to figure out the issues was a bit harder than expected. I had my suspicions, but the output wasn't quite conducive to seeing the overall picture. The output was almost, but not quite, valid Lua. If it was valid Lua, I could load the data and write some code to verify my hypothesis, but alas, I had to write code to massage the output into a form that could be loaded.
What a drag.
I was able to prove my hypothesis (some contradictory features were enabled, but it's a “that can't happen in production” type scenario, and if it did happen in production it's of no real consequence). I then adjusted the regression test accordingly.
But afterwards, I adjusted the output slightly to make it valid Lua code. That way, it can be loaded via the Lua parser and further errors can be investigated directly. I'm just a bit surprised that I didn't think of that sooner.
Update on Thursday, July 30TH, 2020
Interfacing with the blackhole of the Intarwebs
Smirk called last night to ask me how I publish my blog to MyFaceMeLinkedBookWeInSpace. I told him I do it by hand. When I post to my blog, I then go to FaceMeLinkedMyBookWeInSpace and manually post the link. It used to be an automatic feature but several years ago MeLinkedMyFaceWeInSpaceBook changed the API. He was curious because he was getting fed up with FaceMeLinkedMySpaceBookWeIn and their censorious ways, and wanted a way to post to both FaceMeLinkedMySpaceBookWeIn and another website. Here's what I found.
First:
The publish_actions permission will be deprecated. This permission granted apps access to publish posts to Facebook as the logged in user. Apps created from today onwards will not have access to this permission. Apps created before today that have been previously approved to request publish_actions can continue to do so until August 1, 2018. No further apps will be approved to use publish_actions via app review. Developers currently utilizing publish_actions are encouraged to switch to Facebook's Share dialogs for web, iOS and Android.
New Facebook Platform Product Changes and Policy Updates
For my use case, I'm still screwed, unless I become an "approved partner":
On August 1st, 2018, the Live API publish_actions permission, which allows an app to publish on behalf of its Users, will be reserved for approved partners. A new permission model that allows apps to publish Videos to their User's Groups and Timeline will be created instead.
New Facebook Platform Product Changes and Policy Updates
So, I'm still screwed.
You can do a “share dialog” which looks like it may work, but … it looks like it may require the use of JavaScript (a non-starter for me, so I'm still screwed) and the user has to be logged into FaceMeLinkedMyBookWeInSpace for it to work (another non-starter for me). This may work for Smirk, and more importantly, it doesn't require a FaceMeLinkedMyBookWeInSpace app to be written.
Then there's this Pages API thing that looks like it could work (not for me, because I don't think my "timeline" counts as a "page"—man, this MyFaceMeLinkedInSpaceBookWe stuff is confusing) but it requires building an app. If what Smirk wants to publish to on FaceMeLinkedMyBookWeInSpace is a "page" then this is probably what he wants. It may look like "Instant Articles" is the way to go, especially since there appear to be plugins for popular web publishing platforms, but the kicker there is: "[t]he final step before going live is to submit 10 complete articles for review by our team." That may work for The National Enquirer or the Weekly World News, but it won't work for me or Smirk.
And that's for getting stuff into MeLinkedMyFaceBookWeInSpace. As far as I can tell, there's no way to get stuff out! And that's probably by design—LinkedMyFaceMeSpaceBookWeIn is the Intarwebs, as far as it's concerned.
Thursday, July 30, 2020
I can't believe I didn't think of that—clarification
Over on MeLinkedInstaMyFaceInGramSpaceBookWe, my friend Brian commented “Cool! (Whatever it was exactly you did!!!)” about my work issue. Rereading that post, I think I can clarify a bit what I did. But first, a disclaimer: I'm not revealing any personal information here as all the data for the regression test is randomly generated. Names, numbers, it's all generated data.
So I ran the test and the output from that is a file that looks like (feature names changed to protect me):
ERR CNAM feature8 failed: wanted "VINCENZA GALJOUR" got ""
testcase = {
  id = "3.0037",
  orig = {
    number = "2012013877",
    person = { business = "-", first = "VINCENZA", name = "VINCENZA GALJOUR", last = "GALJOUR" },
    feature9 = false, cnam = true, extcnam = false, feature4 = true,
    feature10 = false, feature7 = false, feature8 = true,
  },
  term = {
    feature10 = true, feature1 = false, feature2 = false, feature3 = false,
    feature4 = true, feature5 = false, feature6 = false, number = "6012013877",
    feature7 = false, feature8 = false,
  },
}
ERR CNAM feature8 failed: wanted "TERINA SCHUPP" got ""
testcase = {
  id = "3.0039",
  orig = {
    number = "2012013879",
    person = { business = "-", first = "TERINA", name = "TERINA SCHUPP", last = "SCHUPP" },
    feature9 = false, cnam = true, extcnam = false, feature4 = true,
    feature10 = false, feature7 = false, feature8 = true,
  },
  term = {
    feature10 = true, feature1 = false, feature2 = false, feature3 = false,
    feature4 = true, feature5 = false, feature6 = false, number = "6012013879",
    feature7 = false, feature8 = false,
  },
}
Since the regression test is written in Lua, I found it easy to just dump the structure holding the test data to the file, given I already have a function to do so. I also print out what failed just before the data for that particular test case. The code that prints the structure outputs valid Lua code. All I changed was adding an array declaration around the output, turning each error message into a comment, and changing testcase to a valid array index:
testcase = {
  -- ERR CNAM feature8 failed: wanted "VINCENZA GALJOUR" got ""
  [1] = {
    id = "3.0037",
    orig = {
      number = "2012013877",
      person = { business = "-", first = "VINCENZA", name = "VINCENZA GALJOUR", last = "GALJOUR" },
      feature9 = false, cnam = true, extcnam = false, feature4 = true,
      feature10 = false, feature7 = false, feature8 = true,
    },
    term = {
      feature10 = true, feature1 = false, feature2 = false, feature3 = false,
      feature4 = true, feature5 = false, feature6 = false, number = "6012013877",
      feature7 = false, feature8 = false,
    },
  },
  -- ERR CNAM feature8 failed: wanted "TERINA SCHUPP" got ""
  [2] = {
    id = "3.0039",
    orig = {
      number = "2012013879",
      person = { business = "-", first = "TERINA", name = "TERINA SCHUPP", last = "SCHUPP" },
      feature9 = false, cnam = true, extcnam = false, feature4 = true,
      feature10 = false, feature7 = false, feature8 = true,
    },
    term = {
      feature10 = true, feature1 = false, feature2 = false, feature3 = false,
      feature4 = true, feature5 = false, feature6 = false, number = "6012013879",
      feature7 = false, feature8 = false,
    },
  },
}
That way, I can verify my hypothesis with some simple Lua code:
dofile "errorlog.txt" for _,result in ipairs(testcase) do if not (result.feature10 and (result.feature8 or result.feature4)) then print("hypothesis failed") end end