Sunday, October 29, 2023
Adventures in Utext
There is one point on the ASCII ↔︎ JS spectrum that I haven’t seen, and it’s one that, as I use Unicode in more complex ways on Gwern.net and have learned how many obscure features or characters Unicode has, I increasingly think has been neglected: only UTF-8 text rendered by a monospace font. Not ASCII, not a weird subset of SGML, not troff, not raw terminal codes, not bitmaps encoded in ASCII—just UTF-8. This document format does only what pure Unicode text can do—but does everything that pure Unicode can do, which turns out to be a lot. What if we take Unicode literally, but not seriously?
Your typical plain text output strips all formatting. At the most ambitious, it might have a Unicode superscript or fraction. But we can do so much more!
Utext: Rich Unicode Documents · Gwern.net
That was an interesting read (your mileage may vary).
To generate the gopher and Gemini versions of my blog,
I parse the HTML and generate either plain text
(for gopher)
or Gemtext for Gemini.
And I'm still not entirely happy with the output.
For emphasized text,
I would translate that to “*emphasized*”,
which is … okay,
I guess?
And deleted text was harder to deal with,
and I ended up with “[DELETED-deleted-DELETED]” text.
There's no excuse for that.
But after reading about Utext, and Unicode's COMBINING SHORT STROKE OVERLAY and COMBINING LOW LINE, I thought I might try using those for some typographical niceties that you don't normally get with plain text. And that's when I learned that not all virtual terminals support all of Unicode all that well. And wrapping text is … not that trivial anymore.
Ah well. For now, it seems to be working, but it remains to be seen if I like the results.
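To give a rough idea of what is involved, here is a minimal sketch in Python. It is not the code that generates this blog; the names overlay, display_width and wrap are made up for illustration. It assumes emphasized text gets COMBINING LOW LINE (U+0332) and deleted text gets COMBINING SHORT STROKE OVERLAY (U+0335) after each character, and that no double-width characters show up. The wrapping part counts columns rather than code points, which is why the usual "length of the string" approach stops working.

```python
import unicodedata

LOW_LINE = "\u0332"      # COMBINING LOW LINE, assumed here for emphasized text
SHORT_STROKE = "\u0335"  # COMBINING SHORT STROKE OVERLAY, assumed for deleted text


def overlay(text, mark):
    """Append a combining mark after every non-space character."""
    return "".join(ch if ch.isspace() else ch + mark for ch in text)


def display_width(text):
    """Count terminal columns, skipping combining marks.

    Assumption: every other character is one column wide (no CJK, no emoji).
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def wrap(text, width):
    """Greedy word wrap that measures words in columns, not code points."""
    lines, current, current_width = [], [], 0
    for word in text.split():
        w = display_width(word)
        if current and current_width + 1 + w > width:
            # The word does not fit; start a new line with it.
            lines.append(" ".join(current))
            current, current_width = [word], w
        else:
            current_width += w + (1 if current else 0)
            current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines


if __name__ == "__main__":
    sample = "This is " + overlay("emphasized", LOW_LINE) + " and this is " \
             + overlay("deleted", SHORT_STROKE) + " text."
    for line in wrap(sample, 30):
        print(line)
```

Whether the result looks right still depends on the terminal and font, since some of them draw the combining marks badly or not at all.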
Update on Friday, December 8th, 2023
I reverted this change due to issues.