Building character

Saturday, Debtember 04, 2004

I've been working on mod_blog for the past few days. Bob, who runs the Friday D&D game, has a site that used to use Blogger but due to reliability problems (as well as certain security issues related to FTP) I switched them over to mod_blog.

It's installed, but the input portion is … shall we say … less than user friendly. Frankly, it suits my needs, and it suited the needs of Mark (back when he had a blog) and a variation of it suits Spring, but that's because the three of us know HTML and aren't intimidated by it. But most of the other players either don't know HTML or don't care to type their entries in it, so I've been adding code to allow them to input plain text and have it converted to HTML.

It seems like it would be straightforward, but it isn't. Like I said, I'm comfortable with HTML and the input side of things reflected that comfort. The first problem was adding a text-based (for lack of a better term) plug-in. The second (and more annoying) problem is the utter horror that is Buffers.

In my CGI library (which I've been developing for easily seven years) I have a concept of a “buffer,” which 1) is horribly misnamed (it's not so much a “buffer” as it is a “stream,” either a stream of input or a stream of output) and 2) is so buggy as to be nearly useless. The bugs stem from my attempts to support both “read” and “write” methods on a given buffer (did I mention it's really a stream and not a buffer?). There's only this vague notion of reading and writing and detecting the end of the stream. I've been having to go in and kludge up fixes for the various buffer modules I have, allowing me to write to a buffer, then back up and start reading from said buffer. And 3) what the XXXX was I thinking when I created LineBuffers? They shouldn't exist, period. I think.

Anyway, it's not pretty, and the fixes are rather ad-hoc and I have to “know” which type of buffer I'm dealing with and what I can and can't do with it. Which defeats the purpose of abstracting things behind a “buffer” in the first place.

And then there's the third problem. This is a doozy and it affects nearly every other blogging software out there as well. It's the “copy-n-paste” problem (the linked article explains why a certain page appears corrupted, but it ultimately stemmed from a copy-n-paste operation). What exactly is the “copy-n-paste” problem?

It stems from character encodings and the lack of character encoding information when you copy-n-paste text between applications that have different ideas of which character encoding to expect. For instance, take a Microsoft Word document that uses the character set WINDOWS-1252 and copy it into a website that might be expecting ISO-8859-1, UTF-8 or even US-ASCII (like who ever would use US-ASCII? Sheesh!). You know what you can expect?

Γªρßåγ€

It really bugs me when I see stuff like “sensorâ€and” on a page.
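Here's a minimal sketch of one way that kind of garbage can come about. It's in Python, not code from mod_blog, and the sample string and variable names are made up for illustration. It assumes the text left one application as UTF-8 and was read by another as WINDOWS-1252; other mismatches give different garbage.

```python
# A left curly quote (U+201C), the kind of "smart quote" a word
# processor likes to substitute for a plain ASCII quote.
original = "sensor\u201cand"

# The bytes a UTF-8 application would hand over: the quote becomes
# the three bytes E2 80 9C.
utf8_bytes = original.encode("utf-8")

# A receiver that assumes WINDOWS-1252 treats each of those three
# bytes as a separate character.
garbled = utf8_bytes.decode("cp1252")

print(garbled)
```

One character went in and three came out, and nothing in the bytes themselves says which reading was intended.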

I could try to use the ACCEPT-CHARSET attribute of the <FORM> tag in HTML 4.01, but really, that's just a “hint” to the browser on what to send; the browser doesn't actually have to pay attention to it at all. To get around that little problem I'm playing around with GNU's libiconv (a large library to convert from one character encoding scheme to another). You input the text, then I scan it, attempting to classify which character encoding scheme is in use (right now it only detects US-ASCII, ISO-8859-1, WINDOWS-1252 and UTF-8), then convert whatever I find into Unicode (specifically UCS-4), then back to US-ASCII, using HTML numeric entities for anything outside of US-ASCII.
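The classify-then-convert idea can be sketched roughly like this. To be clear, this is a Python illustration and not the libiconv-based code; the function names `classify` and `to_ascii_html` are hypothetical, and the detection rules are a guess at a reasonable heuristic, not necessarily the ones mod_blog uses. Python's own string type stands in for the UCS-4 intermediate step.

```python
def classify(raw: bytes) -> str:
    """Guess which of the four encodings the bytes are in (heuristic)."""
    # Nothing above 0x7F: plain US-ASCII.
    if all(b < 0x80 for b in raw):
        return "us-ascii"
    # Text that decodes cleanly as UTF-8 is very unlikely to be anything else.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Bytes 0x80-0x9F are control codes in ISO-8859-1 but printable
    # characters (curly quotes, dashes, ...) in WINDOWS-1252.
    if any(0x80 <= b <= 0x9F for b in raw):
        return "cp1252"
    return "iso-8859-1"


def to_ascii_html(raw: bytes) -> str:
    """Decode with the guessed encoding, then emit pure US-ASCII,
    using numeric entities for everything outside of it."""
    # Bytes with no meaning in the guessed encoding become U+FFFD.
    text = raw.decode(classify(raw), errors="replace")
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# The same curly quote, pasted in as WINDOWS-1252 and as UTF-8.
print(to_ascii_html(b"sensor\x93and"))
print(to_ascii_html("sensor\u201cand".encode("utf-8")))
```

Both calls should come out as the same ASCII text with the quote written as a numeric entity. Note that this sketch only handles the encoding; it doesn't escape characters like `<` or `&` that are already ASCII.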

All that, just to support easy text entry into a blog, without having it look like crap in case someone decides to copy-n-paste from some other application.

And that's why adding a text plug-in isn't that straightforward.

Update early Sunday morning, December 5th, 2004

Some more details.

The Boston Diaries
