Saturday, Debtember 04, 2004
Building character
I've been working on mod_blog
for the past few days.
Bob, who runs the Friday D&D game, has a site that used to use Blogger, but due to
reliability problems (as well as certain security issues related to FTP) I switched them over to
mod_blog.
It's installed, but the input portion is … shall we say … less than user friendly. Frankly, it suits my needs, and it suited Mark's needs (back when he had a blog) and a variation of it suits Spring, but that's because the three of us know HTML and aren't intimidated by it. But most of the other players don't know how, or don't care, to type their entries in HTML, so I've been adding code to allow them to input plain text and have it converted to HTML.
It seems like it would be straightforward, but it isn't. Like I said,
I'm comfortable with HTML and the input side of things reflected that
comfort. The first problem was adding a text-based (for lack of a better
term) plug-in. The second (and more annoying) problem is the utter horror
that are Buffers.
In my CGI library
(which I've been developing for easily seven years) I have a concept of a
“buffer,” which is 1) horribly misnamed—it's not so much a “buffer” as
it is a “stream”—either a stream of input or a stream of output and 2)
is so buggy as to be nearly useless. The problem stems from my attempts to
support both “read” and “write” methods on a given buffer (did I mention
it's really a stream and not a buffer?) and that's why they're so buggy as
to be nearly useless. There's only this vague notion of reading and writing
and detecting the end of the stream. I've been having to go in kludging up
fixes for the various buffer modules I have, allowing me to write to a
buffer, then back up and start reading from said buffer. And 3) what the
XXXX was I thinking when I created LineBuffers? They shouldn't exist, period.
I think.
Anyway, it's not pretty, and the fixes are rather ad-hoc and I have to “know” which type of buffer I'm dealing with and what I can and can't do with it. Which defeats the purpose of abstracting things behind a “buffer” in the first place.
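For comparison, here's roughly what "write to it, back up, then read from it" looks like when one stream type handles both directions. This is a minimal Python sketch using the standard io.BytesIO, not the CGI library discussed here; the function name write_then_read_lines is made up for illustration.

```python
import io


def write_then_read_lines(chunks):
    """Write byte chunks into one in-memory stream, rewind, read it back as lines.

    Hypothetical helper for illustration only; it is not part of any
    library mentioned in this post.
    """
    stream = io.BytesIO()
    for chunk in chunks:
        stream.write(chunk)          # "write" side of the stream

    stream.seek(0)                   # back up to the start before reading

    # Iterating a file-like object yields one line at a time, so no
    # separate line-oriented buffer type is needed.
    lines = [line.rstrip(b"\n") for line in stream]

    # A read at end-of-stream returns an empty bytes object.
    at_end = stream.read() == b""
    return lines, at_end


lines, at_end = write_then_read_lines([b"first li", b"ne\nsecond line\n"])
print(lines, at_end)
```

The caller never has to "know" what kind of buffer sits underneath: writing, rewinding, reading and hitting the end all go through the same small interface.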
And then there's the third problem. This is a doozy and it affects nearly every other blogging software out there as well. It's the “copy-n-paste” problem (the linked article explains why a certain page appears corrupted, but it ultimately stemmed from a copy-n-paste operation). What exactly is the “copy-n-paste” problem?
It stems from character encodings and the lack of character encoding
information when you copy-n-paste text between applications that have
different ideas of what character encoding to expect. For instance, take a
Microsoft Word document that uses the character set WINDOWS-1252, and copy
that into a website that might be expecting ISO-8859-1, UTF-8 or even
US-ASCII (like who ever would use US-ASCII? Sheesh!). You know what you can
expect?
Γªρßåγ€
It really bugs me when I see stuff like “sensorâ€and” on a page.
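The “â€” debris is what you get when bytes written in one encoding are read back in another. A small Python sketch shows the effect in both directions; the helper name misread and the sample text are made up for illustration.

```python
def misread(text: str, sent_as: str, read_as: str) -> str:
    """Encode text one way and decode it another, as a mismatched paste would.

    Hypothetical helper for illustration; bytes that are invalid in the
    reader's encoding become U+FFFD (the replacement character).
    """
    return text.encode(sent_as).decode(read_as, errors="replace")


pasted = "it\u2019s"  # "it's" with a typographic apostrophe (U+2019)

# UTF-8 bytes (E2 80 99) read as WINDOWS-1252: one character becomes three.
print(misread(pasted, "utf-8", "cp1252"))

# WINDOWS-1252 byte 0x92 read as UTF-8: not a valid sequence, so it is replaced.
print(misread(pasted, "cp1252", "utf-8"))

# Plain US-ASCII survives either way, which is why the bug hides so well.
print(misread("plain", "utf-8", "cp1252"))
```

Nothing in the bytes themselves says which reading is the right one, and that is the whole problem.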
I could try to use the ACCEPT-CHARSET attribute of the <FORM> tag in HTML
4.01, but really, that's just a “hint” to the browser on what to send; it
doesn't actually have to pay attention to that at all. To get around that
little problem I'm playing around with GNU's libiconv (a large library to
convert from one character encoding scheme to another) in an attempt to
prevent this problem. You input the text, then I scan it, attempting to
classify which character encoding scheme is in use (right now it only
detects US-ASCII, ISO-8859-1, WINDOWS-1252 and UTF-8), then convert whatever
it finds into Unicode (specifically UCS-4) and then back to US-ASCII, using
HTML numeric entities for anything outside of US-ASCII.
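The steps above (classify, convert to Unicode, emit US-ASCII with numeric entities) can be sketched in a few lines of Python. This is not the mod_blog code and it doesn't use libiconv; Python's built-in codecs stand in for it, and the detection heuristic (treat bytes 0x80 through 0x9F as a sign of WINDOWS-1252) is an assumption of this sketch, as are the function names.

```python
def guess_encoding(raw: bytes) -> str:
    """Classify raw input as one of four encodings (heuristic, assumed here)."""
    if all(b < 0x80 for b in raw):
        return "us-ascii"
    try:
        raw.decode("utf-8")          # strict decode; fails on invalid UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 0x80-0x9F are control codes in ISO-8859-1 but printable characters
    # (curly quotes, dashes, the euro sign) in WINDOWS-1252.
    if any(0x80 <= b <= 0x9F for b in raw):
        return "windows-1252"
    return "iso-8859-1"


def to_ascii_html(raw: bytes) -> str:
    """Decode to Unicode, then re-encode as US-ASCII with numeric entities."""
    text = raw.decode(guess_encoding(raw), errors="replace")
    # 'xmlcharrefreplace' turns each non-ASCII character into &#NNNN;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_html(b"it\x92s"))                    # WINDOWS-1252 input
print(to_ascii_html("it\u2019s".encode("utf-8")))   # UTF-8 input
print(to_ascii_html(b"caf\xe9"))                    # ISO-8859-1 input
```

Python's str plays the role of the intermediate Unicode form here. The guess can still be wrong (short ISO-8859-1 text can happen to be valid UTF-8), so this is a best effort rather than a guarantee.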
All that, just to support easy text entry into a blog, without having it look like crap in case someone decides to copy-n-paste from some other application.
And that's why adding a text plug-in isn't that straightforward.