The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, Debtember 04, 2004

Building character

I've been working on mod_blog for the past few days. Bob, who runs the Friday D&D game, has a site that used to use Blogger but due to reliability problems (as well as certain security issues related to FTP) I switched them over to mod_blog.

It's installed, but the input portion is … shall we say … less than user friendly. Frankly, it suits my needs, and it suited the needs of Mark (back when he had a blog) and a variation of it suits Spring, but that's because the three of us know HTML and aren't intimidated by it. But most of the other players don't know HTML, or don't care to type their entries in it, so I've been adding code to allow them to input plain text and have it converted to HTML.

It seems like it would be straightforward, but it isn't. Like I said, I'm comfortable with HTML and the input side of things reflected that comfort. The first problem was adding a text-based (for lack of a better term) plug-in. The second (and more annoying) problem is the utter horror that is Buffers.

In my CGI library (which I've been developing for easily seven years) I have a concept of a “buffer,” which is 1) horribly misnamed (it's not so much a “buffer” as it is a “stream,” either a stream of input or a stream of output) and 2) so buggy as to be nearly useless. The problem stems from my attempts to support both “read” and “write” methods on a given buffer (did I mention it's really a stream and not a buffer?). There's only this vague notion of reading and writing and detecting the end of the stream. I've been having to go in and kludge up fixes for the various buffer modules I have, allowing me to write to a buffer, then back up and start reading from said buffer. And 3) what the XXXX was I thinking when I created LineBuffers? They shouldn't exist, period. I think.

Anyway, it's not pretty, and the fixes are rather ad-hoc and I have to “know” which type of buffer I'm dealing with and what I can and can't do with it. Which defeats the purpose of abstracting things behind a “buffer” in the first place.
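To make the “write, then back up and read” pattern concrete, here is a minimal sketch in Python. It uses Python's standard `io.BytesIO` purely as an analogy; it is not the buffer code from the CGI library, and the function names are made up for illustration.

```python
import io


def read_without_rewind(chunks):
    """Write chunks to a stream, then read without backing up first."""
    stream = io.BytesIO()
    for chunk in chunks:
        stream.write(chunk)
    # The position is still at the end of what was written,
    # so there is nothing left to read.
    return stream.read()


def write_then_read_lines(chunks):
    """Write chunks to a stream, back up to the start, then read lines."""
    stream = io.BytesIO()
    for chunk in chunks:
        stream.write(chunk)
    stream.seek(0)  # "back up" before switching from writing to reading
    return list(stream)  # iterating a BytesIO yields one line at a time


if __name__ == "__main__":
    data = [b"first line\nsec", b"ond line\n"]
    print(read_without_rewind(data))
    print(write_then_read_lines(data))
```

The caller has to know that this particular stream can be rewound at all, which is the sort of knowledge an abstraction is supposed to hide.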

And then there's the third problem. This is a doozy and it affects nearly every other piece of blogging software out there as well. It's the “copy-n-paste” problem (the linked article explains why a certain page appears corrupted, but it ultimately stemmed from a copy-n-paste operation). What exactly is the “copy-n-paste” problem?

It stems from character encodings, and the lack of character encoding information when you copy-n-paste text between applications that have different ideas of which character encoding to expect. For instance, view a Microsoft Word document that uses the character set WINDOWS-1252, then copy text from it into a website that might be expecting ISO-8859-1, UTF-8 or even US-ASCII (like who ever would use US-ASCII? Sheesh!). You know what you can expect?


It really bugs me when I see stuff like “sensorâ€and” on a page.
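One plausible way to end up with text like that can be sketched in a few lines of Python. This assumes the original text contained a curly closing quote stored as UTF-8 and that some program later read those bytes as WINDOWS-1252; the function name is hypothetical, and the real page may have been mangled by a different route.

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as WINDOWS-1252."""
    raw = text.encode("utf-8")
    # Byte 0x9D has no character assigned in WINDOWS-1252, so Python's
    # strict decoder would raise an error; errors="ignore" silently drops
    # such bytes, much as a display might show nothing for them.
    return raw.decode("cp1252", errors="ignore")


if __name__ == "__main__":
    original = "sensor\u201dand"  # U+201D is a right double quotation mark
    print(original.encode("utf-8"))  # the quote becomes bytes E2 80 9D
    print(misread_utf8_as_cp1252(original))
```

The single quote character turns into three bytes, and each byte is then treated as a character of its own.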

I could try to use the ACCEPT-CHARSET attribute of the <FORM> tag in HTML 4.01, but really, that's just a “hint” to the browser on what to send; the browser doesn't actually have to pay attention to it at all. To get around that little problem I'm playing around with GNU's libiconv (a large library to convert from one character encoding scheme to another). You input the text, then I scan it, attempting to classify what character encoding scheme is in use (right now it only detects US-ASCII, ISO-8859-1, WINDOWS-1252 and UTF-8), then convert whatever it finds into Unicode (specifically UCS-4) and then back to US-ASCII, using HTML numeric entities for anything outside of US-ASCII.
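A rough sketch of that classify-then-convert pipeline, written in Python rather than against libiconv. Everything here is an assumption made for illustration: the function names, the order of the checks, and the heuristic that bytes in the 0x80 to 0x9F range mean WINDOWS-1252. It is not the detection code in mod_blog.

```python
# Maps the encoding labels used in the text to Python codec names.
CODECS = {
    "US-ASCII": "ascii",
    "UTF-8": "utf-8",
    "WINDOWS-1252": "cp1252",
    "ISO-8859-1": "latin-1",
}


def classify(raw: bytes) -> str:
    """Guess which of the four encodings the bytes are in (heuristic)."""
    try:
        raw.decode("ascii")
        return "US-ASCII"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # ISO-8859-1 puts control codes in 0x80-0x9F, which rarely occur in
    # real text, so bytes in that range are assumed to be WINDOWS-1252.
    if any(0x80 <= b <= 0x9F for b in raw):
        return "WINDOWS-1252"
    return "ISO-8859-1"


def to_ascii_html(raw: bytes) -> str:
    """Decode to Unicode, then emit US-ASCII with HTML numeric entities."""
    text = raw.decode(CODECS[classify(raw)], errors="replace")
    # "xmlcharrefreplace" writes each non-ASCII character as &#NNNN;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_html(b"sensor\x94and"))           # WINDOWS-1252 input
    print(to_ascii_html("caf\u00e9".encode("utf-8")))  # UTF-8 input
```

Trying UTF-8 before the single-byte encodings matters, because nearly any byte sequence is acceptable as ISO-8859-1 or WINDOWS-1252, while text in those encodings is seldom valid UTF-8 by accident.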

All that, just to support easy text entry into a blog, without having it look like crap in case someone decides to copy-n-paste from some other application.

And that's why adding a text plug-in isn't that straightforward.

Update early Sunday morning, December 5th, 2004

Some more details.

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: Start with the base link for this site:, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2019 by Sean Conner. All Rights Reserved.