Tuesday, August 08, 2006
A long, slightly rambling, but deeply technical, entry follows, so if you aren't interested in software internationalization, character sets and variable type systems, you might want to skip this entry entirely; you have been warned.
how could Planet RDF do things better?
Danny, you can see the problem above.
Is the Planet RDF code available? I would gladly provide a patch.
The problem is not in your input feed. Your feed contains the following character: ’ (U+2019, the right single quotation mark). Planet RDF converts that into binary, which I will express in hex:
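xC3A2C280C299. The correct formulation would be
xE28099. I've got a good idea about what is going on under the covers, as each of the three correct bytes comes out as a two-byte character of its own:
- â => xC3A2
- € => xC280
- ™ => xC299
In other words, for some reason, Planet RDF is effectively treating UTF-8 data as a single-byte character set and encoding it as UTF-8 a second time.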
It's a mess.
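The byte patterns above are consistent with text being encoded as UTF-8 twice. Here is a minimal sketch in Python that reproduces them. The choice of ISO-8859-1 for the intermediate step is my assumption, and this is an illustration, not Planet RDF's actual code:

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the character from the feed.
quote = "\u2019"

# Correct UTF-8 encoding: three bytes.
correct = quote.encode("utf-8")

# Assumed mistake: treat those UTF-8 bytes as ISO-8859-1 text (one
# character per byte), then encode that text as UTF-8 a second time.
mangled = correct.decode("iso-8859-1").encode("utf-8")

print(correct.hex().upper())   # E28099
print(mangled.hex().upper())   # C3A2C280C299

# The damage is reversible as long as nothing else touched the bytes:
# undo the second encoding, then decode the original UTF-8 properly.
repaired = mangled.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repaired == quote)       # True
```

The point is that nothing in the byte string itself says which character set it is in; the program has to keep track of that.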
Part of the problem I'm sure is that not many programmers are used to i18n. And part of it is not many programmers are aware of what data is in what format at what part of the processing. And that is related to variable types in programming.
The first problem is i18n. I won't say it's trivial, but it does take some forethought to handle it. Thirty years ago it was simple: you used the character set of whatever country you were in and that was that. Since computers work with numbers, a “character set” is just a mapping of values (say, 1) to a visual representation of a character, a “glyph” (say, “A”). Here in the US, we used ASCII, where the character “A” is internally represented with the value 65 (and the lower case “a” by a different value, 97, since technically it is a different glyph). The character “A” in, say, Freedonia might be stored as a different value entirely.
No problem. Well, unless your software was sold internationally; then you had the problem of handling different character sets. And there are a lot of them.
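To make the "mapping of values to glyphs" idea concrete, here is a small sketch in Python. EBCDIC code page 037 stands in for the fictional Freedonian character set; that substitution is mine, chosen only because it is a real character set that stores “A” differently from ASCII:

```python
# In ASCII, "A" and "a" are different characters with different values.
print(ord("A"), ord("a"))                 # 65 97

# EBCDIC (code page 037) assigns "A" a different value than ASCII does.
ebcdic_a = "A".encode("cp037")
print(ebcdic_a[0])                        # 193

# The mapping runs both ways: the same byte value is a different glyph
# depending on which character set you read it with.
as_ebcdic = bytes([193]).decode("cp037")
as_latin1 = bytes([193]).decode("iso-8859-1")
print(as_ebcdic, as_latin1)               # A Á
```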
So a solution (and it's a pretty good solution) is to use a single internal representation that covers every character set while the program is running (say, Unicode, which was defined to handle pretty much any written language) and do any conversions to and from the locally defined character set on input and output (I've found the GNU iconv library to be pretty easy to use and it can handle a ton of different character sets). Yes, there are some really obscure corner cases doing this, but for the most part, it'll handle perhaps 95% of all i18n issues.
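A rough sketch of that "convert at the edges" approach, using Python's built-in codecs rather than iconv itself. The function name `convert` and the sample word are mine, purely for illustration:

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character set to another via Unicode."""
    text = data.decode(source)      # input edge: bytes -> Unicode
    # ... all real processing would happen here, on Unicode text ...
    return text.encode(target)      # output edge: Unicode -> bytes


# A Russian word as it would be stored in ISO-8859-5: one byte per letter.
cyrillic = "Привет".encode("iso-8859-5")

utf8 = convert(cyrillic, "iso-8859-5", "utf-8")
print(len(cyrillic), len(utf8))     # 6 12
print(utf8.decode("utf-8"))         # Привет
```

This is roughly what `iconv -f ISO-8859-5 -t UTF-8` does on the command line. Between the two edges, the program only ever sees Unicode text.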
But web based documents, or rather, documents based upon HTML/XML, present a wrinkle. HTML/XML use the “<” character (value of 60) to start a tag, like <p> (in this case, this particular tag denotes the start of a paragraph). But what if the content (that is, the non-tag portion) actually requires a literal “<”? What then? Well, HTML/XML use another character, “&”, to encode characters that you would otherwise be unable to use. So, to get a “<” you would write it (or “encode it”) as &lt;, which means that to use a “&” in your document, you need to write it as &amp;.
But you can also use this form, called “entities” (or HTML entities), to include other characters that might otherwise not be part of your local character set, such as the much nicer typographical quotes “” (compare to the computerish ""), or even the star in the middle of Wal★Mart.
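The escaping rule for “<” and “&” described above can be sketched with Python's standard `html` module. The sample string is made up. The second half shows why the “&” has to be handled first when doing it by hand:

```python
import html

raw = "AT&T says 1 < 2"

# html.escape rewrites "&" and "<" (and ">") so they survive as content.
escaped = html.escape(raw, quote=False)
print(escaped)                  # AT&amp;T says 1 &lt; 2

# Decoding the entities gives back the original text.
print(html.unescape(escaped))   # AT&T says 1 < 2

# Doing it by hand in the wrong order escapes the "&" of "&lt;" again.
wrong = raw.replace("<", "&lt;").replace("&", "&amp;")
print(wrong)                    # AT&amp;T says 1 &amp;lt; 2
```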
Now, most of these entities, like &amp;, are named, such as &ldquo; for the nice looking opening double quotes, or &rarr; for a right arrow. But a lot don't have names, like the star in the middle of Wal★Mart. To get those, you need to use a numeric value, like &#9733;.
But, where do you get these values?
Well, they're Unicode values.
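A quick way to see that connection, sketched in Python with the standard `html` module. The named entities are just aliases for the same Unicode values a numeric entity spells out:

```python
import html
from html.entities import name2codepoint

# Named entities map to Unicode code points.
print(name2codepoint["amp"])             # 38
print(name2codepoint["ldquo"])           # 8220
print(name2codepoint["rarr"])            # 8594

# The star has no name, so its entity uses the code point directly.
star = ord("\u2605")                     # BLACK STAR
print(star)                              # 9733

# Decimal and hexadecimal numeric entities decode to the same character.
decimal_form = html.unescape("Wal&#9733;Mart")
hex_form = html.unescape("Wal&#x2605;Mart")
print(decimal_form, decimal_form == hex_form)   # Wal★Mart True
```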
So, even if you're writing your page in a Slavic language (using, for instance, the character set defined as ISO-8859-5), the “á” can still be encoded as &#225;, using the Unicode value for “á” (since ISO-8859-5 doesn't contain that glyph). To further confound things (and seeing how, for this example, we're using ISO-8859-5), the character “Љ” can be inserted either as a character with value 169 (whereas if viewed as an ISO-8859-1 character, it would be “©”) or as &#1033; (which would work regardless of the character set you are using for your document).
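The ISO-8859-5 example can be checked with a few lines of Python. The `xmlcharrefreplace` error handler is a Python convenience that writes a numeric entity for any character the target character set lacks; I am using it here only to show the fallback:

```python
import html

# The same byte value, read through two different character sets.
byte = bytes([169])
as_cyrillic = byte.decode("iso-8859-5")
as_latin1 = byte.decode("iso-8859-1")
print(as_cyrillic, as_latin1)            # Љ ©

# The numeric entity uses the Unicode value, whatever the page charset.
print(ord(as_cyrillic))                  # 1033
print(html.unescape("&#1033;"))          # Љ

# "á" has no slot in ISO-8859-5, so it has to travel as an entity.
fallback = "á".encode("iso-8859-5", errors="xmlcharrefreplace")
print(fallback)                          # b'&#225;'
```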
If you didn't follow all of that, you're in good company with lots of other programmers out there. And that's even before we get to the issue of submitting text to a CGI script (I'll get to that later).
And that leads to the second problem, where programmers don't know what's what where in their program. And that's related to variable types. Or rather, the lack of variable types. Or rather, the rather lackadaisical approach to typing most modern computer languages have today (Python? Perl? I'm looking at you).
In modern languages, unlike slightly older languages like C or Pascal, you don't really need to declare your variables before you use them; you can just use them. Sure, you can predeclare them (and if you are smart, you do) but you don't have to, and even when you do, you just declare that foobar is a variable, not what it can hold (well, granted, in Perl, you have to tell it if the variable can contain a single value, or a list of values, or a table of values). As Lisp programmers are wont to say: “a variable doesn't have a type, a value does.”
Now, why is that such a problem?
Well, other than muddle-headed thinking, there's overhead at run time: the computer has to determine whether the operation is permitted on the type of the value held in a variable and, if not, what it can do with that value to make it the right type for the operation in question. So, in the following Perl code:
$result = $a + $b;
at run time the computer has to determine the types of the values of the variables $a and $b: if numeric, it can just add them; if strings, it has to convert the strings to numeric values and then add them. And what if $a contains the string “one” and $b contains the string “two”? $result will have the numeric value of 0 because “one” and “two” do not convert to numeric values (the ol' “Garbage in, garbage out” saying in Computer Science; oh, you were expecting maybe “onetwo”? That's a different operator altogether).
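For comparison, here is a sketch of the same situation in Python. It is also dynamically typed, so the checks still happen at run time, but it refuses to coerce silently. This only illustrates the run-time checking described above; it says nothing about how Perl is implemented:

```python
a, b = "one", "two"

# "+" on two strings concatenates; nothing is converted to a number.
print(a + b)                 # onetwo

# Asking for numbers explicitly fails loudly instead of yielding 0.
try:
    total = int(a) + int(b)
except ValueError as err:
    total = None
    print("cannot add:", err)

# The same name can hold values of different types over its lifetime;
# the type travels with the value and is only checked at run time.
x = "42"
first_type = type(x).__name__
x = int(x)
second_type = type(x).__name__
print(first_type, second_type)   # str int
```

Either way, the decision about what the operation means is made while the program runs, by inspecting the values.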
But even if we just stick with strings (and HTML processing is a lot of string handling) we're not really out of the woods. Even a computer language with stronger typing (say, Ada) wouldn't help us, because the type systems of the languages we have today just aren't up to it.
For instance, given a simple HTML form that accepts input, we can feed it the following:
Now, when the data is submitted, the browser will encode the data in yet another encoding scheme (application/x-www-form-urlencoded, to be precise), which can contain HTML encoded entities (see above) in any number of character sets (although, unless otherwise stated, in the same character set the browser thinks the page that has the form uses). Using the latest version of Firefox (1.5 as of this writing), if instructed to send the data using charset US-ASCII, the data I get is:
Which, when decoded, is:
But, if the data is sent as charset ISO-8859-1:
Which, when decoded, is:
The quotes and dashes (the “mdash” and “ndash” respectively) are single characters but the star is still sent as an HTML entity. Yet if the browser is told to send the data as UTF-8:
Which, when decoded, is:
So, what does all this have to do with variable types and the weakness of typechecking even in a language like Ada?
Well, if we could declare our variables with not only the type (in this case, “string”) but with additional clarifications on said type (say, “charset ISO-8859-5 with HTML entities” or “charset UTF-8 encoded as application/x-www-form-urlencoded”) it would clarify the data flow through a lot of web based software and prevent stupid mistakes (and yes, it would piss off a lot of muddle-headed programmers but this would at least force them to think for once).
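A rough approximation of that idea, sketched in Python with `typing.NewType`. Everything here is hypothetical: the type names, the helper functions, and the sample text are mine, and `browser_encode` only imitates the behaviour described above (characters the charset lacks go out as numeric entities). I use cp1252 as a stand-in for ISO-8859-1, on the assumption that browsers treat that label as windows-1252. The types are enforced only by a static checker such as mypy, not at run time:

```python
import html
from typing import NewType
from urllib.parse import quote_plus, unquote_plus

# Hypothetical types: each one names an encoding state of the same text.
FormEncoded = NewType("FormEncoded", str)   # application/x-www-form-urlencoded
HtmlText = NewType("HtmlText", str)         # Unicode that may hold entities
PlainText = NewType("PlainText", str)       # fully decoded Unicode


def browser_encode(text: str, charset: str) -> FormEncoded:
    """Imitate a form submission in the given charset (an approximation)."""
    raw = text.encode(charset, errors="xmlcharrefreplace")
    return FormEncoded(quote_plus(raw))


def percent_decode(data: FormEncoded, charset: str) -> HtmlText:
    """Undo the URL encoding, interpreting the bytes in the given charset."""
    return HtmlText(unquote_plus(data, encoding=charset))


def entity_decode(data: HtmlText) -> PlainText:
    """Resolve any HTML entities left in the text."""
    return PlainText(html.unescape(data))


# Curly quotes, a star, an em dash and an en dash, written as escapes.
text = "\u201cWal\u2605Mart\u201d \u2014 pages 1\u20132"

results = {}
for charset in ("us-ascii", "cp1252", "utf-8"):
    wire = browser_encode(text, charset)
    results[charset] = wire
    print(charset, wire)
    # Each step has to be undone in order, with the right charset.
    assert entity_decode(percent_decode(wire, charset)) == text
```

With types like these, a checker would flag passing a `FormEncoded` value to `entity_decode` directly, which is exactly the kind of skipped step that produces mangled characters. The charset still rides along as a separate argument here, so this covers only part of what the paragraph above asks for.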
Update a few moments after posting
All this talk about encoding issues, and I still blew it and had to manually fix some encoding issues on this page.