Tuesday, August 08, 2006
A long, slightly rambling, but deeply technical, entry follows, so if you aren't interested in software internationalization, character sets and variable type systems, you might want to skip this entry entirely; you have been warned.
how could Planet RDF do things better?
Danny, you can see the problem above.
Is the Planet RDF code available? I would gladly provide a patch.
The problem is not in your input feed. Your feed contains the following code: ’ (the right single quotation mark, U+2019). Planet RDF converts that into binary, which I will express in hex: xC3A2C280C299. The correct formulation would be xE28099. I've got a good idea about what is going on under the covers, as:
- â => xC3A2
- € => xC280
- ™ => xC299
In other words, for some reason, Planet RDF is effectively doing an iso-8859-1 to utf-8 conversion on data that is already utf-8.
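The double conversion can be reproduced in a few lines. This is only a sketch of the effect using Python's built-in codecs; it makes no claim about how the Planet RDF code is actually written.

```python
# U+2019, the right single quotation mark from the feed.
original = "\u2019"

# The correct UTF-8 encoding.
correct = original.encode("utf-8")

# Misread those UTF-8 bytes as ISO-8859-1, then encode to UTF-8 a second time.
mangled = correct.decode("iso-8859-1").encode("utf-8")

print(correct.hex())  # e28099
print(mangled.hex())  # c3a2c280c299

# Running the mistake backwards recovers the original character.
repaired = mangled.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

Each of the three original bytes becomes a two-byte sequence, which is why one character turns into three on the page.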
It's a mess.
Part of the problem I'm sure is that not many programmers are used to i18n. And part of it is not many programmers are aware of what data is in what format at what part of the processing. And that is related to variable types in programming.
The first problem is i18n. I won't say it's trivial, but it does take some forethought to handle it. Thirty years ago it was simple: you used the character set of whatever country you were in and that was that. Since computers work with numbers, a “character set” is just a mapping of values (say, 1) to a visual representation of a character, a “glyph” (say, “A”). Here in the US, we used ASCII, where in this case, the character “A” is internally represented with the value 65 (and the lower case “a” by a different value, 97, since technically, it is a different glyph). The character “A” in, say, Freedonia might be stored as the value 193.
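To make the "mapping of values to glyphs" idea concrete, here is a small sketch in Python. Freedonia has no codec, so two real character sets (ISO-8859-1 and ISO-8859-5) stand in for it; that substitution is mine, not part of the original example.

```python
# ASCII assigns fixed numbers to its characters.
print(ord("A"))  # 65
print(ord("a"))  # 97

# The same stored value means a different glyph under a different character set.
raw = bytes([193])
latin = raw.decode("iso-8859-1")     # Latin capital A with acute
cyrillic = raw.decode("iso-8859-5")  # Cyrillic capital Es
print(latin, cyrillic)
```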
No problem. Well, unless your software was sold internationally, then you had a problem of handling different character sets. And there are a lot of them.
So a solution (and it's a pretty good solution) is to use some internal representation of every possible character set as the program is running (say, Unicode, which was defined to handle pretty much any written language) and do any conversions on input and output to the locally defined character set (I've found the GNU iconv library to be pretty easy to use and it can handle a ton of different character sets). Yes, there are some really obscure corner cases doing this, but for the most part, it'll handle perhaps 95% of all i18n issues.
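A minimal sketch of that "convert at the edges" approach follows. It uses Python's built-in codecs rather than GNU iconv, and the `convert` function name is hypothetical, chosen only for illustration.

```python
def convert(raw: bytes, source_charset: str, target_charset: str) -> bytes:
    """Decode on input, work in Unicode, encode on output."""
    text = raw.decode(source_charset)   # bytes in the local charset -> Unicode
    # ... all real processing would happen here, on Unicode text ...
    return text.encode(target_charset)  # Unicode -> bytes in the output charset

# Byte 0xA9 is the Cyrillic letter Lje (U+0409) in ISO-8859-5.
cyrillic = bytes([0xA9])
print(convert(cyrillic, "iso-8859-5", "utf-8").hex())  # d089
```

The point of the design is that the charset is named exactly twice, once on the way in and once on the way out, and nothing in between has to care.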
But web based documents (or rather, documents based upon HTML/XML) present a wrinkle. HTML/XML use the “<” character (value of 60) to start a tag, like <P> (in this case, this particular tag denotes the start of a paragraph). But what if the content (that is, the non-tag portion) actually requires a literal “<”? What then? Well, HTML/XML use another character, “&”, to encode characters that you would otherwise be unable to use. So, to get a “<” you would write it (or “encode it”) as &lt;, which means that to use a “&” in your document, you need to write it as &amp;.
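As a small illustration of that escaping rule, Python's standard `html` module does the encoding and decoding; the sample string is made up for the example.

```python
import html

content = "if a < b & b > c"

# Escape the characters that HTML/XML reserve for markup.
escaped = html.escape(content, quote=False)
print(escaped)  # if a &lt; b &amp; b &gt; c

# Decoding the entities gives back the original content.
print(html.unescape(escaped) == content)  # True
```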
But you can use this form, called “entities” (or HTML entities) to include other characters that might otherwise not be part of your local character set, such as the much nicer typographical quotes “” (compare to the computerish ""), or even the star in the middle of Wal★Mart.
Now, most of these entities, like &amp;, are named, such as &ldquo; for the nice looking opening double quotes, or &rarr; for a right arrow. But a lot don't have names, like the star in the middle of Wal★Mart. To get those, you need to use a numeric value, like &#9733;.
But, where do you get these values?
Well, they're Unicode values.
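A quick way to see that the numeric values are just Unicode code points, sketched in Python with the standard `html` module:

```python
import html

star = "\u2605"  # the star in the middle of Wal-Mart's name

# The numeric reference is simply the decimal Unicode code point.
reference = f"&#{ord(star)};"
print(reference)  # &#9733;

# Named and numeric forms both resolve to Unicode characters.
left_quote = html.unescape("&ldquo;")   # U+201C
right_arrow = html.unescape("&rarr;")   # U+2192
decoded_star = html.unescape(reference)
print(decoded_star == star)  # True
```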
So, even if you're writing your page in a Slavic language (using, for instance, the character set defined as ISO-8859-5), the “á” can still be encoded as &#225;, using the Unicode value for “á” (since ISO-8859-5 doesn't contain that glyph). To further confound things (and seeing how, for this example, we're using ISO-8859-5) the character “Љ” can be inserted as a character with value 169 (whereas if viewed as an ISO-8859-1 character, it would be “©”) or as &#1033; (which would work regardless of the character set you are using for your document).
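The two ways of writing “Љ” can be checked with a short Python sketch. The decimal values used here (1033 for “Љ”, 225 for “á”) are the Unicode code points of those characters.

```python
import html

raw = bytes([169])

# The same byte is a different character depending on the declared charset.
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic capital Lje
as_latin = raw.decode("iso-8859-1")     # copyright sign
print(as_cyrillic, as_latin)

# A numeric reference names the Unicode code point, so the charset is irrelevant.
from_reference = html.unescape("&#1033;")
print(from_reference == as_cyrillic)  # True

# Going the other way: a character the target charset lacks can be written
# out as a numeric reference instead.
fallback = "\u0409".encode("iso-8859-1", errors="xmlcharrefreplace")
print(fallback)  # b'&#1033;'
```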
Got it?
If not, you're in good company with lots of other programmers out there. And that's even before we get to the issue of submitting text to a CGI script (I'll get to that later).
And that leads to the second problem, where programmers don't know what's what where in their program. And that's related to variable types. Or rather, the lack of variable types. Or rather, the rather lackadaisical approach to typing most modern computer languages have today (Python? Perl? I'm looking at you).
In today's modern languages, unlike slightly older languages like C or Pascal, you don't really need to declare your variables before you use them; you can just use them. Sure, you can predeclare them (and if you are smart, you do) but you don't have to, and even when you do, you just declare that foobar is a variable, not what it can hold (well, granted, in Perl, you have to tell it if the variable can contain a single value, or a list of values, or a table of values). As Lisp programmers are wont to say: “a variable doesn't have a type, values do.”
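The Lisp saying is easy to demonstrate in Python, where the same name can hold values of unrelated types in turn. This is just an illustration; the `seen` list exists only to record what happened.

```python
seen = []

foobar = 42           # the name holds an integer
seen.append(type(foobar).__name__)

foobar = "forty-two"  # now the same name holds a string
seen.append(type(foobar).__name__)

foobar = [4, 2]       # and now a list
seen.append(type(foobar).__name__)

print(seen)  # ['int', 'str', 'list']
```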
Now, why is that such a problem?
Well, other than muddle headed thinking, there's overhead at runtime in determining if the operation is permitted on the type of the value held in a variable and if not, what can we do with the value in the variable to make it the right type for the operation in question. So, in the following Perl code:
$result = $a + $b;
at run time the computer has to determine the types of the values of the variables $a and $b: if numeric it can just add them; if strings, it has to convert the strings to a numeric value and then add them. And what if $a contains the string “one” and $b contains the string “two”? $result will have the numeric value of 0 because “one” and “two” do not convert to numeric values (the ol' “Garbage in, garbage out” saying in Computer Science; oh, you were expecting maybe “onetwo”? That's a different operator altogether).
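To show the kind of work hidden behind that one `+`, here is a rough Python emulation of the runtime check and conversion. It is a simplified sketch, not Perl's actual algorithm: `to_number` and `perl_add` are hypothetical names, and the pattern handles only plain decimal prefixes.

```python
import re

# Leading numeric prefix: optional sign, then digits with an optional fraction.
_NUMERIC_PREFIX = re.compile(r"\s*[+-]?(?:\d+(?:\.\d*)?|\.\d+)")

def to_number(value):
    """Return numbers unchanged; convert strings by their numeric prefix, else 0."""
    if isinstance(value, (int, float)):
        return value
    match = _NUMERIC_PREFIX.match(str(value))
    return float(match.group(0)) if match else 0

def perl_add(a, b):
    """Check each operand's type at run time, convert if needed, then add."""
    return to_number(a) + to_number(b)

print(perl_add(1, 2))           # 3
print(perl_add("one", "two"))   # 0
print(perl_add("3 apples", 4))  # 7.0
```

Every call pays for the type inspection, whether or not the operands were numbers to begin with.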
But even if we just stick with strings (and HTML processing is a lot of string handling) we're not really out of the woods. Even if we were to use a computer language with stronger typing (say, Ada) that wouldn't help us, because the type systems of any language today just aren't up to it.
For instance, given a simple HTML form that accepts input, we can feed it the following:
“This—is–a★test”
Now, when the data is submitted, the browser will encode the data in yet another encoding scheme (application/x-www-form-urlencoded to be precise), which can contain HTML encoded entities (see above) in any number of character sets (although, unless otherwise stated, in the same character set the browser thinks the page that has the form uses). Using the latest version of Firefox (1.5 as of this writing), if instructed to send the data using charset US-ASCII the data I get is:
%26%238220%3BThis%26%238212%3Bis%26%238211%3Ba%26%239733%3Btest%26%238221%3B
Which, when decoded, is:
&#8220;This&#8212;is&#8211;a&#9733;test&#8221;
But, if the data is sent as charset ISO-8859-1:
%93This%97is%96a%26%239733%3Btest%94
Which, when decoded, is:
“This—is–a&#9733;test”
The quotes and dashes (the “mdash” and “ndash” respectively) are single characters but the star is still sent as an HTML entity. Yet if the browser is told to send the data as UTF-8:
%E2%80%9CThis%E2%80%94is%E2%80%93a%E2%98%85test%E2%80%9D
Which, when decoded, is:
“This—is–a★test”
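The three submissions can be decoded with Python's standard library to show how many layers each one carries. This is a sketch of the decoding side only; one assumption worth flagging is that the single-byte quotes and dashes in the ISO-8859-1 submission (0x93, 0x94, 0x96, 0x97) are really Windows-1252 values, so that codec is used to decode them.

```python
import html
from urllib.parse import unquote, unquote_to_bytes

# The text typed into the form, written with escapes to avoid any ambiguity.
expected = "\u201cThis\u2014is\u2013a\u2605test\u201d"

ascii_form = ("%26%238220%3BThis%26%238212%3Bis%26%238211%3Ba"
              "%26%239733%3Btest%26%238221%3B")
latin1_form = "%93This%97is%96a%26%239733%3Btest%94"
utf8_form = "%E2%80%9CThis%E2%80%94is%E2%80%93a%E2%98%85test%E2%80%9D"

# US-ASCII: undo the URL encoding, then resolve every HTML entity.
ascii_step = unquote(ascii_form, encoding="ascii")
from_ascii = html.unescape(ascii_step)

# ISO-8859-1: undo the URL encoding to bytes, decode them, then resolve
# the one remaining entity (the star).
latin1_step = unquote_to_bytes(latin1_form).decode("cp1252")
from_latin1 = html.unescape(latin1_step)

# UTF-8: undoing the URL encoding is enough.
from_utf8 = unquote(utf8_form, encoding="utf-8")

print(ascii_step)  # &#8220;This&#8212;is&#8211;a&#9733;test&#8221;
print(from_ascii == from_latin1 == from_utf8 == expected)  # True
```

All three arrive at the same text, but only if the receiving code knows which charset and which layers of encoding it was handed.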
So, what does all this have to do with variable types and the weakness of typechecking even in a language like Ada?
Well, if we could declare our variables with not only the type (in this case, “string”) but with additional clarifications on said type (say, “charset ISO-8859-5 with HTML entities” or “charset UTF-8 encoded as application/x-www-form-urlencoded”) it would clarify the data flow through a lot of web based software and prevent stupid mistakes (and yes, it would piss off a lot of muddle-headed programmers but this would at least force them to think for once).
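Something in this direction can be approximated today with Python's `typing.NewType`, provided an external static checker such as mypy is run over the code. This is only a sketch under that assumption: the type names and the two functions are hypothetical, and at run time the types are plain strings with no enforcement.

```python
import html
from typing import NewType
from urllib.parse import unquote_to_bytes

# Distinct names for "the same" string at different stages of decoding.
FormEncoded = NewType("FormEncoded", str)  # application/x-www-form-urlencoded
HtmlText = NewType("HtmlText", str)        # Unicode that may contain entities
PlainText = NewType("PlainText", str)      # Unicode with entities resolved

def form_decode(data: FormEncoded, charset: str) -> HtmlText:
    """Undo the URL encoding and decode the bytes in the stated charset."""
    return HtmlText(unquote_to_bytes(data).decode(charset))

def resolve_entities(text: HtmlText) -> PlainText:
    """Replace HTML entities with the characters they stand for."""
    return PlainText(html.unescape(text))

submitted = FormEncoded("%93This%97is%96a%26%239733%3Btest%94")
text = resolve_entities(form_decode(submitted, "cp1252"))
print(text == "\u201cThis\u2014is\u2013a\u2605test\u201d")  # True

# A static checker would reject resolve_entities(submitted), because a
# FormEncoded value is not an HtmlText value.
```

It still leaves the charset as an ordinary argument rather than part of the type, so it covers only part of what the paragraph above asks for.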
Update a few moments after posting
All this talk about encoding issues, and I still blew it and had to manually fix some encoding issues on this page.
Sigh.