The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, January 01, 2011

Code and Data, Theory and Practice

From: Steve Crane <XXXXXXXXXXXXXXXXXXXXX>
To: Sean Conner <sean@conman.org>
Subject: @siwisdom twitter feed
Date: Sat, 1 Jan 2011 15:46:14 +0200

Hi Sean,

Are you aware that the quotation marks in the @siwisdom <http://twitter.com/siwisdom> tweets display as &ldquo; and &rdquo; in clients like TweetDeck? Perhaps you should switch to using regular ASCII double quotes.

Regards and Happy New Year.

Yes, I'm aware. They show up on the main Twitter page as well, and there isn't much I can do about it, other than sticking exclusively with ASCII and forgoing the nice typographic characters. It appears to be related to this rabbit hole, only in a way that's completely different.

What's going on is explained in this thread:

We have to encode HTML entities to prevent XSS attacks. Sorry about the lost characters.

counting messages: characters vs. bytes, HTML entities - Twitter Development Talk | Google Groups
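
A minimal sketch of why the entities show up literally, assuming (and this is a guess, not documented Twitter behavior) that the service escapes every ``&'' in a submitted tweet without checking whether it already starts an entity. The sample tweet and the name blanket_escape are made up for illustration; html.escape and html.unescape come from Python's standard library.

```python
from html import escape, unescape


def blanket_escape(tweet: str) -> str:
    """Escape &, < and > without checking for entities already in the text."""
    return escape(tweet, quote=False)


submitted = "&ldquo;Simplify.&rdquo;"  # hypothetical tweet typed with entities
stored = blanket_escape(submitted)     # what the service would keep
displayed = unescape(stored)           # roughly what a browser shows for it

print(stored)     # &amp;ldquo;Simplify.&amp;rdquo;
print(displayed)  # &ldquo;Simplify.&rdquo;
```

The browser decodes ``&amp;'' back to a plain ``&'', so the reader sees the entity name instead of the curly quote.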

And XSS has nothing to do with attacking one website from another, but everything to do with the proliferation of character encoding schemes and the desire to fling bits of executable code (aka ``Javascript'') along with our bits of non-executable data (aka ``HTML''). The problem is keeping the bits of executable code (aka ``Javascript'') from showing up where they aren't expected.

But in the case of Twitter, I don't think they actually understand how their own stack works. Or they just took the easy way out, so that any of the ``special'' characters in HTML, like ``&'', ``<'' and ``>'', are automatically converted to their HTML entity equivalents ``&amp;'', ``&lt;'' and ``&gt;''. Otherwise, to sanitize the input, they would need to do the following:

  1. get the raw input from the HTML form
  2. convert the input from the transport encoding (usually URL encoding but it could be something else, depending upon the form)
  3. possibly convert the string into a character set the program can work with (say the browser sent the character data in WINDOWS-1251, because Microsoft is like that; convert it to something a bit easier to work with, say, UTF-8)
  4. if HTML is allowed, sanitize the HTML by
    1. removing unsupported or dangerous tags, like <SCRIPT>, <EMBED> and <OBJECT>
    2. removing dangerous attributes like STYLE or ONMOUSEOVER
    3. checking remaining attributes (like HREF) for dangerous content (like javascript:alert('1 h@v3 h@cxx0r3d ur c0mput3r!!!!!!!11111'))
  5. escape the data to work with it properly (or else face the wrath of Little Bobby Tables' mother).

Fail to do any of those steps, and well … “1 h@v3 h@cxx0r3d ur c0mput3r!!!!!!!11111” And besides, I'm probably missing some sanitizing step somewhere.
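
The steps above can be sketched roughly as follows, using only Python's standard library. This is an illustration under stated assumptions, not how Twitter or any particular site does it. The names (Sanitizer, sanitize_form_field, store), the allowed tags, and the posts table are all hypothetical. It also uses a whitelist (keep only known-safe tags and attributes) rather than the remove-the-dangerous-ones approach in the list, since a whitelist is easier to get right in a short example.

```python
import sqlite3
from html import escape
from html.parser import HTMLParser
from urllib.parse import parse_qs

ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p"}  # hypothetical whitelist
ALLOWED_ATTRS = {"a": {"href"}}                      # attributes kept per tag
SAFE_SCHEMES = ("http://", "https://", "mailto:")
DROP_CONTENT = {"script", "style"}                   # drop tag and its contents


class Sanitizer(HTMLParser):
    """Step 4: rebuild the HTML, keeping only whitelisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # depth inside elements whose content is discarded

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if self.skip or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # drops STYLE, ONMOUSEOVER and friends
            value = (value or "").strip()
            if name == "href" and not value.lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other odd schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif not self.skip and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def sanitize_form_field(raw_body: bytes, field: str, charset: str = "cp1251") -> str:
    # Steps 1-3: take the raw URL-encoded body, undo the transport encoding,
    # and decode the bytes from the sender's character set into a Python str.
    fields = parse_qs(raw_body.decode("ascii"), encoding=charset, errors="replace")
    text = fields.get(field, [""])[0]
    # Step 4: sanitize the HTML.
    parser = Sanitizer()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


def store(conn: sqlite3.Connection, text: str) -> None:
    # Step 5: let the driver do the escaping via a parameterized query.
    conn.execute("INSERT INTO posts (body) VALUES (?)", (text,))


body = b"msg=hi+%3Cscript%3Ealert(1)%3C%2Fscript%3E%3Cb+onmouseover%3D%22x()%22%3Ethere%3C%2Fb%3E"
print(sanitize_form_field(body, "msg"))  # hi <b>there</b>
```

Even this leaves things out (malformed markup, length limits, whatever the output page's own encoding is), which is rather the point.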

Now, I could convert the input I give to Twitter to UTF-8 and avoid HTML entities entirely, but then I would have to convert my blog engine to UTF-8 (because I display my Twitter feed in the sidebar) and while it may work just fine with UTF-8, I haven't tested it with UTF-8 data. And I would prefer to keep it in US-ASCII to avoid any nasty surprises.

Besides, I shouldn't have to do this, because that's why HTML entities were designed in the first place—as a way of presenting characters when a character set doesn't support those characters!
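
As a small illustration of that design, here is a sketch that keeps text in pure US-ASCII by replacing every other character with an HTML entity. The function name to_ascii_entities is made up for this example; codepoint2name is the HTML 4 entity table from Python's standard library, and characters without a named entity fall back to a numeric reference. It assumes the input has no markup of its own, so it does not touch ``&'', ``<'' or ``>''.

```python
from html import unescape
from html.entities import codepoint2name


def to_ascii_entities(text: str) -> str:
    """Replace each non-ASCII character with a named or numeric HTML entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                         # plain ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. &ldquo;
        else:
            out.append(f"&#{cp};")                 # numeric fallback
    return "".join(out)


encoded = to_ascii_entities("\u201cGo figure.\u201d")
print(encoded)  # &ldquo;Go figure.&rdquo;
```

The result survives any ASCII-only pipeline, and a browser (or html.unescape) turns it back into the original characters.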

Hey—wait a second … what's this river doing here?


Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, which makes the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2019 by Sean Conner. All Rights Reserved.