The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, January 01, 2011

What? Not another 365 days …


Even though we still don't have flying cars, and we never did go back to Jupiter (heck, we never got there in the first place), maybe, just maybe, things will improve in 2011.

And remember, only 721 or 723 days left until The End Of the World As We Know It (but I feel fine).

(This post brought to you by EPS. Had this been an actual post, it most likely would have been exceedingly late)

Code and Data, Theory and Practice

Sean Conner <>
@siwisdom twitter feed
Sat, 1 Jan 2011 15:46:14 +0200

Hi Sean,

Are you aware that the quotation marks in the @siwisdom tweets display as &ldquo; and &rdquo; in clients like TweetDeck? Perhaps you should switch to using regular ASCII double quotes.

Regards and Happy New Year.

Yes, I'm aware. They show up on the main Twitter page as well, and there isn't much I can do about it, other than sticking exclusively with ASCII and forgoing the nice typographic characters. It appears to be related to this rabbit hole, only in a way that's completely different.

What's going on here is explained here:

We have to encode HTML entities to prevent XSS attacks. Sorry about the lost characters.

counting messages: characters vs. bytes, HTML entities - Twitter Development Talk | Google Groups

And XSS has nothing to do with attacking one website from another, but everything to do with the proliferation of character encoding schemes and the desire to fling bits of executable code (aka ``Javascript'') along with our bits of non-executable data (aka ``HTML''). The problem is keeping the bits of executable code (aka ``Javascript'') from showing up where it isn't expected.

But in the case of Twitter, I don't think they actually understand how their own stack works. Or they just took the easy way out and any of the ``special'' characters in HTML, like ``&'', ``<'' and ``>'' are automatically converted to their HTML entity equivalents ``&amp;'', ``&lt;'' and ``&gt;''. Otherwise, to sanitize the input, they would need to do the following:

  1. get the raw input from the HTML form
  2. convert the input from the transport encoding (usually URL encoding but it could be something else, depending upon the form)
  3. possibly convert the string into a workable character set the program understands (say, the browser sent the character data in WINDOWS-1251, because Microsoft is like that, to something a bit easier to work with, say, UTF-8)
  4. if HTML is allowed, sanitize the HTML by
    1. removing unsupported or dangerous tags, like <SCRIPT>, <EMBED> and <OBJECT>
    2. removing dangerous attributes like STYLE or ONMOUSEOVER
    3. check remaining attributes (like HREF) for dangerous content (like javascript:alert('1 h@v3 h@cxx0r3d ur c0mput3r!!!!!!!11111'))
  5. escape the data to work with it properly (or else face the wrath of Little Bobby Tables' mother).

Fail to do any of those steps, and well … “1 h@v3 h@cxx0r3d ur c0mput3r!!!!!!!11111” And besides, I'm probably missing some sanitizing step somewhere.
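As a rough illustration, here is a minimal Python sketch of steps 2 and 3 from the list above, plus the ``easy way out'' of converting every special character. It does none of the HTML sanitizing from step 4. The function names and sample inputs are made up for this example, and WINDOWS-1251 is assumed as the browser's character set only because the list mentions it. The second function also shows what blanket escaping does to text that already contains an entity such as &ldquo;: the leading ``&'' gets escaped again, so the entity name ends up displayed literally.

```python
import html
from urllib.parse import unquote_to_bytes


def decode_form_value(raw: str, charset: str = "windows-1251") -> str:
    """Steps 2 and 3: undo URL encoding, then convert the bytes to text."""
    # In form data a '+' stands for a space; everything else is %XX escapes.
    octets = unquote_to_bytes(raw.replace("+", " "))
    # Interpret the bytes in the character set the browser (supposedly) used.
    return octets.decode(charset)


def blanket_escape(text: str) -> str:
    """The easy way out: turn &, <, > and quotes into HTML entities."""
    return html.escape(text, quote=True)


if __name__ == "__main__":
    print(decode_form_value("%CF%F0%E8%E2%E5%F2+%3Cb%3E"))
    print(blanket_escape("<b>&ldquo;hi&rdquo;</b>"))
```

Feeding ``&ldquo;hi&rdquo;'' through blanket_escape() yields ``&amp;ldquo;hi&amp;rdquo;'', which a browser then renders as the literal text of the entity names.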

Now, I could convert the input I give to Twitter to UTF-8 and avoid HTML entities entirely, but then I would have to convert my blog engine to UTF-8 (because I display my Twitter feed in the sidebar) and while it may work just fine with UTF-8, I haven't tested it with UTF-8 data. And I would prefer to keep it in US-ASCII to avoid any nasty surprises.

Besides, I shouldn't have to do this, because that's why HTML entities were designed in the first place—as a way of presenting characters when a character set doesn't support those characters!
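To make that concrete, here is a small Python sketch of keeping text in pure US-ASCII by replacing anything outside that range with an HTML entity. The function name is invented for the example. It uses the named entity when the standard library knows one and falls back to a numeric character reference otherwise. It deliberately leaves ASCII characters such as ``&'' and ``<'' alone, so it is not a sanitizer.

```python
import html
from html.entities import codepoint2name


def to_ascii_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                        # plain US-ASCII, keep as-is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity, e.g. &ldquo;
        else:
            out.append(f"&#{cp};")                # numeric character reference
    return "".join(out)


if __name__ == "__main__":
    encoded = to_ascii_entities("\u201cHi\u201d")
    print(encoded)
    # html.unescape() reverses the conversion.
    print(html.unescape(encoded) == "\u201cHi\u201d")
```

The result is still valid US-ASCII, and any HTML-aware consumer that decodes entities exactly once gets the typographic quotes back.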

Hey—wait a second … what's this river doing here?

Sunday, January 02, 2011

The inexorable march of technology

Sean Conner <>
Re: @siwisdom twitter feed
Sun, 2 Jan 2011 10:53:12 +0200

Hi Sean,

I see you blogged about this, stating that the problem occurs in Twitter's own web client too. I found that interesting as I don't see the problem in the Twitter web client. As shown in this screenshot it all looks fine.

[SiWisdom on Chrome]

I'm using Google Chrome and I wondered if the browser may be playing some part so I opened the same page in Internet Explorer, which I only use under duress, and what a difference.

[SiWisdom on Crack—I mean on IE]

There's the problem, as large as life, but the UI looks different. I recall that I opted to use the new UI a while ago and, not being logged in, Twitter must be showing me the old UI, presumably still the default. So I log in and voila, the same UI I see in Chrome, with all characters represented correctly.

[SiWisdom strung out—I mean on IE]

So we can conclude from this that there was indeed a problem; it was Twitter's fault; and they have addressed it in their new UI. This leaves only third party clients to fXXX things up for you.


Oh, now he tells me, after I made sure mod_blog works with UTF-8 (it handled a post written in a Runic alphabet just fine), reconfigured the webserver to serve up the appropriate headers and converted the data file for Silicone Wisdom to UTF-8.

Okay, it didn't take me that long, and as my friend Jeff said:

It is no longer an ASCII world, it's a Unicode one. I have run into this a lot, and anything that uses XML, like a Twitter feed, is going to be encoded into UTF-8. It is profoundly annoying, but spreading.

And progress marches on …

Update a few seconds later …

The automatic posting to MyFaceSpaceBook revealed a similar issue. It is to laugh.

Monday, January 03, 2011

UTF-8 is hard. Let's write device drivers!

Nice. I switch my Stupid Twitter Trick to use UTF-8, and this is what I get for my trouble:

HTTP/1.1 401 Unauthorized
Date: Mon, 03 Jan 2011 12:29:02 GMT
Server: hi
Status: 401 Unauthorized
WWW-Authenticate: Basic realm="Twitter API"
Content-Type: application/json; charset=utf-8
Content-Length: 70
Cache-Control: no-cache, max-age=300
Expires: Mon, 03 Jan 2011 12:34:02 GMT
Vary: Accept-Encoding
Connection: close

{"request":"\/1\/statuses\/update.json","error":"Incorrect signature"}

No UTF-8 characters, posts fine. Any UTF-8 characters, I get this. I think the problem is that the character set isn't being sent along with the post and I don't have a clue about how that's done. Nor do I think there's an actual standard for this (I don't recall ever coming across one).
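One place where non-ASCII text can plausibly break a signed request is the percent-encoding step of OAuth 1.0 signing. This is an assumption, and nothing above confirms it is the cause here. RFC 5849 has the signer encode text as UTF-8 octets first and then percent-encode every octet outside the unreserved set (letters, digits, ``-'', ``.'', ``_'' and ``~''). If the client percent-encodes bytes in some other character set, the two sides compute signatures over different strings. A minimal Python sketch, with an invented function name:

```python
from urllib.parse import quote


def oauth_percent_encode(text: str, charset: str = "utf-8") -> str:
    """Percent-encode text the way RFC 5849 describes (UTF-8 by default)."""
    # safe="" means only unreserved characters are left unescaped.
    return quote(text.encode(charset), safe="")


if __name__ == "__main__":
    status = "\u201cHi\u201d"
    print(oauth_percent_encode(status))            # UTF-8 octets
    print(oauth_percent_encode(status, "cp1252"))  # same text, different bytes
```

Pure ASCII text encodes identically under both character sets, which would fit the symptom of ASCII-only posts going through while anything else fails. That is still only a guess.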


This UTF-8 stuff is hard. No, I mean hard. I wish I was making that up.

Tuesday, January 04, 2011

It's hard to break a program when the network keeps breaking

One of my jobs at The Corporation has been to load test a phone network service by simulating a ton of calls and see where our service breaks. Which means I get to write code to simulate a phone network initiating calls. It's not easy, and no, it's not entirely related to The Protocol Stack From Hell™ (but don't get me wrong—there's still plenty of blame to go around).

Problem the first: generating a given load, a measured set of packets per second. It's harder than it sounds. Under the operating system we're using, the smallest unit we can reliably pause is 1/100 of a second. If I want to send out 500 messages per second, the best I can do is 5 packets every 0.01 seconds, which isn't the same as one packet every 0.002 seconds (even though it averages out). The end result tends towards bursty traffic (that is, if I attempt to control the rate; if I don't bother with that, I tend to break the phone network connection—more on that below).
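The arithmetic behind the burstiness can be sketched in a few lines of Python. This is only an illustration: the function name is invented, it assumes a fixed number of timer ticks per second (100, to match the 1/100 of a second pause above), and it does no actual sending or sleeping. It just works out how many packets have to go out on each tick to hit a target rate, carrying any fractional remainder over to the next tick.

```python
from typing import List


def packets_per_tick(rate_per_sec: int, ticks_per_sec: int, n_ticks: int) -> List[int]:
    """How many packets to send on each of the next n_ticks timer ticks."""
    schedule = []
    owed = 0  # packets owed so far, in units of 1/ticks_per_sec of a packet
    for _ in range(n_ticks):
        owed += rate_per_sec
        # Send the whole packets now; keep the fraction for the next tick.
        send_now, owed = divmod(owed, ticks_per_sec)
        schedule.append(send_now)
    return schedule


if __name__ == "__main__":
    print(packets_per_tick(500, 100, 4))
    print(packets_per_tick(250, 100, 4))
```

At 500 messages per second every tick carries a burst of five packets. At 250 per second the bursts alternate between two and three packets. Either way the packets leave in clumps instead of being evenly spaced.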

Sure, there's some form of congestion control in The Protocol Stack From Hell™, but attempting to integrate the sample code provided into my testing program failed—not only do I not get the proper messages, but what I do get is completely different from the sample program. This is compounded by the documentation (which everybody agrees is completely worthless) and the fact that this is the first time I've ever worked on anything remotely related to telephony. I'm unfamiliar with the protocols, and with the ins and outs of The Protocol Stack From Hell™ (unlike my manager R, who's worked with this stuff for the past fifteen to twenty years, but is swamped with other, manager-type work).

Now the second problem: even though the testing system is in the same cabinet, and hooked to the same network switch, as the target system (in fact, I think they're physically touching each other), due to the nature of the phone system, communications between phone network components must go through an intermediary system known as an STP; actually, a pair of STPs (for redundancy). Unfortunately for us, the only STP we have access to is out in Washington State (where The Corporate Master Headquarters are stationed) and said traffic between our two testing systems (here in Lower Sheol) goes back and forth across the Internet over a VPN.

Yeah, what can I say? When I asked about getting an STP a bit closer to us, I was told it wasn't in the budget (and no wonder: the price is far into the “if you have to ask, you can't afford it” territory, deep into that territory).

So we're stuck with a six thousand mile round trip for the phone network traffic. And now we come to the punch line: the Internet is broken. The quick synopsis: excessive buffering of Internet traffic by various routers is causing a breakdown of anti-congestion algorithms used by TCP/IP. Now, the article is talking about buffer bloat in consumer grade equipment, but it is possible that there are commercial grade routers doing the same thing (excessive buffering), and that could be the cause of largish spikes in traffic, as well as increased latency in round trips. If there's a spike in traffic, the STP will attempt to assert flow control, but if there's still traffic coming in, it's considered an error. Also, the phone network is very time sensitive and excessive latencies are also an error condition.

Worse, if the STP receives too many errors from an endpoint, it (like every other STP on the phone network) is programmed to take that endpoint out of service. It's hard to say where that point is, but it happens with frightening regularity when I attempt to do load testing. The packets are being pushed when suddenly we start receiving either canceled messages, or delivery failures about five levels down in the protocol stack, which means one (or both) endpoints have been cut loose from the phone network due to excessive errors. It then requires manual intervention to restart the entire stack on both sides.

So, there's bursty traffic due to my attempts at sending a settable amount of traffic. Then there's the (possible) bursty traffic due to excessive buffering across the Internet. Oh, and I forgot to mention the licensing restriction on The Protocol Stack From Hell™ that limits the number of messages we can send and receive. All that makes it quite difficult to find the breaking point of the program I'm testing. I keep breaking the communications channel.

That tends to put a damper on load testing.

Friday, January 07, 2011

Why did I not get the memo?

In what is sure to be one of the most acclaimed comics events of 2011, Fantagraphics has announced that they will be publishing a definitive collection of Carl Barks' seminal run of Donald Duck comic stories. In an exclusive interview with Robot 6, Fantagraphics co-publisher Gary Groth revealed that the company—which announced their plans to publish Floyd Gottfredson's Mickey Mouse comics last summer—had acquired the rights to reprint Barks' work from Disney and that the first volume will be released in fall of this year. The comics will be published in hardcover volumes, with two volumes coming out every year, at a price of about $25 per volume.

Via Eccentric Flower, Exclusive: Fantagraphics to publish the complete Carl Barks | Robot 6 @ Comic Book Resources – Covering Comic Book News and Entertainment


Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: Start with the base link for this site:, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2017 by Sean Conner. All Rights Reserved.