The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Sunday, Debtember 05, 2004

Building more character

It's even worse than I expected.

Or funnier, depending upon your viewpoint.

I'm following Sam Ruby's advice on Iñtërnâtiônàlizætiøn and testing what comes through as I copy-n-paste.

I'm using Firefox version 1.0 under Linux for testing; I've yet to use Microsoft's Internet Explorer (which should prove to be dreadful and amusing at the same time).

So I'm copy-n-pasting the following from my last entry:

Γªρßåγ€

It really bugs me

Building character

And what I get back is:

text is WINDOWS-1252

00000000: 0D 0A 26 23 39 31 35 3B AA 26 23 39 36 31 3B DF ..&#915;.&#961;.
00000010: E5 26 23 39 34 37 3B 80 0D 0A 0D 0A 49 74 20 72 .&#947;.....It r
00000020: 65 61 6C 6C 79 20 62 75 67 73 20 6D 65          eally bugs me

Γªρßåγ€

It really bugs me

What the XXXX?

Not only is Firefox, under Linux, sending input to the test form as WINDOWS-1252 but it's also sending me HTML entities in numeric form! They're the correct numeric entities for the characters in question, but … what the XXXX?
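For the curious, here is a rough sketch of how decimal numeric entities like the ones in that dump could be turned into UTF-8. It is not the code from the actual program; the function names decode_numeric_refs() and utf8_encode() are made up for illustration, and it assumes any WINDOWS-1252 bytes have already been converted to UTF-8 (the entities themselves are plain ASCII, so they survive that conversion untouched).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written, or 0 if cp is not a valid character. */
static size_t utf8_encode(unsigned long cp, char *out)
{
  if (cp < 0x80) {
    out[0] = (char)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (char)(0xC0 | (cp >> 6));
    out[1] = (char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return 0;                       /* surrogates are not characters */
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}

/* Hypothetical helper: replace decimal references such as "&#915;" with
 * their UTF-8 encoding, in place. The UTF-8 form is always shorter than
 * the reference, so the write pointer never overtakes the read pointer.
 * Anything malformed (no ';', hex form, out of range) is copied as-is.
 * Returns the new length of the string. */
size_t decode_numeric_refs(char *s)
{
  char *src = s;
  char *dst = s;

  while (*src != '\0') {
    if (src[0] == '&' && src[1] == '#' && src[2] >= '0' && src[2] <= '9') {
      char          *end;
      char           buf[4];
      unsigned long  cp = strtoul(src + 2, &end, 10);
      size_t         n  = (*end == ';' && cp != 0) ? utf8_encode(cp, buf) : 0;

      if (n > 0) {
        memcpy(dst, buf, n);
        dst += n;
        src  = end + 1;               /* skip past the ';' */
        continue;
      }
    }
    *dst++ = *src++;
  }
  *dst = '\0';
  return (size_t)(dst - s);
}
```

Calling decode_numeric_refs() on a writable buffer holding "&#915;" leaves the two bytes CE 93 (Γ in UTF-8) behind. Hexadecimal references (&#x393;) and named entities are deliberately left alone in this sketch.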

To say I wasn't expecting this is a bit of an understatement.

Well, back to hacking.


Yet more character building

Back from hacking.

And it's been an interesting session. Learned quite a bit, and picked up some new tricks as well.

I was doing some testing when I copy-n-pasted the following:

sensor—and? How did that happen?  Let's look at the O'Reilly source from which Cory copy and pasted:

The phone has become a platform, moving beyond mere voice to smart mobile sensor—and back to phone again, by way of voice-over-IP.

Sam Ruby: Copy and Paste

The program I wrote would classify the text as UTF-8, then iconv() would return an error. I rewrote the conversion routine so that when it failed (iconv() would return where it failed doing the conversion) I would re-classify the remaining text and continue.
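A minimal sketch of that retry idea, under some loud assumptions: the real classifier isn't shown here, so this stands in a fixed list of candidates (UTF-8 first, then WINDOWS-1252), and the name convert_mixed() is invented for the example. The one thing it relies on is documented iconv() behavior: on failure, the input pointer is left at the byte where conversion stopped.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Stand-in for the classifier (an assumption): encodings to try, in order. */
static const char *const k_candidates[] = { "UTF-8", "WINDOWS-1252" };
#define N_CANDIDATES (sizeof k_candidates / sizeof k_candidates[0])

/* Convert in[0..inlen) to UTF-8 in out[0..outlen).
 * Returns the number of bytes written, or (size_t)-1 on failure. */
size_t convert_mixed(const char *in, size_t inlen, char *out, size_t outlen)
{
  char   *src     = (char *)in;   /* iconv() wants char **, it only reads */
  char   *dst     = out;
  size_t  srcleft = inlen;
  size_t  dstleft = outlen;
  size_t  which   = 0;            /* index of the candidate in use */
  size_t  stuck   = 0;            /* consecutive attempts with no progress */

  while (srcleft > 0) {
    iconv_t cd = iconv_open("UTF-8", k_candidates[which]);
    if (cd == (iconv_t)-1)
      return (size_t)-1;

    char   *before = src;
    size_t  rc     = iconv(cd, &src, &srcleft, &dst, &dstleft);
    int     err    = errno;
    iconv_close(cd);

    if (rc != (size_t)-1)
      break;                      /* the rest converted cleanly */
    if (err == E2BIG)
      return (size_t)-1;          /* output buffer too small */

    /* EILSEQ or EINVAL: src now points at the offending byte.
     * Give up if every candidate has failed at this same position. */
    stuck = (src == before) ? stuck + 1 : 0;
    if (stuck == N_CANDIDATES)
      return (size_t)-1;

    which = (which + 1) % N_CANDIDATES;   /* "re-classify" the remainder */
  }
  return outlen - dstleft;
}
```

Given the bytes "sensor", 0x97, "and", this would convert "sensor" as UTF-8, trip over the 0x97, then finish the rest as WINDOWS-1252, where 0x97 is an em dash. One design caveat: nearly every byte is valid WINDOWS-1252, so once a sketch like this falls back to it, any real UTF-8 later in the input gets misread rather than rejected.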

Doing that, the text fragment above would be first tagged as UTF-8, then WINDOWS-1252, and displayed correctly:

sensor—and? How did that happen? Let's look at the O'Reilly source from which Cory copy and pasted:

The phone has become a platform, moving beyond mere voice to smart mobile sensor—and back to phone again, by way of voice-over-IP.

But if I copied the text twice, it would still be tagged as first UTF-8 then WINDOWS-1252, but the second copy would be incorrect:

sensor—and? How did that happen? Let's look at the O'Reilly source from which Cory copy and pasted:

The phone has become a platform, moving beyond mere voice to smart mobile sensor—and back to phone again, by way of voice-over-IP.

sensor—and? How did that happen? Let's look at the O'Reilly source from which Cory copy and pasted:

The phone has become a platform, moving beyond mere voice to smart mobile sensor—and back to phone again, by way of voice-over-IP.

Not really sure how to handle that (“garbage in, garbage out” and all that) but it's a lot better than things were before. All that was left was to add some more code to allow plain text or HTML formatted text and a preview mode; I put it online so those of you who are curious can play around with it.

The trick I learned (an epiphany if you will): I added the following to the code:

volatile int g_debug = 1;

while(g_debug)
  ;

That will cause the program to just sit there, doing vast amounts of nothing really fast. The reason for such a weird thing is that debugging a CGI program (and yes, this is written in C; don't ask) is not easy (I used to go through quite a bit of rigmarole to simulate the webserver environment so I could use a debugger). This trick allows the webserver to run the program (which will just sit there), and I can then use gdb to attach to the running process to debug it. Once in, I set my breakpoint, do set g_debug=0, and resume execution of the program. I wish I had known about this eight years ago.
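A possible variation on the same trick, offered as a sketch rather than what the program actually does: only wait when an environment variable is set, and sleep instead of spinning so the CPU isn't pegged while attaching. The variable name CGI_DEBUG_WAIT and the function wait_for_debugger() are made up for this example; the webserver would have to be configured to pass that variable through to the CGI program.

```c
#include <stdlib.h>
#include <unistd.h>

/* volatile so the compiler rereads it on every pass; cleared from gdb
 * with "set var g_debug = 0" once attached. */
volatile int g_debug = 1;

/* Hypothetical helper: call this first thing in main().
 * Returns 0 right away unless CGI_DEBUG_WAIT (an invented name) is set;
 * otherwise idles until g_debug is cleared, then returns 1. */
int wait_for_debugger(void)
{
  if (getenv("CGI_DEBUG_WAIT") == NULL)
    return 0;

  while (g_debug)
    sleep(1);     /* idle quietly instead of busy-looping */

  return 1;
}
```

The gdb side stays the same: attach to the process, set a breakpoint, clear g_debug, and continue. The trade-off is up to a second of delay after clearing the flag, in exchange for a program that can be left in place without hanging every request.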

Another amusing thing I learned: the “/” character in Firefox will bring up a search box. It's not a bad thing, until you try typing a “/” in a <TEXTAREA> field. Then it gets downright annoying.

Now to take what I have and integrate it.

Obligatory Picture

Dad was resigned to the fact that I was, indeed, a landlubber, and turned the boat around yet again …

Obligatory AI Disclaimer

No AI was used in the making of this site, unless otherwise noted.

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2025 by Sean Conner. All Rights Reserved.