Tuesday, April 01, 2008
Hey! What happened to the Ides of March?
Oh man, is it April First already?
XXXX!
I never did get around to mucking with the stylesheets this year (unlike the previous four years).
Sigh.
Maybe next year.
Compression fever
It's one thing to trick someone on April Fools' Day, but to trick yourself really takes talent.
I have that talent.
Sigh.
I started this last week and didn't suspect a thing.
I came across The Hutter Prize, which will pay up to 50,000€ for compressing a 100,000,000-byte file to less than 16,481,655 bytes (a figure that includes the size of the decompression program). The rules are straightforward, and to me, it certainly seems easier than The Netflix Prize.
At the very least, it couldn't hurt to play around a bit.
So last Thursday I downloaded the file to be compressed and poked through it. The file itself is nothing more than Wikipedia entries wrapped in XML. The XML only makes up 2.4% of the file—the rest is text (13.5% is nothing but the space character; 8% is the letter “e”).
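Numbers like that come out of a simple byte-frequency scan; here's a minimal sketch (assuming the file is enwik8, the name the Hutter Prize gives its 100,000,000-byte test file):

```c
#include <stdio.h>

int main(void)
{
  unsigned long count[256] = { 0 };
  unsigned long total      = 0;
  FILE *fp = fopen("enwik8", "rb");  /* the Hutter Prize test file */
  int c;

  if (fp == NULL) return 1;
  while ((c = fgetc(fp)) != EOF)
  {
    count[c]++;
    total++;
  }
  fclose(fp);

  /* the space character and the letter 'e' dominate */
  printf("space: %5.1f%%\n", 100.0 * count[' '] / (double)total);
  printf("'e':   %5.1f%%\n", 100.0 * count['e'] / (double)total);
  return 0;
}
```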
Friday I started coding. Nothing difficult, and it went pretty smoothly (although the algorithm for Huffman encoding is simple enough, writing it was surprisingly hard; it took about five attempts before I was happy with the code). I did a few benchmarks against gzip and bzip2 and was surprised by the results:
| Program | Size of archive (bytes) |
|---|---|
| original file | 100,000,000 |
| gzip | 36,445,248 |
| bzip2 | 29,008,736 |
| my attempt | 19,774,963 |
While not less than the required 16,481,655 bytes, it is in the ballpark, and I was quite surprised to beat the snot out of bzip2.
Not bad for a few hours of work.
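For the curious, the tree-building step itself is short. Here's a minimal sketch of the textbook merge-the-two-rarest loop, assuming the byte frequencies are already counted (illustrative only, not my actual five-attempt code):

```c
#include <stdlib.h>

typedef struct node
{
  unsigned long freq;
  int           sym;          /* byte value at a leaf, -1 for internal nodes */
  struct node  *left, *right;
} node;

static node *new_node(unsigned long freq, int sym, node *l, node *r)
{
  node *n = malloc(sizeof *n);
  n->freq = freq; n->sym = sym; n->left = l; n->right = r;
  return n;
}

/* index of the lowest-frequency live node; a linear scan is fine
   for 256 symbols (a heap would be the usual choice at scale) */
static int pick_min(node **pool, int n)
{
  int i, best = -1;
  for (i = 0; i < n; i++)
    if (pool[i] != NULL && (best < 0 || pool[i]->freq < pool[best]->freq))
      best = i;
  return best;
}

/* repeatedly merge the two rarest nodes until one root remains;
   each symbol's code length is then its leaf's depth in the tree */
node *build_tree(unsigned long freq[256])
{
  node *pool[256];
  int   i, live = 0;

  for (i = 0; i < 256; i++)
  {
    pool[i] = (freq[i] > 0) ? new_node(freq[i], i, NULL, NULL) : NULL;
    if (pool[i]) live++;
  }

  while (live > 1)
  {
    int   a  = pick_min(pool, 256);
    node *na = pool[a];
    pool[a]  = NULL;

    int   b  = pick_min(pool, 256);
    node *nb = pool[b];

    pool[b] = new_node(na->freq + nb->freq, -1, na, nb);
    live--;
  }
  return pool[pick_min(pool, 256)];
}
```

The encoding table falls out of a tree walk: each left branch appends a 0, each right branch a 1, and the bit string accumulated at a leaf is that byte's code.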
So today, I'm hacking around with my program when I notice something odd—some of the Huffman encodings are the same.
That's not a good sign; each encoding should be unique. It takes a while (almost 7½ minutes to compress 100,000,000 bytes with my program), but I find the problem: a bug in the encoding routine. Building the Huffman encoding table was fine; it was reading the encoded values back out that was buggy (and dropping bits).
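Dropping bits on the read side is a classic failure mode: the bit buffer gets refilled before it's been fully drained. Here's a minimal sketch of the bookkeeping a bit-at-a-time reader needs to get right (hypothetical code, not my actual routine):

```c
#include <stdio.h>

typedef struct
{
  FILE    *fp;
  unsigned buf;    /* the byte currently being drained */
  int      nbits;  /* bits still unread in buf         */
} bitreader;

/* next bit (0 or 1), MSB first, or -1 at end of file; the key is
   to refill buf only when it's truly empty, otherwise the bits
   still in it get silently dropped */
int read_bit(bitreader *br)
{
  if (br->nbits == 0)
  {
    int c = fgetc(br->fp);
    if (c == EOF) return -1;
    br->buf   = (unsigned)c;
    br->nbits = 8;
  }
  br->nbits--;
  return (br->buf >> br->nbits) & 1;
}
```

Decoding then walks the Huffman tree from the root, one bit per branch, emitting a symbol at each leaf; lose even one bit and every code after it is misread.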
How buggy?
| Program | Size of archive (bytes) |
|---|---|
| original file | 100,000,000 |
| gzip | 36,445,248 |
| bzip2 | 29,008,736 |
| my attempt | 35,947,593 |
Um … yeah.
It's not the first time my hopes for vast fortunes were dashed because of a bug fix.
Sigh.