Tuesday, April 01, 2008
Hey! What happened to the Ides of March?
Oh man, is it April First already?
XXXX!
I never did get around to mucking with the stylesheets this year (unlike the previous four years).
Sigh.
Maybe next year.
Compression fever
It's one thing to trick someone on April Fools' Day, but to trick yourself really takes talent.
I have that talent.
Sigh.
I started this last week and didn't suspect a thing.
I came across The Hutter Prize, which will pay up to 50,000€ for compressing a 100,000,000-byte file to less than 16,481,655 bytes (a figure that includes the size of the decompression program). The rules are straightforward, and to me, it certainly seems easier than The Netflix Prize.
At the very least, it couldn't hurt to play around a bit.
So last Thursday I downloaded the file to be compressed and poked through it. The file itself is nothing more than Wikipedia entries wrapped in XML. The XML only makes up 2.4% of the file—the rest is text (13.5% is nothing but the space character; 8% is the letter “e”).
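Numbers like that come out of a simple byte-frequency scan; here's a minimal sketch (assuming the file is enwik8, the name the Hutter Prize gives its 100,000,000-byte test file):

```c
#include <stdio.h>

int main(void)
{
  unsigned long count[256] = { 0 };
  unsigned long total      = 0;
  FILE *fp = fopen("enwik8", "rb");  /* the Hutter Prize test file */
  int c;

  if (fp == NULL) return 1;
  while ((c = fgetc(fp)) != EOF)
  {
    count[c]++;
    total++;
  }
  fclose(fp);

  /* the space character and the letter 'e' dominate */
  printf("space: %5.1f%%\n", 100.0 * count[' '] / (double)total);
  printf("'e':   %5.1f%%\n", 100.0 * count['e'] / (double)total);
  return 0;
}
```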
Friday I started coding. Nothing difficult, and it went pretty smoothly (although the algorithm for Huffman encoding is simple enough, writing it was surprisingly hard; it took about five attempts before I was happy with the code). I did a few benchmarks against gzip and bzip2 and was surprised by the results:
| Program | Size of archive (bytes) |
|---|---|
| original file | 100,000,000 |
| gzip | 36,445,248 |
| bzip2 | 29,008,736 |
| my attempt | 19,774,963 |
While not less than the required 16,481,655 bytes, it is in the ballpark, and I was quite surprised to beat the snot out of bzip2.
Not bad for a few hours of work.
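For the curious, the tree-building step itself is short. Here's a minimal sketch of the textbook merge-the-two-rarest loop, assuming the byte frequencies are already counted (illustrative only, not my actual five-attempt code):

```c
#include <stdlib.h>

typedef struct node
{
  unsigned long freq;
  int           sym;          /* byte value at a leaf, -1 for internal nodes */
  struct node  *left, *right;
} node;

static node *new_node(unsigned long freq, int sym, node *l, node *r)
{
  node *n = malloc(sizeof *n);
  n->freq = freq; n->sym = sym; n->left = l; n->right = r;
  return n;
}

/* index of the lowest-frequency live node; a linear scan is fine
   for 256 symbols (a heap would be the usual choice at scale) */
static int pick_min(node **pool, int n)
{
  int i, best = -1;
  for (i = 0; i < n; i++)
    if (pool[i] != NULL && (best < 0 || pool[i]->freq < pool[best]->freq))
      best = i;
  return best;
}

/* repeatedly merge the two rarest nodes until one root remains;
   each symbol's code length is then its leaf's depth in the tree */
node *build_tree(unsigned long freq[256])
{
  node *pool[256];
  int   i, live = 0;

  for (i = 0; i < 256; i++)
  {
    pool[i] = (freq[i] > 0) ? new_node(freq[i], i, NULL, NULL) : NULL;
    if (pool[i]) live++;
  }

  while (live > 1)
  {
    int   a  = pick_min(pool, 256);
    node *na = pool[a];
    pool[a]  = NULL;

    int   b  = pick_min(pool, 256);
    node *nb = pool[b];

    pool[b] = new_node(na->freq + nb->freq, -1, na, nb);
    live--;
  }
  return pool[pick_min(pool, 256)];
}
```

The encoding table falls out of a tree walk: each left branch appends a 0, each right branch a 1, and the bit string accumulated at a leaf is that byte's code.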
So today, I'm hacking around with my program when I notice something odd—some of the Huffman encodings are the same.
That's not a good sign; each encoding should be unique. It takes a while (almost 7½ minutes to compress 100,000,000 bytes with my program), but I find the problem: a bug in the encoding routine. Building the Huffman encoding table was fine; it was reading the encoded values back out that was buggy (and dropping bits).
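Dropping bits on the read side is a classic failure mode: the bit buffer gets refilled before it's been fully drained. Here's a minimal sketch of the bookkeeping a bit-at-a-time reader needs to get right (hypothetical code, not my actual routine):

```c
#include <stdio.h>

typedef struct
{
  FILE    *fp;
  unsigned buf;    /* the byte currently being drained */
  int      nbits;  /* bits still unread in buf         */
} bitreader;

/* next bit (0 or 1), MSB first, or -1 at end of file; the key is
   to refill buf only when it's truly empty, otherwise the bits
   still in it get silently dropped */
int read_bit(bitreader *br)
{
  if (br->nbits == 0)
  {
    int c = fgetc(br->fp);
    if (c == EOF) return -1;
    br->buf   = (unsigned)c;
    br->nbits = 8;
  }
  br->nbits--;
  return (br->buf >> br->nbits) & 1;
}
```

Decoding then walks the Huffman tree from the root, one bit per branch, emitting a symbol at each leaf; lose even one bit and every code after it is misread.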
How buggy?
| Program | Size of archive (bytes) |
|---|---|
| original file | 100,000,000 |
| gzip | 36,445,248 |
| bzip2 | 29,008,736 |
| my attempt | 35,947,593 |
Um … yeah.
It's not the first time my hopes for vast fortunes were dashed because of a bug fix.
Sigh.