The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Tuesday, April 01, 2008

Compression fever

It's one thing to trick someone on April Fools' Day, but to trick yourself really takes talent.

I have that talent.

Sigh.

I started this last week and didn't suspect a thing.

I came across The Hutter Prize, which will pay up to 50,000€ to compress a 100,000,000 byte file to less than 16,481,655 bytes (which includes the decompression program). The rules are straightforward, and to me, it certainly seems easier than The Netflix Prize.

At the very least, it couldn't hurt to play around a bit.

So last Thursday I downloaded the file to be compressed and played around with it. The file itself is nothing more than Wikipedia entries wrapped in XML. The XML only makes up 2.4% of the file—the rest is text (13.5% is nothing but the space character; 8% is the letter “e”).

Friday I started coding. Nothing difficult and it was pretty straightforward (although while the algorithm for Huffman encoding is pretty simple, writing it was surprisingly difficult; it took about five attempts before I was happy with the code). I did a few benchmarks against gzip and bzip2 and was surprised with the results:

My Sooperseekrit compression algorithm vs. gzip and bzip2
ProgramSize of archive
orginal file 100,000,000
gzip 36,445,248
bzip2 29,008,736
my attempt 19,774,963

While not less than the required 16,481,655 bytes, it is in the ballpark, and I was quite surprised to beat the snot out of bzip2.

Not bad for a few hours of work.

So today, I'm hacking around with my program when I notice something odd—some of the Huffman encodings are the same.

That's not a good sign—each encoding should be unique. It takes a while (it takes almost 7½ minutes to compress 100,000,000 bytes with my program) but I find the problem—a bug in the encoding routine. Building the Huffman encoding table was fine—it was reading the encoded values that was buggy (and dropping bits).

How buggy?

My fixed Sooperseekrit compression algorithm vs. gzip and bzip2
ProgramSize of archive
orginal file 100,000,000
gzip 36,445,248
bzip2 29,008,736
my attempt 35,947,593

Um … yeah.

It's not the first time my hopes for vast fortunes were dashed because of a bug fix.

Sigh.

Obligatory Picture

[It's the most wonderful time of the year!]

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2019 by Sean Conner. All Rights Reserved.