What Others Know About Compression That You Don't

20 Aug 2012 Ah, the compression formats. Always a mystery. Always just something we will use, but something we never understand. For one thing, this makes sense, considering Wikipedia alone lists 15 different compression formats. There are the well-known bz2, gz and LZMA algorithms. And the less well-known rz, ?Q? and ??_. Now that last one makes me happy this isn't a podcast... Though for this post we will concentrate on Deflate. Deflate is not only used in gzip and zip, but is also a fan favorite to compress web assets when sent to the browser.

The Basic Building Blocks

Underneath Deflate uses LZ77 combined with a Huffman encoding. LZ77 is used to detect duplicate strings and insert a so-called back reference (more on this later). Huffman Encoding is similar to ASCII, but instead of always using 8 bits, it stores the most used characters in 3 bits, and the least used ones, with, well, more bits. More on this also later.

Come On, Deflate Already!

First, Deflate chops up all parts into blocks to make processing easier. Then it decides in which mode it should encode that block:

Store raw stream
Store with a static Huffman encoding
Store with a dynamic Huffman Encoding

Now, it still doesn't deflate. But lets see. First mode is raw mode. In raw mode, Deflate decides that the data is already compressed and it can't do a better job. So it just inserts its header to say "I can't do anything with that, this is already compressed, dummy!". That's why, the next time a coworker shouts at you why you didn't ZIP these files, you can shout back: "D'oh, mode 1, man!". Second mode is static Huffman mode. This just means that Deflate decides that the data is standard enough that the standard Huffman Encoding table should be used. For example useful for English text. This way Deflate doesn't have to add a table as overhead to the compressed content. Third is dynamic Huffman mode. With dynamic Huffman mode, the content gets compressed with a Huffman Encoding, but the encoding table is generated based on the input. So for example, in Finish, the shortest code would probably be "ö", or something like that. In @zedshaw's recent tweets it could be "C", "F", "K" and "U", who knows! But that leaves us with two question: Huffman Encoding, wut? And LZ77?

Huffman Encoding

Huffman Encoding is something that a guy called Huffman developed while getting a Ph.D from ~~MTI~~ MIT. Nothing surprising until now! The paper on it is actually from 1952. Damn it! That's double my age! That paper should feel old! What it basically says: Tally up all individual characters. The most used one gets the shortest code (at 3 bits). Then continue down until you arrived at the end. Actually, Wikipedia does a better job at that than I ever could, so here's a picture: [caption id="attachment_241" align="alignnone" width="600"]

Huffman Coding Tree[/caption] (Source: http://en.wikipedia.org/wiki/File:Huffman_tree_2.svg) So that one sounded scary, turned out to be really easy!

So What About That LZ77

The next part in this tour around compression algorithms leads us right back into history. Can't we get something new going in the compression world? LZ77 was introduced in a paper in, well, 1977 by Lempel and Ziv. LZ77 is a so-called sliding window compressor. At least they came up with impressive names back then. Why sliding window? It checks the last 2, 4 or 32 kB for repeating strings. Simple as that. Should it find one such repeating string, instead of adding that string to the output stream, it puts a back marker in that basically says: If you want to finish reading that sentence, go back to page 32, line 17, and read from character 17 to character 34. That lazy bastard!

Now I See How They Tie Together

Well I hope you do! Else I failed and you should shame in the comments. :) Or of course I can try again: First, Deflate splits up the document into blocks to make working with it easier. Then (lets assume the input isn't already compressed) it runs each block through LZ77 to get all the obvious double strings out of the way. Then it sends it through a Huffman Encoding to make it even shorter. So yes, that's deflate.

What About The Others (BZIP2 Anyone?)

Not all of them use this simple approach, because simple apparently isn't good enough. BZIP2 goes through 9 different stages.

Replacing all 4 or more repeating characters with just 4 characters. (There you see, a geek had to make this algorithm for comic books! "AAAAAAAAAA!" is so much longer than "AAA\7!") So called Comic Book Encoding... Unfortunately not, it's simply called Run Length Encoding. How boring!
Next they wheel around the characters. Which basically means a fancy way to sort the same letters after each other so compression is easier. The so called Burrows-Wheeler Transform.
Replace all repeating characters with 0, so we have as many 0 as possible. Could also be any other character. The Move-to-Front Transform (Yes, I hate these people too at concerts).
And again, after sorting everything possible, we do a Run-length Encoding.
Then a Huffman Encoding. Don't tell me now I have to explain what that is!
Then again we see Huffmaaa\7n! This time with multiple tables per block to make the output even shorter.
8. and 9. Add some binary trickery.

So yes, you can use a lot of other things, but it all comes back to making the output even shorter. Surprising, hu?

So That's Compression

Yes, that's compression. Use a bit of special encoding, remove duplicates, and you are halfway there. Of course, this is just for Deflate. We saw that BZIP2 is already a lot more complicated and uses a lot of different tricks to make the output even smaller, but then again, it is also slower! If you liked this post, you should follow me on Twitter and sign up for the RSS feed! B58P3MMWCDTH