File Compression

The state of the art in general file compression is PPM -- compression by Partial Pattern Matching, with various tweaks. One (somewhat dated) overview of the subject may be found in On-Line Stochastic Processes in Data Compression 1996 156p (Suzanne Bunton, PhD Thesis, Univ of Washington Dept CSci & Eng), which is excerpted in Semantically Motivated IMprovements for PPM Variants 1997 18p (Suzanne Bunton). (Feed "Partial Pattern Matching" into Google for more stuff.)

2004-05-23 Some additional notes:

2004-05-15: pzip v0.83 released. Roughly 20% faster, due to more efficient gathering of active contexts among other things. Fairly major internal restructuring to make that happen, but otherwise no interesting changes.

2004-05-11: pzip v0.82 released. This release represents completion of the code clean-up and commenting phase: The source is now readable enough to perhaps be used for student projects or such. There is now no dead code, virtually every symbol has been renamed, and every function reformatted to impose a uniform coding style, and extensive commenting has been added both within the source proper and also in separate docfiles. It also runs over twice as fast, thanks to a little tuning during cleaning. Compression is just slightly better, thanks to reducing some integer round-off errors, but the emphasis to date has been on clean-up rather than improvement.

2004-04-29: Just for fun, I've ported Charles Bloom's PPMZ2 file compressor to Linux, building on Hannu Peltola and Jorma Tarhio's Linux port of PPMZ V1. After chopping dead code and cleaning up a bit, the source code size dropped from 23410 lines of C to 3864 lines -- an 83.5% shrink! I've also speeded it up by 19%. :)

I've renamed it pzip to prevent confusion and better fit into the Linux naming convention set by gzip and bzip. The source tarball for pzip v0.81 is here. Nothing very fancy: Unpack, do "make check", and if anything goes wrong, well, I overlooked something. :)


Last modified: Sun May 23 19:15:10 CDT 2004