The state of the art in general file compression is PPM -- Prediction by Partial Matching, with various tweaks. One (somewhat dated) overview of the subject may be found in On-Line Stochastic Processes in Data Compression 1996 156p (Suzanne Bunton, PhD Thesis, Univ of Washington Dept CSci & Eng), which is excerpted in Semantically Motivated Improvements for PPM Variants 1997 18p (Suzanne Bunton). (Feed "Prediction by Partial Matching" into Google for more stuff.)
2004-05-23 Some additional notes:
The initial problem was that PPMZ2 packs more than one byte of key per Context at tree levels 6 (three bytes), 7 and 8 (four bytes each). This appears to be a -very- effective trick: converting back to one byte per Context (and correspondingly increasing the maximum tree depth from 8 to 16) decreases the compression ratio even after quadrupling the RAM pool size to compensate, and slows the program down by a factor of two or more. It is not clear whether getting the 'successor' pointer hack working will produce a net win.
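The arithmetic behind "depth 8 versus depth 16" can be checked with a small sketch. The table and function names below are hypothetical (they are not taken from the pzip or PPMZ2 source), and the one-byte width for levels 1 through 5 is an inference from the note above, which only calls out levels 6, 7 and 8 as wider.

```c
#include <assert.h>

enum { MAX_LEVEL = 8 };

/* Key bytes consumed at each tree level (index 0 is the root, which
   consumes none). Levels 6-8 follow the note above; one byte each for
   levels 1-5 is an assumption, not a value read from the PPMZ2 source. */
static const int key_bytes_at_level[MAX_LEVEL + 1] = {
    0, 1, 1, 1, 1, 1, 3, 4, 4
};

/* Total bytes of context covered by a path from the root down to
   `level`, inclusive. */
static int context_bytes_through_level(int level)
{
    assert(level >= 0 && level <= MAX_LEVEL);
    int total = 0;
    for (int i = 1; i <= level; i++)
        total += key_bytes_at_level[i];
    return total;
}
```

Under those assumptions a full-depth path of 8 nodes spans 5 + 3 + 4 + 4 = 16 bytes of context, which is why a one-byte-per-Context tree needs depth 16 (and twice as many nodes per path) to see the same history.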
After the above, I spent a day trying to get the code debugged, without success. The successor nodes need to be created during context_Update() (that is, when their predecessors are in hand), and to create a given successor node we need to have its parent already existing and in hand, which appears most easily done via a self->parent->successor path. I believe this requires us to have context_Update() process nodes which it usually would not under update exclusion, at least to the point of creating successor nodes as appropriate. (Obviously, inappropriate statistics updating should not be done.) Also, it appears to me that this might require keeping Followset_Nodes (ContextNodes) in existence sometimes even if their 'count' is zero (due to statistics halving), since they may be needed to implement the successor mechanism.
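A minimal sketch of the self->parent->successor idea follows. Everything in it is an assumption made for illustration: the Node layout, node_new() and ensure_successor() are hypothetical names, not pzip's ContextNode or context_Update(), and the sketch uses one key byte per node with no depth limit, no statistics and no update exclusion. It only shows why the parent's successor has to exist before a node's own successor can be attached.

```c
#include <assert.h>
#include <stdlib.h>

enum { ALPHABET = 256 };

/* Hypothetical node layout, not pzip's actual ContextNode. */
typedef struct Node {
    struct Node  *parent;              /* same context minus its oldest byte; NULL at root */
    unsigned char key;                 /* oldest byte of this context (key within parent)  */
    int           order;               /* number of bytes in this context                  */
    struct Node  *child[ALPHABET];     /* longer contexts, keyed by the next-older byte    */
    struct Node  *successor[ALPHABET]; /* context reached after coding a given symbol      */
} Node;

static Node *node_new(Node *parent, unsigned char key)
{
    Node *n = calloc(1, sizeof *n);    /* zeroed: all child/successor slots start NULL */
    if (!n)
        abort();
    n->parent = parent;
    n->key    = key;
    n->order  = parent ? parent->order + 1 : 0;
    return n;
}

/* Return the successor of `self` on `sym`, creating it if needed.
   The successor's parent is self->parent's successor on the same
   symbol, so that node is created (recursively) first. */
static Node *ensure_successor(Node *self, unsigned char sym)
{
    if (self->successor[sym])
        return self->successor[sym];

    Node *succ;
    if (!self->parent) {
        /* Root: the successor of the empty context is the order-1 context `sym`. */
        if (!self->child[sym])
            self->child[sym] = node_new(self, sym);
        succ = self->child[sym];
    } else {
        Node *succ_parent = ensure_successor(self->parent, sym);
        /* The successor hangs off succ_parent under the same key as self. */
        if (!succ_parent->child[self->key])
            succ_parent->child[self->key] = node_new(succ_parent, self->key);
        succ = succ_parent->child[self->key];
    }
    self->successor[sym] = succ;
    return succ;
}
```

In this sketch a missing ancestor is created on demand instead of triggering an assert(). A real implementation would also need to cap the depth and decide which of the nodes touched along the way get their statistics updated.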
Even after implementing (? -- perhaps I made a mistake) the above fixes, however, I was still assert()ing out on missing ContextNodes needed to implement the above self->parent->successor path. I'm a bit tired of all this at the moment, so (since I'm doing this as a hobby) I'm going to take a break and Do Something Completely Different for a while. When I pick it up again, I may try adding PPMZ2's ideas to PPMd rather than vice versa.
2004-05-15: pzip v0.83 released. Roughly 20% faster, due to more efficient gathering of active contexts among other things. Fairly major internal restructuring to make that happen, but otherwise no interesting changes.
2004-05-11: pzip v0.82 released. This release represents completion of the code clean-up and commenting phase: The source is now readable enough to perhaps be used for student projects or such. There is now no dead code, virtually every symbol has been renamed, every function has been reformatted to impose a uniform coding style, and extensive commenting has been added both within the source proper and in separate docfiles. It also runs over twice as fast, thanks to a little tuning during cleaning. Compression is just slightly better, thanks to reducing some integer round-off errors, but the emphasis to date has been on clean-up rather than improvement.
2004-04-29: Just for fun, I've ported Charles Bloom's PPMZ2 file compressor to Linux, building on Hannu Peltola and Jorma Tarhio's Linux port of PPMZ V1. After chopping dead code and cleaning up a bit, the source code size dropped from 23410 lines of C to 3864 lines -- an 83.5% shrink! I've also speeded it up by 19%. :)
I've renamed it pzip to prevent confusion and better fit into the Linux naming convention set by gzip and bzip. The source tarball for pzip v0.81 is here. Nothing very fancy: Unpack, do "make check", and if anything goes wrong, well, I overlooked something. :)