A Comparison of PPMZ2 and PPMd
From: Charles Bloom
Subject: Re: PPMd: Wheels to steal
To: Cynbe ru Taren
Date: Sat, 22 May 2004 23:06:00 -0700
Wow, that's good stuff, you should put this on the web. The PPMd
source code isn't actually available, is it? The successor pointer is
great.
That's amazing that they don't do LOE, or blending in the SEE; both of
those were huge wins for me. I guess they do them in PPMonstr and get
even better compression there.
>When a new ContextNode is created, rather than simply
>initializing its count to 1, they derive an estimate
>from the parent Context, which is basically its 'count'
>in the parent context rescaled to reflect the difference
>in totSymCount between current and parent Context -- but
>by the time they are through with the math, the formula
>has 14 terms, and I'm not about to attempt formatting it
>here in ascii!
This is the most interesting bit to me. This is something that has
been tried since the '80s. Moffat, Cleary & Witten tried it in the
original PPMA/PPMB work!!! It was found that just starting with a new
empty context was better than copying from the parent. So, clearly,
there's something subtle here. Like, I suppose if the parent clearly
has some dominant symbol, you should carry some of that count over to
the new context, and there should be some general formulation for this.
As for the SEE, maybe they've found some very good conditioning
values. In PPMZ I was just sticking everything I could think of in
the SEE, but it works better if you have a tight conditioning hash.
In fact, making it a smaller hash helps compression if you can still
capture the key values.
At 12:19 AM 5/21/2004 -0500, you wrote:
>Charles Bloom writes:
>
> > Yeah, I have no idea how he gets PPMD so fast.
>
>Ok, I just devoted several days to studying PPMd and the paper
>on it:
> http://datacompression.info/Miscellaneous/PPMII_DCC02.pdf
>
>I won't claim to totally grok it all yet, but I do have a pretty
>good general understanding of both -- enough to work out the
>remaining details pretty easily as they become relevant.
>
>Executive summary:
>
> The Dmitry Duo are very sharp, mathematically inclined folks who have
> designed every aspect of PPMd from algorithm to statement phrasing
> details with the sole aim of running faster than bzip2 while still
> compressing better than it.
>
> They have been so successful that they actually outcompress PPMZ2 as
> well as being faster than it, by coming up with a number of
> innovations worth backporting into PPMZ2. (Given the tiny time and
> space budget they are allowing PPMd, PPMZ2 clearly -should-
> outcompress it: More CPU cycles and RAM megabytes do count for
> something, everything else being equal.)
>
> From PPMZ2's perspective, probably the biggest, easiest win to
> backport is their idea of (in PPMZ2 terminology) keeping a 'successor'
> pointer in each ContextNode to the next Context (or highest-order-
> currently-existing Context) to use: For example, in the
> ContextNode 'a' for Context "abracad", we'd have a pointer
> to Context "abracada".
>
> In favorable cases, this almost entirely eliminates the work done by
> ContextTrie_GetNodes: We get the Order-8 Context to use by following
> the above 'successor' pointer, and then pick up the remaining Contexts
> just by following the Context->parent pointers: We fill out
> contexts[PPMZ2_Order+1] at the rate of about two memory cycles per
> slot.
>
> That's the kind of idea that makes you slap your forehead and say
> "Duh! Obviously!" :)
>
>
>
>
>Overview
>--------
>
>If PPMZ2 is a semi tractor, focused purely on compressive power,
>then PPMd is a dragster, focused purely on speed.
>
>If PPMZ2 is an epic poem, then PPMd is a haiku -- the entire
>compression engine fits in 660 lines of C++, written in the
>mathematical style of "he who fits the most code in a screenful
>-- WINS!" (The memory allocator and the range encoder modules
>add a few more lines of code, as does the module which does
>commandline parsing &tc, but it is still a pretty small program
>all told.) I can fit the whole compression engine on my
>display at one time. Of course, my display is 4800x2400 pixels. :)
>
>
>To achieve PPMd's speed, the Dmitrys have imposed the design goal of
>doing everything in O(1) time. In particular, they want to process
>each symbol by looking at only one Context, one SEE entry &tc &tc.
>
>This means no entropic blending in the SEE code, no LOE &tc &tc &tc.
>
>Obviously, a program which doesn't have those restrictions will
>compress better than one which does, everything else being equal.
>That they in fact outcompress PPMZ2 at the moment has to be due to the
>various nice little innovations they've introduced which PPMZ2
>currently lacks, but which would benefit it at least as much.
>
>
>
>Specifics
>---------
>
>Here I'm just going to list a grabbag of (interesting, I hope)
>differences between PPMZ2 and PPMd:
>
>
>
>At the low detail level, the Dmitrys substitute boolean expressions
>for if-then-elses whenever possible, undoubtedly because modern
>deeply pipelined processors passionately hate branches.
>
>Thus, instead of writing
>
> if (a < b) { c += 0.3; }
>
>they will write
>
> c += (a < b) * 0.3;
>
>Any good C compiler will implement the above line entirely
>without jumps, allowing those deep Pentium pipelines to keep
>flowing smoothly.
>
>
>
>The Dmitrys use a cute little range-coder one of them
>wrote instead of an arithmetic encoder. This avoids the
>jiggery-pokery of handling carries (nicer esthetically)
>and is claimed to run twice as fast at a loss of only
>0.01% in file compression efficiency, which I find easy
>to believe. The code is public domain, patent-free, and
>an apparently well-tested C translation is floating
>around the web.
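>
>For flavor, the core of such a coder looks roughly like this (a
>sketch in the style of Subbotin's public-domain range coder, not
>necessarily the exact code PPMd ships):
>
>     // Carry-less range coder, encoder side. The '-low & (BOT - 1)'
>     // truncation is what trades that ~0.01% of compression for
>     // never having to propagate a carry into bytes already written.
>     struct RangeEncoder {
>         enum : unsigned { TOP = 1u << 24, BOT = 1u << 16 };
>         unsigned low = 0, range = 0xFFFFFFFFu;
>
>         void Encode(unsigned cumFreq, unsigned freq, unsigned totFreq) {
>             range /= totFreq;
>             low   += cumFreq * range;
>             range *= freq;
>             // Flush top bytes that can no longer change; when the
>             // interval straddles a byte boundary, shrink it rather
>             // than wait for a carry.
>             while ((low ^ (low + range)) < TOP ||
>                    (range < BOT && ((range = -low & (BOT - 1)), true))) {
>                 PutByte(low >> 24);
>                 low <<= 8;  range <<= 8;
>             }
>         }
>         void PutByte(unsigned char b);  // supplied by the output layer
>     };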
>
>I don't think the encoding speed gain will ever matter to
>PPMZ2, which is oriented towards doing a lot more work per
>symbol in search of maximum compression, so I'd be inclined
>to stick with the current arithmetic encoder, although the
>sheer elegance of the range coder is tempting. (Switching
>from 32- to 64-bit arithmetic should drop the compression
>penalty of the range-coder by a few more decimals, however,
>making adopting it perfectly practical for PPMZ2.)
>
>
>
>The Dmitrys are fanatic about minimizing use of ram.
>For example, their equivalents of ContextNode.sym and
>ContextNode.count are one byte each. That makes sense
>given their design goals, but I don't see any reason
>for PPMZ2 to follow their lead in this respect.
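>
>(For concreteness, their per-symbol record amounts to something
>like the following -- field names from memory, so approximate:)
>
>     struct PPM_CONTEXT;           // their Context type
>
>     struct STATE {                // == PPMZ2's ContextNode
>         unsigned char Symbol;     // ContextNode.sym:   one byte
>         unsigned char Freq;       // ContextNode.count: one byte
>         PPM_CONTEXT  *Successor;  // the 'successor' pointer above
>     };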
>
>
>
>As a curiosity, the Dmitrys even attempt to prefetch values
>into cache before they are needed, in the hopes of wringing
>out a small additional increment of speed: They have a macro
>which does nothing but read from an address and then discard
>the result. Their comment at its point of definition is
>"Mainly to puzzle." :) I would definitely not recommend
>copying this innovation: It clutters the code badly for (I
>strongly suspect) a vanishingly small return in speed --
>quite possibly a negative one, due to diluting the code
>cache with unproductive code.
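>
>For the record, the macro amounts to something like this (my
>reconstruction, not a verbatim quote):
>
>     // Touch an address so its cache line is (hopefully) resident
>     // by the time it is really needed; the value read is discarded.
>     #define PREFETCH_DATA(addr)                                  \
>         { unsigned char t = *(volatile unsigned char *)(addr);   \
>           (void)t; }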
>
>
>
>PPMd stores all the ContextNodes for a given Context in a
>packed vector rather than a linklist. This saves the ram
>needed for a 'next' pointer, and also plays nicer with
>modern cache memory hardware, which is oriented towards
>optimizing scans through sequential memory addresses.
>
>IMHO, this is worth doing.
>
>They don't do LRU (move-to-front) on this list: Instead,
>they keep it sorted by ContextNode.count, which should
>actually be a bit better.
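>
>Keeping the vector count-sorted is nearly free on update; a sketch
>(my code, reusing the STATE record from above):
>
>     // After bumping a symbol's count, bubble its STATE toward the
>     // front so the vector stays sorted by descending count. On
>     // average this costs one compare and the occasional swap.
>     void BumpCount(STATE *stats, STATE *s) {
>         s->Freq += 4;            // counts are in 1/4 units, see below
>         while (s != stats && s->Freq > (s - 1)->Freq) {
>             STATE tmp = *s;  *s = *(s - 1);  *(s - 1) = tmp;
>             --s;
>         }
>     }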
>
>
>
>They don't do any clean LRU-recycling type of stuff when
>they exceed their self-imposed ram limits: They mostly just
>dump the model and start over, although they have hackish
>options to, say, just freeze the model at that point. The
>various options offered would seem to indicate that they
>aren't overly happy with any of them. I'm sure that what
>is going on is that they don't want to spend the space and
>time to implement LRU, so they're left with unpalatable
>hacks.
>
>
>
>They keep ContextNode.count values to 1/4 unit precision
>(rather than just the usual units precision), rescaling when
>it reaches 128 raw (== 32 units).
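>
>In code terms, roughly this (ignoring the re-sorting and the
>totSymCount bookkeeping a real rescale also has to do):
>
>     // 1 count unit == 4 raw; when the bumped count reaches 128 raw
>     // (32 units), halve everything, rounding up so no count hits 0.
>     void MaybeRescale(STATE *stats, int numSyms, STATE *bumped) {
>         if (bumped->Freq < 128) return;
>         for (int i = 0; i < numSyms; ++i)
>             stats[i].Freq = (stats[i].Freq + 1) >> 1;
>     }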
>
>
>
>When a new ContextNode is created, rather than simply
>initializing its count to 1, they derive an estimate
>from the parent Context, which is basically its 'count'
>in the parent context rescaled to reflect the difference
>in totSymCount between current and parent Context -- but
>by the time they are through with the math, the formula
>has 14 terms, and I'm not about to attempt formatting it
>here in ascii!
>
>They call this 'information inheritance'. I have the
>impression they consider it their most significant
>contribution.
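>
>The general shape, per the description above, is something like
>this -- emphatically NOT their 14-term formula, just the bare
>rescaling idea it elaborates on (names hypothetical):
>
>     // Estimate an initial count for a symbol newly added to a
>     // Context from its count in the parent, scaled by the ratio
>     // of the two Contexts' total counts, never starting below 1.
>     unsigned InheritedCount(unsigned parentCount, unsigned parentTot,
>                             unsigned childTot) {
>         unsigned est = parentCount * childTot / parentTot;
>         return est > 0 ? est : 1;
>     }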
>
>They note that it is really best done not when the new
>ContextNode is -created-, but rather the first time it is
>-used- -- because by then the parent Context is likely to
>have slightly more accurate statistics. As a speed hack,
>they do this only for the first symbol in a new context,
>if I understand correctly.
>
>
>
>Because of the above increased use of parent-Context
>statistics (they say), neither full-update nor conventional
>update-exclusion works well for them: Instead when updating
>they do a 1/2 - unit update in the first parent Context that
>would have been left untouched under normal update
>exclusion. They skip doing this on leaf contexts (Order 8
>for PPMZ2) mostly as a speed hack. They also stop doing
>this once the count in the current context reaches 8, to
>avoid imbalancing the parent context too much. (They say
>the '8' was derived purely empirically.)
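>
>As pseudocode, the update step becomes something like this (my
>reconstruction of their description; the helpers are hypothetical):
>
>     // Bump 'sym' in 'ctx' by 'rawInc' quarter-units; get a symbol's
>     // count in whole units; find the first parent Context that
>     // classic update exclusion would have left untouched.
>     void     BumpCount(Context *ctx, unsigned char sym, int rawInc);
>     unsigned CountOf(Context *ctx, unsigned char sym);
>     Context *FirstUntouchedParent(Context *ctx, unsigned char sym);
>
>     void UpdateCounts(Context *coded, int order, unsigned char sym) {
>         BumpCount(coded, sym, 4);          // the usual 1-unit update
>         if (order == PPMZ2_Order) return;  // leaf: speed hack
>         if (CountOf(coded, sym) >= 8)      // their empirical '8'
>             return;
>         if (Context *p = FirstUntouchedParent(coded, sym))
>             BumpCount(p, sym, 2);          // the 1/2-unit update
>     }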
>
>
>
>Since their design goals preclude entropic blending of SEE
>states, they restrict themselves to quite small sets of SEE
>states. They have however obviously put a -lot- of time and
>effort into thinking up and evaluating dimensions (tests)
>for these states, in order to use their limited resources as
>effectively as possible. The part of their paper covering
>this is relatively readable (for me at least, some of the
>other parts -- heavy on math notation and Russian English --
>are quite heavy slogging) but what the program is actually
>doing appears to be significantly different. :)
>
>o With some of these statistics, rather than halving
>   periodically, they like maintaining a short-period
>   running average: On each update, they add in the
>   new value and subtract off the current mean. (See
>   the sketch after this list.)
>
>o They have special hacks for faster convergence
> at the start.
>
>o They divide SEE estimation into three cases:
>
> . 'binary', where the Context contains only one symbol
> -- sounds a lot like 'deterministic' Contexts --
> which is where they say 60-80% of symbol encodings
> start.
>
> . 'masked' ("m-context"), where some of the symbols
> are masked out due to exclusion from a higher-order
> Context.
>
> . 'not masked' ("nm-context"), which is everything else.
>
> They use completely different escape probability estimation
> schemes for each of these cases:
>
> . 'binary': The paper describes use of an 8192-bucket SEE
> statistics scheme for this case, but the code actually
> shows a 25x64 (1600-bucket) table, so I'll describe what the
> code does, as near as I have deciphered it so far:
>
> circa 4.5 bits: The [25]-long dimension codes the
> ContextNode.count for the single symbol
> present in the context, recoded down
> something like logarithmically (by table
> lookup) from the raw 8-bit value.
>
> 1 bit: "run length". Probably not what you think. :)
> They assert that input text tends to divide
> into predictable blocks (like English words),
> and that PPM-type algorithms perform badly
> when the length of such a block exceeds the
> depth D of the context tree. This bit tries
> to measure the length of such a block, presumably
> so that the statistics gathered in such "bad"
> cases don't contaminate those in other cases.
> In the paper, they say the criterion for
> such a block is that there were no escapes
> to lower-order Contexts, and that all symbols
> coded had probabilities > 0.5.
> That appears to be roughly what the code
> is doing, except that they cap the length of
> the block at 12 when the order goes higher
> (they've experimented with orders up through 64.)
> As a tactical note, they save a cycle or two
> by initializing their counter to a negative value
> and then testing its sign bit, which clears when
> the desired count is reached: This avoids doing an
> explicit Boolean compare op, much less a jump.
>
> 2 bits: In PPMZ2 terminology, Context->parent->numSyms coded
> down to two bits ("1, 2, 3 -- many!"): the number of
> symbols in the next-lower-order Context.
>
> 1 bit: "PrevSuccess": Per "run length" above, this is 1
> if the last symbol was coded without escapes at
> probability > 0.5.
>
> 1 bit: Set if the previous symbol coded was >= 0x40:
> Flag=0x08*(FSymbol >= 0x40);
> (Another example of them finessing a branch
> by using a boolean.)
>
> 1 bit: Same as above, for "ContextNode.sym is the sole
> symbol in the current Context".
>
>
> . 'm-context': Context with "masked" (excluded) symbols. This uses a
> separate statistics table half the size of the above one:
> 24x32, with a hash function:
>
> circa 4.5 bits: The [24]-long dimension codes the
> Context.numSyms for the current Context,
> recoded down something like logarithmically
> (by the same table as above) from the raw 8-bit value.
>
> 1 bit: Is the average ContextNode.count > 11?
> "SummFreq > 11*(NumStats+1)"
>
> 1 bit: Do we have many or few unmasked symbols, compared to
> the number of symbols in our parent?
> "2*(2*NumStats < Suffix->NumStats+NumMasked)"
> or in PPMZ2 terminology:
> 2*(2*this->numSyms < this->parent->numSyms + E->numSetSlots)
> (Yeah, E->numSetSlots is a count not currently
> maintained, but you get the idea.)
>
> The remaining three bits come from 'Flags', which
> seems to contain different stuff in different
> situations, and I'm not yet sure just what it
> contains in this one, but it is -something- like:
>
> 1 bit: Set if the previous symbol coded was >=
> 0x40 (as in 'binary')
>
> 1 bit: Set if the statistics in this Context have
> ever been rescaled (halved).
>
> 1 bit: Set if the previous symbol was coded without
> an escape (i.e., in the max-length available
> context, since no LOE is done).
>
>
>
> . 'nm-context' ("non-masked context"): one with no excluded symbols.
> At the moment all I can really say is that I don't understand the
> explanation in the paper (which seems quite incomplete) nor what
> the code is doing. It doesn't involve a separate SEE lookup table
> or a hash like the above, just values maintained in the Context
> over time.
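>
>Re the running-average trick mentioned at the top of this list:
>the update really is tiny. A sketch (close to, but not checked
>field-for-field against, PPMd's actual SEE entries):
>
>     // SEE entry kept as a scaled running sum: each query returns
>     // the mean and subtracts it back out, so old samples decay
>     // away instead of being halved periodically.
>     struct SEEState {
>         unsigned short Summ;   // scaled sum of recent samples
>         unsigned char  Shift;  // log2 of the effective window
>
>         unsigned Mean() {
>             unsigned m = Summ >> Shift;
>             Summ -= m;              // subtract off the mean...
>             return m + (m == 0);    // ...and never predict zero
>         }
>         void Update(unsigned newValue) { Summ += newValue; }
>     };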
>
>
>
>
>
>That's pretty much it for PPMd proper. They had a bunch of
>ideas left over that didn't fit within PPMd's design
>time+space consumption profile, so they also released a
>"PPMonster.exe" which includes these. I've been paying
>less attention to this stuff, but it includes:
>
> o Statistics in young Contexts are subject to a lot
> of statistical noise, so they present (separate)
> formulae for correcting both the more and less
> probable symbols. The former are done via weighted
> combination with the matching value in the parent
> Context. The latter are done by incrementing by
> 3/4 rather than 1 for uncommon symbols in very
> young contexts.
>
> o They suggest another tweak or two to make adaptive
> mean estimates in SEE &tc less subject to quantization
> noise.
>
> o They suggest working harder on Most Probable Symbol
> (MPS) probability estimation by setting up a SEE-style
> table (which they call SSE, "Secondary Symbol probability
> Estimation") for this problem, with separate subtables
> for contexts which are masked and unmasked (that is,
> with or without excluded symbols). I won't repeat the
> hash functions they list here.
>
> o They suggest adding to the SEE (not SSE) hashes tests
> such as:
> o is Context.numSyms larger or smaller than that of
> previous Context?
> o How many "binary" (deterministic) parents does
> Context have? (Mashed to fit in two bits.)
>
> o Finally, they suggest a SEE table for non-masked
> contexts largely parallel to the ones above for
> the other two cases, with hash tests including
> o Context.numSyms mashed down to 0-24 range
> o Whether 4*Context.numSyms > 3*Context->parent.numSyms
> (that looks backwards to me *shrug*)
> plus a grabbag of six more bits taken from the
> above tables.
>
>I haven't looked at the code corresponding to the above
>suggestions (I've read that it isn't available), so I
>don't know if they are accurate: Given the major
>differences between the documented and actual SEE hashes
>in PPMd proper, one might be inclined to be cautious.
>
>
>
>I'm sated with (well -- sick of!) reading code and papers,
>and eager to get back to coding, so I'm going to try
>backporting some of that stuff. The 'successor' pointer
>idea alone should be good for something like a 10-30%
>speedup.
>
>I'll keep you informed.
>
> -- Cynbe
>
>BTW, it has been pointed out to me that the name 'pzip' is
>already very heavily used. Sure enough. So I tried all of
>[a-oq-z]zip and found them likewise already spoken for.
>Bother. Maybe I should call my PPMZ2 derivative ppmzip or
>zipppm or muqzip or something. Wonder if "zippy" is taken? :)
--------------------------------------------------------------------------------------------
Charles Bloom email "cb" http://www.cbloom.com