a -- Frequency = 3... Probability = 3/6... Entropy = -lg(3/6)
= 1 bit.
b -- Frequency = 2... Probability = 2/6... Entropy = -lg(2/6)
= 1.59 bits.
c -- Frequency = 1... Probability = 1/6... Entropy = -lg(1/6)
= 2.59 bits.
If we just use the entropies, (3*1)+(2*1.59)+(1*2.59) = 8.77 bits. So what happens if we try to monkey with the probability distribution so that the most frequent symbol gets a bigger slice of the pie? Can we beat the entropy by swapping the real probabilities for fake ones?
a -- Probability = 3/6... Fake = 9/12... Entropy = -lg(9/12)
= 0.42 bits.
b -- Probability = 2/6... Fake = 2/12... Entropy = -lg(2/12)
= 2.59 bits.
c -- Probability = 1/6... Fake = 1/12... Entropy = -lg(1/12)
= 3.59 bits.
Well, the entropy of the most frequent was brought down, so hopefully
we should have an improvement. But we're human, so hope is often
a pipe dream.
(3*0.42)+(2*2.59)+(1*3.59) = 10.03 bits. Obviously, just plugging
fake probabilities into the system is not very effective. The most
common symbol went in with fewer bits, but it still didn't help, because
the others grew too much. So to really improve compression beyond
arithmetic coding, we need a model that actually raises the probabilities
of EVERYTHING (at least on average), and thus achieves higher compression.
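Here's a minimal sketch of that comparison in Python (the symbol counts
are the ones from the example above; the function name and everything
else are just my own illustration):

    from math import log2

    counts = {"a": 3, "b": 2, "c": 1}
    total = sum(counts.values())

    def total_bits(code_probs):
        # Total bits if every occurrence of a symbol is coded with
        # -lg of the given (possibly fake) probability.
        return sum(n * -log2(code_probs[s]) for s, n in counts.items())

    real = {s: n / total for s, n in counts.items()}
    fake = {"a": 9/12, "b": 2/12, "c": 1/12}   # the "monkeyed with" distribution

    print(total_bits(real))  # ~8.76 bits: the entropy bound (8.77 above is from rounding)
    print(total_bits(fake))  # 10.0 bits: worse (10.03 above, again from rounding)

No matter how the fake distribution is chosen, the real one never does
worse; that's the general fact the example illustrates.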
It's fairly safe to refer to finite context models as Markov models...
it's just that I'd prefer that no one confuses them with finite state models.
There do exist models for infinite-order Markov chains, but they're hardly
practical. They're doable in the sense that an entire file can be one
of the contexts, in which case we're at a point where there is no more
data (so it is at the absolute limit... very much a hand-waving definition
of "infinite"). Since it IS effectively a Markov model, we can show, using
the Kraft inequality, that there is no real way to improve on the computed
entropy. One of the deep dark secrets we've all discovered after all the
ppm variants were developed is that 8th-order contexts tend to work best
in most cases... better than everything except infinite order. Infinite-order
contexts, however, require absurd amounts of memory, and the cheapest way
to implement them is through reinforcement learning of sorts... but the
problem with reinforcement learning is that it can solve the infinite-order
chain, but not the [1..infy-1]-order chains. It is effectively a blended
infinite-order context model. While order-8 models achieve the best
compression of the data itself, it's often been found that anything beyond
order-5 is not worth it: the performance hit beyond order-5 is too great
for a very small gain. Moreover, you open yourself up to very high header
costs, since you will need many more probability distribution tables as
the orders get higher. At 9th-order contexts or so, the increase in
bandwidth required for the probability tables is almost always greater
than the gain in compression, and file sizes start to go up. Depending
on the data, this can happen as low as 3rd-order models, but order-9 is
about where it's almost always the case.
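A rough back-of-the-envelope sketch of why that happens (the byte-sized
alphabet is my assumption, not something from the text): the worst-case
number of distinct contexts, and therefore of probability tables that
might need to be described, grows exponentially with the order.

    ALPHABET = 256                       # assuming byte-wide symbols
    for order in (1, 2, 3, 5, 8, 9):
        # worst-case number of contexts (and thus tables) at this order
        print(order, ALPHABET ** order)

Real files only contain a tiny fraction of these contexts, but the trend
is what makes the header costs blow up.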
Basically, the problem here is that we've assigned high probability only
to the single most probable symbol. To really improve compression, the
most common symbols (note the plural) need to have higher probabilities.
If we look at character occurrences over the file as a whole, we're
probably not going to get anywhere near as high probabilities as when we
look at contexts. Note that we ARE dealing with real probabilities and
not ones that have been pushed in a fake and dirty manner. The fact that
they are real probabilities means that high-probability symbols not only
take less data to code, but also occur more often by a similar scale, so
we can expect small outputs rather often. The context of a given symbol
is the finite-length sequence of symbols before or after it. In finite
context modeling we use the symbols before it, because then we can be
sure that the decompressor's model will have the same information. That
ensures that it can have the same probabilities as the compressor's model,
so there is no problem decoding the symbols. On the other hand, if the
compressor and decompressor have different probability distributions, the
same code may mean different things, and we would get corrupted data.
Because of this, a rule which you must always follow is that the compressor
and decompressor must have the same model -- if you want to do lossless
compression, that is...
If I took a context like "whi," we could probably say that the letter "j"
has SOME probability of following that string. But the most likely
characters to follow would be "t" (white) or "c" (which). There is a
lower but nonzero probability of the next character being an "r"
(whirl)... not as high as "white" or "which," but certainly higher than
"whiqx" or some other outlandishly improbable string.
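A minimal sketch of that idea (the word list is a made-up sample of mine,
not anything from this article): count which characters actually follow
the context "whi" and turn the counts into probabilities.

    from collections import Counter

    text = "white which whirl whine whisker which white whim"  # hypothetical sample
    context = "whi"
    followers = Counter(
        text[i + len(context)]
        for i in range(len(text) - len(context))
        if text[i:i + len(context)] == context
    )
    total = sum(followers.values())
    for ch, n in followers.most_common():
        print(repr(ch), n / total)   # 't' and 'c' dominate; 'j' never shows up

On a real corpus the counts would be different, but the point stands:
under a given context, some followers are far more likely than others.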
In the previous example model, the probability of a symbol was taken from
the number of times it appeared in the file. When using context
information, we also have to take care of what the context of the current
symbol is, and thus we take only the probabilities of the symbols which
have appeared under this context. Both text and binary data are
context-sensitive; that is, under a given context the same symbols tend
to appear, so we can get a more reliable probability. Binary data is
context-sensitive too, but it exhibits a more random structure than text.
This is especially true of executables, and less true of binary data that
can potentially be compressed in a lossy fashion; lossily compressible
data can be fit well by alternative models.
Let's take an example like the following --
"the text"... We are currently at the very last symbol, in this case,
a 't.' The other times that we saw a t, we had either an h or an
e following it. So that means that now that we've seen another t,
the probability that the next character is an 'h' is 1/2. The probability
that the next character is an 'e' is also 1/2. In the case of ppm
variants, we'd also leave room in the probability distribution for an
"escape code" -- i.e., something else that doesn't otherwise fit in our
model. If we didn't take context information into account, the probability
of an 'e' would have been 1/4, or an entropy of 2 bits... for that
matter, the probability of an 'h' would have been 1/8, which translates
to an entropy of 3 bits. With the context information, we've cut
both down to 1 bit each. To actually implement this, we can't really
use a single probability table. We really need N tables. Since
our context sizes are only 1 character wide, N is just the alphabet size.
Depending on the context, we'd simply choose one. In this example,
we'd have an alphabet size of 5 (or 6 if we include an escape code).
Obviously, if we were playing with order-3 models, our contexts would be
3 characters wide, and we might have many more tables (up to N^3 of them).
The reason I say "might" is because not all possible contexts will exist
in a given file.
There's also a special order used in ppm, called order-(-1), which is just
a flat probability distribution for all the symbols.
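Here's a minimal sketch of those per-context tables for the "the text"
example (the dictionary-of-counters layout is just one way to do it, not
the only one):

    from collections import Counter, defaultdict
    from math import log2

    data = "the text"
    tables = defaultdict(Counter)        # one frequency table per order-1 context
    for prev, cur in zip(data, data[1:]):
        tables[prev][cur] += 1

    t_table = tables['t']                # context 't' has seen {'h': 1, 'e': 1}
    total = sum(t_table.values())
    for sym, freq in t_table.items():
        p = freq / total
        print(sym, p, -log2(p), "bits")  # both 'h' and 'e' come out at 1 bit

Only the contexts that actually occur in the data get a table, which is
the point made above about not all possible contexts existing in a file.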
Once we have different orders, the question of how to use them arises.
One solution is blending, which is simply combining the probabilities from
all the orders into a single probability for every symbol. This is done
by adding together the probabilities of the same symbol under the different
contexts. However, as we have seen, probabilities from higher orders tend
to be more reliable and thus achieve higher compression; the way we reflect
this in blending is by weighting the higher orders, that is, multiplying
them to give them more importance when they are added together. Every
context has a weight, which we can choose beforehand or change while
compressing.
Again, another example... the string this time is "baabkab", and now
we've got an order-2-1-0 context
model. Let's take the probability distribution from each order and
weight them 8, 2, and 1... Then these distributions will be blended and
we'll see what we get.
Order 2... context "ab" -- frequencies : a = 0, b = 0, k = 1
Order 1... context "b" -- frequencies : a = 1, b = 0, k = 1
Order 0... null context -- frequencies : a = 3, b = 3, k = 1
Note that there is no context like "ba" here, because we've already
accounted for those occurrences in the "b" context alone, so there's no
need to compute a second dependency, as it would actually rob us of some
compression. If the model were just order-2 as opposed to order-2-1-0,
we wouldn't bother with that sort of combining. In practice, we'll often
use plain order-2 models because of performance, and because we'll only
lose compression on a few symbols at a time. Anyway, let's apply the
weights and move on...
Order 2... weighted frequencies : a = 0, b = 0, k = 8
Order 1... weighted frequencies : a = 2, b = 0, k = 2
Order 0... weighted frequencies : a = 3, b = 3, k = 1
Total : a = 5, b = 3, k = 11
So putting that into a usual probability table, it comes out...
a -- Probability = 5/19... Entropy = -lg(5/19) = 1.93 bits.
b -- Probability = 3/19... Entropy = -lg(3/19) = 2.66 bits.
k -- Probability = 11/19... Entropy = -lg(11/19) = 0.79 bits.
This is as opposed to a standard order-0 model, where the same data would have given us the following entropies.
a -- 1.22 bits
b -- 1.22 bits
k -- 2.81 bits
At first glance this looks a little better, but the fact is that we really
do need the data to be well modeled by a Markov model for the compression
to work. If that is not the case, then we don't have any gain. The Markov
model serves as our predictor, and if the predictor works out well, there
is less important data to send, because the predictor could have guessed
it anyway. With the blended context model, the total data could
theoretically be compressed to 14.56 bits. Without the model, we can
theoretically achieve 10.13 bits. So obviously, this is a case where
higher-order context models fail. We should have noticed that something
was up when the least frequent character ended up being the only character
occurring in one of the contexts. If we had set the weights in the
opposite direction, we might actually have gotten some improvement.
Moreover, what we're doing is really combining these different orders of
contexts in a mish-mash. If they had remained separate, a compression
improvement would have been all but guaranteed. It also might have been
nice if our weights were actually normalized to add up to 1.
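Here's a minimal sketch of that blended model, reproducing the numbers
above (the weights and contexts are the ones from the example; the code
layout itself is my own):

    from collections import Counter
    from math import log2

    data = "baabkab"
    alphabet = "abk"

    def freqs(data, order, context):
        # Count the symbols that followed `context` in `data`.
        counts = Counter()
        for i in range(order, len(data)):
            if data[i - order:i] == context:
                counts[data[i]] += 1
        return counts

    # (order, context, weight): order-2 "ab", order-1 "b", order-0 null context
    orders = [(2, "ab", 8), (1, "b", 2), (0, "", 1)]

    blended = Counter()
    for order, context, weight in orders:
        f = freqs(data, order, context)
        for sym in alphabet:
            blended[sym] += weight * f[sym]

    total = sum(blended.values())                      # 19, as above
    for sym in alphabet:
        print(sym, -log2(blended[sym] / total), "bits")  # 1.93, 2.66, 0.79

Note that scaling all the weights by the same factor (for example,
normalizing them to sum to 1) leaves these code lengths unchanged, since
only the ratios between the blended counts matter.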
The other real problem with blending is that it's slow. Even though it
gives fairly good compression, there's another method which is faster:
the use of escape codes. That's the method used by ppm and all of its
variants.
Escape codes are the approach we use to solve
the "zero frequency" problem of Markovian models. As we go through
reading things out, we will sooner or later run across a character that
doesn't exist in the model. At this point, we're at a loss as to which
probability distribution to use.
The zero frequency problem especially arises when using high orders.
Look at this text: "we have a model using one byte". Let's say we are
using an order-4 model, and we want to know the probability of the next
symbol (the context is "byte"). As we have never seen "byte" before, every
next symbol has a probability of 0. In a context where no symbol has
appeared yet, the most sensible solution seems to be a flat distribution,
1/Z (1/256), which gives the same probability to every new symbol. But
should we assign an equal probability to all the symbols once one or more
symbols have appeared? Look at the following example: if you have the
text "aa", you have seen the symbol 'a' once, and you start from a flat
distribution, the probability of the next 'a' is 2/257; even if you see
it again, it will be 3/258. But you'd expect the probability to be much
higher than 3/258 -- maybe something like 2/3 (because its frequency is
2 and nothing else has appeared) -- though you still have to leave some
probability for the case where a new (unseen in this context) symbol
appears. This is what an escape code is for: it tells the decoder that
the symbol which we are currently coding is not in the current context,
and thus we have to try another context. We start with a table (order-0)
containing all the symbols (Z) plus the escape code; all the symbols have
a probability of 0 and the escape code has a probability of 1. Say we
now have to code a given symbol. As it doesn't appear in the order-0 table
(where only the escape code has a probability), we have to fall back to
order-(-1), where all the symbols have a flat probability and we can
safely code the symbol. The escape code is what lets the decoder know
that it has to use the next table; we always go from higher orders to
lower ones, until we hit order-(-1), where we can code the symbol in case
it hasn't appeared in any higher order.
For example, the case of "aa" goes as follows: the first 'a' is not in
the order-0 table, so we emit an escape code (which IS in the order-0
table) and then code the 'a' symbol with the order-(-1) table, with a
probability of 1/Z. Once the symbol is coded, we update the order-0 table
to reflect the 'a' that has now appeared once; however, we don't update
the order-(-1) table, as its only purpose is to code symbols which have
never appeared, and symbols which have already appeared will be found in
higher orders (probably resulting in more compression). Then the next 'a'
is found in the order-0 table (which we have updated, and which now has
two entries, 'a' and the escape code), and now it has a probability of
1/2, so we can code it in only one bit. Note that the real probability
of the second 'a' was 1, but our model assigned it a probability of 1/2,
and the coder had to pay for the possible surprise. The problem with a
probability of 1 is that we have no way of encoding it. A probability of
1 means an entropy of 0 bits -- that is, with no information at all we
could assume that there's an "a". Or two a's... or an infinite stream
of a's. Zero probability and zero entropy are equivalent problems, both
of which are handled by escape code entries in the probability table.
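Here's a minimal sketch of that two-table scheme (order-0 plus
order-(-1), with the escape kept at a fixed count of 1 as in the example;
the function is my own illustration, not anyone's actual implementation):

    from math import log2

    Z = 256                  # size of the order-(-1) alphabet (flat distribution)
    order0 = {}              # symbol -> count; the escape code always counts as 1 here

    def code_cost(symbol):
        # Bits needed to code `symbol`, updating the order-0 table afterwards.
        total = sum(order0.values()) + 1          # +1 for the escape code
        if symbol in order0:
            bits = -log2(order0[symbol] / total)
        else:
            bits = -log2(1 / total)               # emit the escape code...
            bits += -log2(1 / Z)                  # ...then code it flat in order-(-1)
        order0[symbol] = order0.get(symbol, 0) + 1
        return bits

    print(code_cost("a"))    # first 'a': escape costs 0 bits (prob 1) + 8 bits flat
    print(code_cost("a"))    # second 'a': probability 1/2 in order-0 -> 1 bit

An arithmetic coder would actually emit those fractional bit counts; the
sketch only does the bookkeeping.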
As you see, this scheme for the escape code is a little bit inefficient,
as the escape always has the same value no matter how many symbols we
have seen so far. Fortunately there are some other schemes. The first
one, which I've just described, is method A (and thus a ppm implementation
which uses it is called ppma). Method B is based on "don't believe it
till you've seen it twice", subtracting one from the count of every symbol,
but this is kind of old stuff, so let's go on to the next one. In method C,
the count of the escape code is equal to the number of different symbols
seen (and remember to add the escape code's count to the total of the
counts). Using escape codes is an alternative to blending, so let's look
at the previous example ("baabkab") again, but using escape codes with
method C instead of blending. (E is the escape code, whose count equals
the number of different symbols seen in the current context; obviously it
is not used in order-(-1), as that table already has all the symbols.)
order 2: context "ab" -- a:0, b:0, k:1, d:0, E:1 -- Total : 2
order 1: context "b" -- a:1, b:0, k:1, d:0, E:2 -- Total : 4
order 0: no context -- a:3, b:3, k:1, d:0, E:3 -- Total : 10
order -1: no context -- a:1, b:1, k:1, d:1, E:0 -- Total : 4
So let's imagine coding the next symbol after "baabkab" has been coded --
a -- (-lg(1/2) + -lg(1/4)) = 3 bits
b -- (-lg(1/2) + -lg(2/4) + -lg(3/10)) = 3.737 bits
k -- (-lg(1/2)) = 1 bit
d -- (-lg(1/2) + -lg(2/4) + -lg(3/10) + -lg(1/4)) = 5.737 bits
Now for an explanation...
After "baabkab" has been coded, the tables would look like those above.
So if we try to code a "b", we start at the highest order. As the
probability of "b" in order-2 is 0, we have to emit an escape code, whose
probability in the order-2 table is 1/2. Then we look at the order-1
table; here its probability is also 0, so we again emit an escape code,
this time using the escape probability of the order-1 table, which turns
out to be 2/4 (because there are two different symbols in this context,
"a" and "k"). Once we are in order-0, we predict the "b" with a
probability of 3/10, and thus finally code it. Note that we have not
coded "b" with one code but with three. In case a "d" appears, we would
emit escape codes in all the tables except the order-(-1) table, which
has a flat probability distribution over all the symbols (and thus it can
predict any symbol which may appear in the message, even ones whose
frequencies up to the order-(-1) fallback are 0). That's why there even
happens to be a "d" in the distributions despite the fact that there's
no "d" in the stream. You could say that the "d" takes the place of the
escape code in the order-(-1) table, but the main thing is that this is
a dynamically updated probability distribution: somewhere down the line
of a longer stream, we could run across a "d", so it's different from an
escape code in that it could potentially become part of the alphabet
later. What we have seen in this example is a simple ppmc implementation.
Far from complete, I know... but if I went into detail on ppmc, I'd
probably fill 22 pages or so... which is still not much, but no one
would read it.
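Here's a minimal sketch of that method-C accounting (exclusions and the
actual arithmetic coder are left out; it only reproduces the code lengths
above, and the names are my own):

    from collections import Counter, defaultdict
    from math import log2

    data = "baabkab"
    alphabet = "abkd"        # 'd' never occurs, but it is in the alphabet
    MAX_ORDER = 2

    # One frequency table per (order, context) pair, built from the seen data.
    tables = defaultdict(Counter)
    for order in range(MAX_ORDER + 1):
        for i in range(order, len(data)):
            tables[(order, data[i - order:i])][data[i]] += 1

    def cost(symbol, history):
        # Bits to code `symbol` after `history`, with method C escapes.
        # (Assumes the queried contexts are non-empty, as they are here.)
        bits = 0.0
        for order in range(MAX_ORDER, -1, -1):
            table = tables[(order, history[len(history) - order:])]
            escape = len(table)               # method C: one count per distinct symbol
            total = sum(table.values()) + escape
            if table[symbol] > 0:
                return bits + -log2(table[symbol] / total)
            bits += -log2(escape / total)     # emit an escape, drop an order
        return bits + -log2(1 / len(alphabet))  # order-(-1): flat distribution

    for sym in alphabet:
        print(sym, round(cost(sym, data), 3), "bits")  # 3, 3.737, 1, 5.737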
As mentioned before, the space and processing time requirements for
orders above 5 are usually not worth it. Basic ppm implementations like
ppma and ppmc take this to heart. Regardless, ppmd, ppm*, ppmz and the
like will make use of higher orders to achieve better compression above
all else.
In short, ppm is a finite context adaptive model which uses arithmetic
coding as its coder and tries to accurately predict the next symbol,
looking up the current context in a context tree (or any other data
structure, like a suffix tree or hash tables). To predict the current
symbol, it uses the highest order possible and drops to lower orders
(via escape codes) until the symbol is found. Actually, this is not true
for all the versions: ppm* chooses the lowest deterministic context, and
ppmz chooses the highest deterministic context, or ANY context via Local
Order Estimation, which is mostly used to avoid coding extraneous escape
codes. If the ideal order is guessed, we don't have to work our way up
or down through the orders; we just use the appropriate context and work
from there. Ppmc achieves an average of around 2.4 bpb on the Calgary
Corpus, and Charles Bloom's ppmz is the current state of the art with a
1.9 bpb average. Notably, though, those averages are skewed slightly by
the fact that ppm variants have unusually high compression ratios on
image files. This is more true of ppmz than ppmc, though. Some of the
known "nearly incompressible" images can actually be compressed better
by ppmz than by lossy methods. However, the corpora are created to
represent an average range of files, and so they do include examples of
files that cannot be compressed very well.
The differences between finite context models lie not only in the order
or the escape code method, but also in how often the probabilities are
updated. In this respect, we have three major kinds of models: static
models, whose probabilities are fixed in advance and never change;
semi-adaptive (two-pass) models, which first scan the data to gather the
probabilities and then compress with them (and must transmit the tables);
and adaptive models, which start from scratch and update the probabilities
as they compress, so the decompressor can rebuild the same model on the fly.
Now, to finish off -- some resources :
http://www.cbloom.com/
Homepage of Charles Bloom, creator of LZP and ppmz.
http://www.compressconsult.com/rangecoder/
Info on the Range encoder; a high-performance variant of arithmetic coding.
ftp://ftp.elf.stuba.sk/pub/pc/pack/
ppmd/ppmz source code here.