a -- Frequency = 3... Probability = 3/6... Entropy = -lg(3/6)
= 1 bit.
b -- Frequency = 2... Probability = 2/6... Entropy = -lg(2/6) = 1.59 bits.
c -- Frequency = 1... Probability = 1/6... Entropy = -lg(1/6) = 2.59 bits.
If we just use the entropies, (3*1)+(1.59*2)+2.59 = 8.77 bits. So what happens if we try to monkey with the probability distribution so that the most frequent gets a bigger slice of the pie? Can we beat the entropy by changing individual entropies with fake ones?
a -- Probability = 3/6... Fake = 9/12... Entropy = -lg(9/12)
= 0.42 bits.
b -- Probability = 2/6... Fake = 2/12... Entropy = -lg(2/12) = 2.59 bits.
c -- Probability = 1/6... Fake = 1/12... Entropy = -lg(1/12) = 3.59 bits.
Well, the entropy of the most frequent was brought down, so hopefully
we should have an improvement. But we're human, so hope is often
a pipe dream.
(3*0.42)+(2.59*2)+3.59 = 10.03 bits. Obviously, just putting fake probabilities into the system is just not very effective. The most common symbol went in with fewer bits, but it still didn't help because the others grew too much. So to really improve compression beyond arithmetic coding, we need a model that actually brings up probabilities of EVERYTHING (at least on average), and thus achieve higher compression. It's fairly safe to refer to finite context models as Markov models... it's just that I'd prefer that no one confuses them with finite state models. There do exist models for infinite-order Markov chains, but that's hardly practical. It's doable in the sense that an entire file can be one of the contexts, in which case, we're at a point where there is no more data (so it is at the absolute limit... very much a hand-waving definition of "infinite"). Since it IS effectively a Markov, we can really show that there
is no real way to improve on the computed entropy simply by using the Kraft inequalities. One of the deep dark secrets that we've all discovered after all the variants on ppm have been developed is that contexts of 8th order tend to work best in most cases... better than everything except infinite order. Infinite order contexts, however, require absurd amounts of memory and the cheapest way to implement them is through reinforcement learning of sorts... but the problem with using reinforcement learning is that it can solve infinite order chains, but not the [1..infy-1]-order chains. It is effectively a blended
infinite-order context model. While order-8 models achieve the best compression of the data itself, it's often been found that anything beyond order-5 is not worth it. The performance hit beyond order-5 is too great for a very small gain. Moreover, you open yourself up to very high header costs as you will
need many more probability distribution tables as orders get higher. At 9th order contexts or so, the increase in bandwidth requirements for the probability tables is almost always greater than the gain in comopression, and files sizes start to go up. Depending on the data, this can even happen as low as 3rd order models, but in general, order-9 is about where it's almost always the case.
Basically, the problem here is that we've assigned high probability only to the single most probable symbol. To really improve compression, the most common symbols (note plural) need have higher probability. If we look for character occurences on the whole, we're probably not going to get anywhere near as high of probabilities as when we look at contexts. Note that we ARE dealing with real probabilities and not ones that have been pushed in a fake and dirty manner. The fact that they are real probabilities means that high probability symbols not only take up less data, but also occur more often by a similar scale, so we can expect low data counts rather often. The context of a given symbol is the finite length sequence of symbols before or after it. In finite context modeling we use the symbols before it. Because we can be sure that the decompressor's model will have this information. That ensures that it can have the same probabilities as the compressor's model and there would be no problem decoding the symbols. In the other hand if the compressor and decompressor have different probability distributions, the same code may mean different things, and thus we would get corrupted data. Due to this fact, a rule which you must follow is that both compressor and decompressor must have the same model always -- if you want to do lossless compression, that is...
If I took a context like "whi," we could probably say that the letter "j" has SOME probability of following that string. But the most likely characters to follow would be "t" (white) or "c" (which). There is an lower but existent probability for the next character to be an "r" (whirl)... not as high as "white" or "which," but certainly higher than "whiqx" or some other outlandishly improbable string. In the previous example model, the probability of a symbol was taken from the number of time that it appeared in the file, when using the information of context we also have to take care of what is the context of the current symbol, and thus we take only the probability of the symbols which have appeared under this context. Both text and binary data are context-sensitive, that is, under a given context the same symbols tend to appear, so we can get a more reliable probability. Binary data tends to be context
sensitive, however it exhibits a more random structure than text. This is especially true of executables and less true of binary data that can potentially be compressed in a lossy fashion. Lossily compressible data can be fit well by alternative models.
Let's take an example like the following -- "the text"... We are currently at the very last symbol, in this case, a 't.' The other times that we saw a t, we had either an h or an e following it. So that means that now that we've seen another t, the probability that the next character is an 'h' is 1/2. The probability that the next character is an 'e' is also 1/2. In the case of ppm variants, we'd also include space in the probability distribution for an "escape code" -- i.e., something else that doesn't otherwise fit in our model. If we didn't take context information into account, the probability of an 'e' would have been 1/4, or an entropy of 2 bits... for that matter, the probability of an 'h' would have been 1/8, which translates to an entropy of 3 bits. With the context information, we've cut both down to 1 bit each. To actually implement this, we can't really use a single probability table. We really need N tables. Since our context sizes are only 1 character wide, N is just the alphabet size. Depending on the context, we'd simply choose one. In this example, we'd have an alphabet size of 5 (or 6 if we include an escape code). Obviously, if we were playing with order-3 models, our contexts would be 3 characters wide, and we might have more tables. The reason I say "might" is because not all possible contexts will exist in a given file. There's also a special order used in ppm, called order-(-1), which is just a flat probability distribution for all the symbols.
Once we have different orders the question of how to use them arises. One of the solutions is blending, which is simply combining the probability of all the orders in a single probability for every symbol. This is done by adding together the probability of the same symbol under different contexts. However as we have seen probabilities of higher orders tend to be more reliable and thus achieve higher compression, the way we reflect this in blending is by weighting higher orders, that is multiplying them to give them more importance when they are added together. Every context has a different weight which we can choose beforehand, or change while compressing.
Again, another example... the string this time is "baabkab" and this time, we've got an order-2-1-0 context model. Let's take the probability distribution from each order and weight them 8, 2, and 1... Then these distributions will be blended and we'll see what we get.
Order 2... context "ab" -- frequencies : a = 0, b = 0, k = 1
Order 1... context "b" -- frequencies : a = 1, b = 0, k = 1
Order 0... null context -- frequencies : a = 3, b = 3, k = 1
Note that there are no context like "ba" because we've actually computed the probability of them occurring in the "b" context alone, so there's no need to compute a second dependency as it wil actually rob us of some compression. If the model were just order-2 as opposed to order-2-1-0, we wouldn't bother with that sort of combining. In practice, we'll actually use order-2 models because of performance and because of the fact that we'll only lose compression a few symbols at a time. Anyway, let's apply the weights and move on...
Order 2... weighted frequencies : a = 0, b = 0, k = 8
Order 1... weighted frequencies : a = 2, b = 0, k = 2
Order 0... weighted frequencies : a = 3, b = 3, k = 1
Total : a = 5, b = 3, k = 11
So putting that into a usual probability table, it comes out...
a -- Probability = 5/19... Entropy = -lg(5/19) = 1.93 bits.
b -- Probability = 3/19... Entropy = -lg(3/19) = 2.66 bits.
k -- Probability = 11/19... Entropy = -lg(11/19) = 0.79 bits.
This is as opposed to a standard order-0 model, where the same data would have given us the following entropies.
a -- 1.22 bits
b -- 1.22 bits
k -- 2.81 bits
This actually looks a little better, and the fact is that we really
do need data to be well-modeled by a Markov model for the compression to
really work. If that is not the case, then we don't have any gain.
The Markov model serves as our predictor and if the predictor works out
well, that means that there is less important data because the predictor
could guess it correctly, anyway. So with the blended context model,
the total data could theoretically be compressed to 14.56 bits. Without
the model, we can theoretically achieve 10.13 bits. So obviously,
this is a case where higher-order context models fail. We should
have noticed that something was up when the character that was least frequent
ends up being the only occurring character in one of the models.
If we had set the weights in the opposite direction, we actually might
have gotten some improvement. Moreover, what we're doing is really
combining these different orders of contexts in a mish-mash. If they
had remained separate, then compression improvement would have been all
but guaranteeable. It also might have been nice if our weights were
actually normalized to add up to 1.
The other real problem with blending is that it's slow. Despite the fact that it gives fairly good compression, there's another method, which is faster, this is the use of escape codes, that's the method used by ppm and all the variants thereof.
Escape codes are the approach we use to solve the "zero frequency" problem of Markovian models. As we go through reading things out, we will sooner or later run across a character that doesn't exist in the model. At this point, we're at odds to decide which probability distribution to use.
The zero frequency problem specially arises when using high orders, look at this text: "we have a model using one byte" Let's say we are under a order-4 model, and we want to know the probability of the next symbol (the context is "byte"), as we have never before seen 'byte', the probability of the next symbol appearing is 0, then in a context where no symbol has appeared the most optimal solution seems to be a flat distribution, 1/Z (1/256) which gives the same probability to every new symbol. But should we assign an equal probability to all the symbols when more than one symbol has appeared? Look at the following example: If you have the following text "aa" and you have seen the symbol 'a' once and you start with a flat distribution, the probability of the next 'a' is 2/257 even if you see it again it will be 3/258 but you'll expect the probability to be much more higher than 3/258, may be something like 2/3 (because its frequency is 2 and nothing else has appeared), but you still have to leave some probability for the case of a new (unseen in this context) symbol appears. This is called an escape code, which means that the symbol which we are currently coding is not in the current context, and thus we have to try in another context. We start in a table (order-0) with all the symbols (Z) and the escape code, all the symbols have a probability of 0 and the escape code has a probability of 1. Let's say we have to code a given symbol now, as it doesn't appear in the order-0 table (when only the escape codes has a probability) you have to fallback to order-(-1) where all the symbols have a flat probability and you can safely code the symbol. An escape code is used to make the decoder know that it has to use the next table, we always go from high orders to lower order, till we hit order-(-1) where we can code the symbol in case it hasn't appeared in higher orders.
For example, the case of "aa" will be the following: the first 'a' is not in the order-0 table, so we emit an escape code (which is in the order-0 table) and then code the 'a' symbol with the order-(-1) table, with a probability of 1/Z, once the symbol is coded we have to update the order-0 table to reflect the "a" that has appeared once, however we don't update the order-(-1) table, as its only purpose is to code symbols which have never appeared, and symbols which have already appeared will be found in higher orders (probably resulting in more compression). Then the next 'a' is found in the order-0 table (which we have updated and now has two symbols on it, "a" and the escape code) and now it has a probability of 1/2, and thus we can code it to only one bit. In that case note that the real probability of the second "a" was 1 however our model assigned it a probability of 1/2, and then the coder coded the surprise prize. The problem with a probability of 1 is that we have no way of encoding it. Probability of 1 means entropy of 0 bits -- that is, if we have no information, we can assume that there's an "a." Or two a's... or an infinite stream of a's. Zero probability as well as zero entropy are equivalent problems,
both of which are handled by escape code entries in the probability table.
As you see this scheme of making the escape code is a little bit inefficient, as it always has the same value, no matter how many symbols we have seen till the date. Fortunately there are some other ones. The first which I've described is method A (and thus a ppm implementation which uses it, would be called ppma) Method B is based on "Don't believe it till you've seen twice" by subtracting one to the count of every symbol, but this is kind of old stuff, so let's go to see the next one. In method C, the probability of the escape code is equal to the number of different symbols seen. (also remember to add the escape code probability to the total of probabilities) Using escape codes is an alternative of blending, let's see the previous example ("baabkab") but using escape codes with method C instead of blending: (E is the escape code, whose probability is equal to all the different symbols in the current context, obviously is not used in order-(-1) as it already has all the symbols).
order 2: context "ab" -- a:0, b:0, k:1, d:0, E:1 -- Total : 2
order 1: context "b" -- a:1, b:0, k:1, d:0, E:2 -- Total : 4
order 0: no context -- a:3, b:3, k:2, d:0, E:3 -- Total : 11
order -1: no context -- a:1, b:1, k:1, d:1, E:0 -- Total : 4
So let's try and imagine coding of the next symbol after coding the
a -- (-lg(1/2) + -lg(1/4)) = 3 bits
b -- (-lg(1/2) + -lg(2/4) + -lg(3/11)) = 3.874 bits
k -- (-lg(1/2)) = 1 bit
d -- (-lg(1/2) + -lg(2/4) + -lg(3/11) + -lg(1/4)) = 5.874 bits
Now for an explanation...
After "baabkab" as been coded, the tables would look like those above, so if we try to code a "b", we start at the highest order, as in order-2 the probability of the symbol "b" is 0, we have to emit an escape code whose probability in the order-2 table is of 1/2, then we analyse the order-1 table, here it's probability is also 0, and so we have again another escape code, of course using the escape code probability of order-1 table, which results to be 2/4 (because there are two different symbols in this context, "a" and "c"), once we are in order-0 we predict the "b" with a probability of 3/11, and thus finally we code it. Note that we have not coded "b" in one code but in three. In case a "d" appears, we would emit escape codes in all the tables except in the order-(-1) table, which has a flat probability distribution for all the symbols. (and thus it can predict all the symbols which may appear in the message but whose frequencies till the order-(-1) fallback are 0). That's why there even happens to be a "d" in the distributions despite the fact that there's no "d" in the stream. You could say that the "d" takes the place of the escape code in the order-(-1) table, but the main thing is that this is a dynamically updated probability distribution. Somewhere down the line of a longer stream, we could run across a "d," so it's different from an escape code in that it could potentially be part of the alphabet later. What we have seen in this example is a simple ppmc implementation. Far from complete, I know... but if I went into detail on ppmc, I'd probably fill 22 pages or so... which is still not much, but no one would read it.
As mentioned before, the space and processing time requirements for orders above 5 is usually not worth it. This is taken to wit in basic ppm implementations like ppma and ppmc. Regardless, ppmd, ppm*, ppmz and such will make use of higher orders to achieve better compression above all else.
In short, ppm is a finite context adaptive model which uses arithmetic coding as a coder, and tries to accurately predict the next symbol, using the current context in a context tree (or any other data structure like a suffix tree or hash tables) For predicting the current symbol, using the highest order possible and going to a lower order (using escape codes) till the symbol is found. Actually this is not true for all the versions, ppm* chooses the lowest deterministic context, and ppmz chooses the highest deterministic context, or ANY context via Local Order Estimation, which is mostly used to avoid coding extraneous escape codes. If the ideal order is guessed, we don't have to work our way up or down in orders, and just use the appropriate context and work from there. Ppmc achieves an average of around 2.4 bpb on the Calgary Corpus Set, and Charles Bloom's ppmz is the current state of art with 1.9 bpb average. Notably, though, those averages are skewed slightly by the fact that ppm variants have unusually high compression ratios on image files. This is more true of
ppmz than ppmc, though. Some of the known "nearly uncompressible" images can actually be compressed better by ppmz than by lossy methods. However, the corpori are created to represent an average range of files, and so they do include examples of files that cannot be compressed very well.
The difference between finite context models are not only the order or escape code method, but also how often the probabilities are updated. In this direction, we have three major kinds of models:
Now, to finish off -- some resources :
http://www.cbloom.com/ Homepage of Charles Bloom, creator of LZP and ppmz.
http://www.compressconsult.com/rangecoder/ Info on the Range encoder; a high-performance variant of arithmetic coding.
ftp://ftp.elf.stuba.sk/pub/pc/pack/ ppmd/ppmz source code here.