The Joy of Finite Context Modeling

Parashar Krishnamachari

     Among all the useful compression algorithms of today, finite context modeling is the base model that we generally use.  Both the very fast compressors and the state-of-the-art ones use it.  The best known methods are the ppm (Prediction by Partial Matching) variants, which are all fundamentally alike except for their escape code models.  But even simpler models like huffman coding are really finite context models, albeit with single-character contexts.  Even archaic methods like the innumerable lz variants use finite context models to select offsets.  Basically, the entropy of a data set is the lower bound on how far we can ever hope to compress it.  Arithmetic coders achieve bitrates almost exactly equal to the actual entropy.  However, we can augment these coders with pre-processes and predictors that transform the data into a space in which the entropy is lower.  e.g. A string like "abcdefghi" is effectively incompressible because every character is unique.  However, in a differential space, we can write it as "a11111111," which has very low entropy because it contains only 2 distinct characters.  The differential space transformation can be thought of as a 1st-order context because it is a comparison against the immediately prior character.  In reality, that's an oversimplification of what context models represent, as there is no real probability distribution in a differential transformation.
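     To make the differential idea concrete, here's a throwaway Python sketch (the function name is made up, and this is purely an illustration, not anything out of a real compressor) that applies the delta transform to "abcdefghi" and counts the distinct byte values before and after:

# Minimal sketch of the differential (delta) transform described above.
def delta_transform(data: bytes) -> bytes:
    """Keep the first byte, then replace each byte with its difference from the previous one."""
    out = bytearray(data[:1])
    for prev, cur in zip(data, data[1:]):
        out.append((cur - prev) & 0xFF)      # modular difference so the result stays a byte
    return bytes(out)

msg = b"abcdefghi"
delta = delta_transform(msg)
print(sorted(set(msg)))      # 9 distinct byte values -- one per character
print(sorted(set(delta)))    # only 2 distinct byte values: ord('a') and the difference 1
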
     Let's say that we run across a message like "987654321012"...  This has all the decimal digits in it.  So our alphabet for this message is [0..9], which is an alphabet size of 10.  The message length is 12, however...  and we all know about pigeonholes.  We could include our favorite EOF symbol, which is probably in the file that contains the message, but if we store the length (12), then we really don't have to worry about it.  When you're compressing files, having a special EOF character is a bane anyway.  It not only adds one more thing to worry about, but it also throws off your probability distributions.  In this example, the frequency is concentrated in the "1" and "2" symbols, which both have frequencies of 2.  The other symbols have frequencies of 1.  We can someday get around to compressing this useless message, whose length is known beforehand and whose symbol type is bytes...  yippee.  Some data, like sound or images, can be lossily compressed; that is, we can choose to code only the information which is most relevant and throw away the rest, thus introducing error.  On the other hand, loss of data is something we can't afford with text or binary data.  Just think about the mess of a text full of near-random characters, or corrupted binary data which our programs may need in order to run.  Because we need the decompressed file to be identical to the original, we have to use lossless compression for this kind of data.  In most cases we call this kind of compression (compression of files without loss) text compression, commonly English text compression, as English has a structure which has been taken as representative of language in general.  That structure is also not very different from that of binary files, so if a given compressor performs well with text, it's going to perform well with general binary data too.
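     Just to double-check those counts, a couple of lines of Python (nothing fancy, Counter is from the standard library):

from collections import Counter

msg = "987654321012"
print(len(set(msg)), len(msg))    # alphabet size 10, message length 12
print(Counter(msg))               # '1' and '2' show up twice, every other digit once
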
     The standard practice in text compression includes measuring compression ratios in bpb -- bits per byte -- also referred to as bpc, bits per character.  The other issue is the datasets.  We generally use a standard set of files for testing text compression.  The most common from the olden days, and still used today, is the Calgary Corpus.  A larger and more widespread fileset is the Canterbury Corpus.  Both can be downloaded at http://corpus.canterbury.ac.nz/ and the same site also includes a simple tool that runs several existing compressors to test relative performance (in a style akin to SPEC ratings).
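     For reference, bpb is nothing more than output bits divided by input bytes.  A trivial helper (purely illustrative) looks like this:

def bits_per_byte(compressed_bytes: int, original_bytes: int) -> float:
    """Compression ratio in bpb: total output bits over total input bytes."""
    return 8.0 * compressed_bytes / original_bytes

print(bits_per_byte(250_000, 1_000_000))    # 2.0 bpb
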
     Now let's get down to the meat....
     Finite context models assign a probability to a symbol based on the frequencies of the symbols that appeared before the symbol currently being coded.  Before, because when we code a symbol based on the previous symbols, we can be sure that the decoder has the same information (the previous symbols have already been transmitted).  Finite context modeling is based on the assumption that a symbol which has appeared recently is likely to appear again in the near future; the more times it appears, the higher the probability that it appears again, so every time it is seen we increase its probability.
     Since the entropy is as low as we can get, and arithmetic coding achieves it, I'm just going to use the entropy as a reference point from here on out.  Also, the notation lg refers to log-base-2.  Let's also try to keep some notion of scale here.  Obviously, a file of 10 bytes is not going to come down to something all that close to its total entropy, whereas a file of 10k can come pretty close, and a file of 10 MB can come close enough to be pretty much indistinguishable from dead-on entropy size.
     Let's say we have a message like "aaabbc" and a finite context model which assigns a probability of  symbol_frequency/total_frequency to every symbol, then we have:

a -- Frequency = 3...  Probability = 3/6...  Entropy = -lg(3/6) = 1 bit.
b -- Frequency = 2...  Probability = 2/6...  Entropy = -lg(2/6) = 1.59 bits.
c -- Frequency = 1...  Probability = 1/6...  Entropy = -lg(1/6) = 2.59 bits.

If we just use the entropies, (3*1)+(1.59*2)+2.59 = 8.77 bits.  So what happens if we try to monkey with the probability distribution so that the most frequent gets a bigger slice of the pie?  Can we beat the entropy by changing individual entropies with fake ones?

a -- Probability = 3/6...  Fake = 9/12...  Entropy = -lg(9/12) = 0.42 bits.
b -- Probability = 2/6...  Fake = 2/12...  Entropy = -lg(2/12) = 2.59 bits.
c -- Probability = 1/6...  Fake = 1/12...  Entropy = -lg(1/12) = 3.59 bits.

Well, the entropy of the most frequent was brought down, so hopefully we should have an improvement.  But we're human, so hope is often a pipe dream.
(3*0.42)+(2.59*2)+3.59 = 10.03 bits.  Obviously, just putting fake probabilities into the system is not very effective.  The most common symbol went in with fewer bits, but it still didn't help because the others grew too much.  So to really improve compression beyond arithmetic coding, we need a model that actually brings up the probabilities of EVERYTHING (at least on average), and thus achieves higher compression.  It's fairly safe to refer to finite context models as Markov models...  it's just that I'd prefer that no one confuses them with finite state models.  There do exist models for infinite-order Markov chains, but that's hardly practical.  It's doable in the sense that an entire file can be one of the contexts, in which case we're at a point where there is no more data (so it is at the absolute limit...  very much a hand-waving definition of "infinite").  Since it IS effectively a Markov model, we can show via the Kraft inequality that there is no real way to improve on the computed entropy.  One of the deep dark secrets that we've all discovered after all the variants on ppm have been developed is that contexts of 8th order tend to work best in most cases...  better than everything except infinite order.  Infinite order contexts, however, require absurd amounts of memory, and the cheapest way to implement them is through reinforcement learning of sorts...  but the problem with using reinforcement learning is that it can solve the infinite-order chain, but not the [1..infy-1]-order chains.  It is effectively a blended infinite-order context model.  While order-8 models achieve the best compression of the data itself, it's often been found that anything beyond order-5 is not worth it.  The performance hit beyond order-5 is too great for a very small gain.  Moreover, you open yourself up to very high header costs, as you will need many more probability distribution tables as orders get higher.  At 9th order contexts or so, the increase in bandwidth requirements for the probability tables is almost always greater than the gain in compression, and file sizes start to go up.  Depending on the data, this can even happen as low as 3rd order models, but in general, order-9 is about where it's almost always the case.
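     Back to the arithmetic for a second: if you want to double-check the 8.77-bit and 10.03-bit totals, here's a quick sketch that sums the per-symbol entropies of "aaabbc" under the real and the faked distributions.  (It works from the exact probabilities, so it lands a hair under the figures above, which were computed from the rounded per-symbol entropies.)

from math import log2

msg = "aaabbc"
real = {"a": 3/6, "b": 2/6, "c": 1/6}        # actual probabilities
fake = {"a": 9/12, "b": 2/12, "c": 1/12}     # the "monkeyed-with" probabilities

def total_bits(probs: dict) -> float:
    # Sum -lg(p(symbol)) over every symbol in the message.
    return sum(-log2(probs[ch]) for ch in msg)

print(round(total_bits(real), 2))    # 8.75  (8.77 above, from the rounded entropies)
print(round(total_bits(fake), 2))    # 10.0  (10.03 above, same reason)
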
     Basically, the problem here is that we've assigned high probability only to the single most probable symbol.  To really improve compression, the most common symbols (note the plural) need to have higher probabilities.  If we look at character occurrences over the file as a whole, we're probably not going to get anywhere near as high probabilities as when we look at contexts.  Note that we ARE dealing with real probabilities and not ones that have been pushed around in a fake and dirty manner.  The fact that they are real probabilities means that high probability symbols not only take up less data, but also occur more often by a similar scale, so we can expect low data counts rather often.  The context of a given symbol is the finite-length sequence of symbols before or after it.  In finite context modeling we use the symbols before it, because we can be sure that the decompressor's model will have that information.  That ensures that it can have the same probabilities as the compressor's model, and there is no problem decoding the symbols.  On the other hand, if the compressor and decompressor have different probability distributions, the same code may mean different things, and thus we would get corrupted data.  Due to this fact, a rule which you must always follow is that compressor and decompressor must have the same model -- if you want to do lossless compression, that is...
     If I took a context like "whi," we could probably say that the letter "j" has SOME probability of following that string.  But the most likely characters to follow would be "t" (white) or "c" (which).  There is a lower but nonzero probability that the next character is an "r" (whirl)...  not as high as "white" or "which," but certainly higher than "whiqx" or some other outlandishly improbable string.  In the previous example model, the probability of a symbol was taken from the number of times it appeared in the file; when using context information we also have to take into account the context of the current symbol, and thus we take only the probabilities of the symbols which have appeared under that context.  Both text and binary data are context-sensitive, that is, under a given context the same symbols tend to appear, so we can get a more reliable probability.  Binary data tends to be context-sensitive too, but it exhibits a more random structure than text.  This is especially true of executables and less true of binary data that can potentially be compressed in a lossy fashion.  Lossily compressible data can be fit well by alternative models.
     Let's take an example like the following -- "the text"...  We are currently at the very last symbol, in this case, a 't.'  The other times that we saw a t, we had either an h or an e following it.  So that means that now that we've seen another t, the probability that the next character is an 'h' is 1/2.  The probability that the next character is an 'e' is also 1/2.  In the case of ppm variants, we'd also leave room in the probability distribution for an "escape code" -- i.e., something else that doesn't otherwise fit in our model.  If we didn't take context information into account, the probability of an 'e' would have been 1/4, or an entropy of 2 bits...  for that matter, the probability of an 'h' would have been 1/8, which translates to an entropy of 3 bits.  With the context information, we've cut both down to 1 bit each.  To actually implement this, we can't really use a single probability table.  We really need N tables.  Since our context sizes are only 1 character wide, N is just the alphabet size.  Depending on the context, we'd simply choose one.  In this example, we'd have an alphabet size of 5 (or 6 if we include an escape code).  Obviously, if we were playing with order-3 models, our contexts would be 3 characters wide, and we might have more tables.  The reason I say "might" is because not all possible contexts will exist in a given file.  There's also a special order used in ppm, called order-(-1), which is just a flat probability distribution over all the symbols.
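     Here's a minimal sketch of the "one table per context" idea -- just order-1 contexts and Python dictionaries, nothing resembling a real ppm implementation:

from collections import Counter, defaultdict

msg = "the text"

# One frequency table per 1-character context: tables[context][next_symbol] = count.
tables = defaultdict(Counter)
for ctx, nxt in zip(msg, msg[1:]):
    tables[ctx][nxt] += 1

print(dict(tables["t"]))               # {'h': 1, 'e': 1} -- 'h' and 'e' are each 1/2 after a 't'
print(Counter(msg)["e"] / len(msg))    # 0.25 -- the context-free probability of 'e'
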
     Once we have different orders, the question of how to use them arises.  One of the solutions is blending, which simply combines the probabilities from all the orders into a single probability for every symbol.  This is done by adding together the probabilities of the same symbol under the different contexts.  However, as we have seen, probabilities from higher orders tend to be more reliable and thus achieve higher compression; the way we reflect this in blending is by weighting the higher orders, that is, multiplying them to give them more importance when they are added together.  Every context has a different weight, which we can choose beforehand or change while compressing.
     Again, another example...  the string is "baabkab," and this time we've got an order-2-1-0 context model.  Let's take the probability distribution from each order and weight them 8, 2, and 1...  Then these distributions will be blended and we'll see what we get.

Order 2...  context "ab" -- frequencies : a = 0, b = 0, k = 1
Order 1...  context "b"  -- frequencies : a = 1, b = 0, k = 1
Order 0...  null context -- frequencies : a = 3, b = 3, k = 1

Note that there is no context like "ba" listed, because we've already computed the probability of those symbols occurring in the "b" context alone, so there's no need to compute a second dependency, as it would actually rob us of some compression.  If the model were just order-2 as opposed to order-2-1-0, we wouldn't bother with that sort of combining.  In practice, we'll actually use order-2 models because of performance and because we'll only lose compression a few symbols at a time.  Anyway, let's apply the weights and move on...

Order 2...  weighted frequencies : a = 0, b = 0, k = 8
Order 1...  weighted frequencies : a = 2, b = 0, k = 2
Order 0...  weighted frequencies : a = 3, b = 3, k = 1

                           Total : a = 5, b = 3, k = 11

So putting that into a usual probability table, it comes out...

a -- Probability = 5/19...   Entropy = -lg(5/19) = 1.93 bits.
b -- Probability = 3/19...   Entropy = -lg(3/19) = 2.66 bits.
k -- Probability = 11/19...  Entropy = -lg(11/19) = 0.79 bits.

This is as opposed to a standard order-0 model, where the same data would have given us the following entropies.

a -- 1.22 bits
b -- 1.22 bits
k -- 2.81 bits

This actually looks a little better, and the fact is that we really do need the data to be well modeled by a Markov model for the compression to really work.  If that is not the case, then we don't get any gain.  The Markov model serves as our predictor, and if the predictor works out well, that means there is less important data to send, because the predictor could have guessed it anyway.  So with the blended context model, the total data could theoretically be compressed to 14.56 bits.  Without the model, we can theoretically achieve 10.13 bits.  So obviously, this is a case where higher-order context models fail.  We should have noticed that something was up when the least frequent character ended up being the only occurring character in one of the models.  If we had set the weights in the opposite direction, we actually might have gotten some improvement.  Moreover, what we're doing is really combining these different orders of contexts in a mish-mash.  If they had remained separate, then a compression improvement would have been all but guaranteed.  It also might have been nice if our weights were actually normalized to add up to 1.
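     For the curious, here's a small sketch that rebuilds the blended table for "baabkab" with the 8/2/1 weights and spits out the same entropies as above (purely illustrative code -- a real blender wouldn't be structured like this):

from collections import Counter, defaultdict
from math import log2

msg = "baabkab"
weights = {2: 8, 1: 2, 0: 1}                   # weight per order, chosen beforehand

# Count, for every order, what followed each context in the message.
counts = {order: defaultdict(Counter) for order in weights}
for i in range(len(msg)):
    for order in weights:
        if i >= order:
            counts[order][msg[i - order:i]][msg[i]] += 1

# Blend: weighted sum of the frequencies seen under the current contexts.
blended = Counter()
for order, w in weights.items():
    ctx = msg[len(msg) - order:] if order else ""
    for sym, freq in counts[order][ctx].items():
        blended[sym] += w * freq

total = sum(blended.values())
for sym in "abk":
    print(sym, blended[sym], round(-log2(blended[sym] / total), 2))
# prints:  a 5 1.93   b 3 2.66   k 11 0.79
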
     The other real problem with blending is that it's slow.  Despite the fact that it gives fairly good compression, there's another method which is faster: the use of escape codes.  That's the method used by ppm and all the variants thereof.
     Escape codes are the approach we use to solve the "zero frequency" problem of Markovian models.  As we read through the data, we will sooner or later run across a character that doesn't exist in the model.  At that point, we're at a loss as to which probability distribution to use.
     The zero frequency problem especially arises when using high orders.  Look at this text: "we have a model using one byte".  Let's say we are using an order-4 model and we want to know the probability of the next symbol (the context is "byte").  As we have never seen "byte" before, every symbol has a frequency of 0 in that context.  In a context where no symbol has appeared, the most sensible solution seems to be a flat distribution, 1/Z (1/256), which gives the same probability to every new symbol.  But should we assign an equal probability to all the symbols when more than one symbol has appeared?  Look at the following example: if you have the text "aa", you have seen the symbol 'a' once, and you started with a flat distribution, the probability of the next 'a' is 2/257; even if you see it again, it will only be 3/258.  But you'd expect the probability to be much higher than 3/258, maybe something like 2/3 (because its frequency is 2 and nothing else has appeared) -- while still leaving some probability for the case where a new (unseen in this context) symbol appears.  This is what an escape code is for: it means that the symbol which we are currently coding is not in the current context, and thus we have to try another context.  We start with a table (order-0) containing all the symbols (Z) and the escape code; all the symbols have a probability of 0 and the escape code has a probability of 1.  Let's say we have to code a given symbol now.  As it doesn't appear in the order-0 table (where only the escape code has a probability), we have to fall back to order-(-1), where all the symbols have a flat probability and we can safely code the symbol.  An escape code is used to tell the decoder that it has to use the next table; we always go from higher orders to lower orders, till we hit order-(-1), where we can code the symbol in case it hasn't appeared in any higher order.
     For example, the case of "aa" goes like this: the first 'a' is not in the order-0 table, so we emit an escape code (which IS in the order-0 table) and then code the 'a' with the order-(-1) table, with a probability of 1/Z.  Once the symbol is coded, we update the order-0 table to reflect the 'a' that has appeared once; however, we don't update the order-(-1) table, as its only purpose is to code symbols which have never appeared, and symbols which have already appeared will be found in higher orders (probably resulting in more compression).  Then the next 'a' is found in the order-0 table (which we have updated, and which now has two entries in it, "a" and the escape code), and now it has a probability of 1/2, and thus we can code it in only one bit.  Note that the real probability of the second "a" was 1, yet our model assigned it a probability of 1/2, so the coder paid the price for the possibility of a surprise.  The problem with a probability of 1 is that we have no way of encoding around it.  A probability of 1 means an entropy of 0 bits -- that is, with no information at all, we could assume that there's an "a."  Or two a's...  or an infinite stream of a's.  Zero probability and certainty (zero entropy) are equivalent problems, both of which are handled by escape code entries in the probability table.
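     A toy version of that order-0/order-(-1) walk, using method A and assuming an alphabet of 256 byte values (this only tallies code lengths -- a real implementation would be feeding these probabilities to an arithmetic coder):

from math import log2

Z = 256         # order-(-1): a flat 1/Z for every possible symbol
counts = {}     # adaptive order-0 table; the escape code always has a count of 1 (method A)

def code_cost(sym: str) -> float:
    """Return the bits this symbol costs under the current model, then update the order-0 table."""
    total = sum(counts.values()) + 1             # +1 for the escape code
    if sym in counts:
        bits = -log2(counts[sym] / total)
    else:
        # Escape out of order-0, then code the symbol flat at order-(-1).
        bits = -log2(1 / total) + -log2(1 / Z)
    counts[sym] = counts.get(sym, 0) + 1         # update after coding; order-(-1) never updates
    return bits

print(code_cost("a"))    # first 'a': escape (1/1 -> 0 bits) + flat 1/256 -> 8 bits
print(code_cost("a"))    # second 'a': 1/2 in the updated order-0 table -> 1 bit
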
     As you can see, this scheme for the escape code is a bit inefficient, as the escape always has the same probability, no matter how many symbols we have seen to date.  Fortunately there are other schemes.  The one I've just described is method A (and thus a ppm implementation which uses it is called ppma).  Method B is based on "don't believe it till you've seen it twice" -- subtracting one from the count of every symbol -- but this is kind of old stuff, so let's move on to the next one.  In method C, the count of the escape code is equal to the number of different symbols seen (and remember to add the escape count to the total of the counts).  Using escape codes is an alternative to blending, so let's revisit the previous example ("baabkab"), but using escape codes with method C instead of blending.  (E is the escape code, whose count is equal to the number of different symbols seen in the current context; obviously it is not used in order-(-1), as that order already contains all the symbols.)

order 2: context "ab" -- a:0, b:0, k:1, d:0, E:1 -- Total : 2
order 1: context "b"  -- a:1, b:0, k:1, d:0, E:2 -- Total : 4
order 0: no context   -- a:3, b:3, k:1, d:0, E:3 -- Total : 10
order -1: no context  -- a:1, b:1, k:1, d:1, E:0 -- Total : 4

So let's try and imagine coding the next symbol after "baabkab" has been coded --
a -- (-lg(1/2) + -lg(1/4)) = 3 bits
b -- (-lg(1/2) + -lg(2/4) + -lg(3/10)) = 3.737 bits
k -- (-lg(1/2)) = 1 bit
d -- (-lg(1/2) + -lg(2/4) + -lg(3/10) + -lg(1/4)) = 5.737 bits

Now for an explanation...
     After "baabkab" as been coded, the tables would look like those above, so if we try to code a "b", we start at the highest order, as in order-2 the probability of the symbol "b" is 0, we have to emit an escape code whose probability in the order-2 table is of 1/2, then we analyse the order-1 table, here it's probability is also 0, and so we have again another escape code, of course using the escape code probability of order-1 table, which results to be 2/4 (because there are two different symbols in this context, "a" and "c"), once we are in order-0 we predict the "b" with a probability of 3/11, and thus finally we code it. Note that we have not coded "b" in one code but in three. In case a "d" appears, we would emit escape codes in all the tables except in the order-(-1) table, which has a flat probability distribution for all the symbols. (and thus it can predict all the symbols which may appear in the message but whose frequencies till the order-(-1) fallback are 0).  That's why there even happens to be a "d" in the distributions despite the fact that there's no "d" in the stream.  You could say that the "d" takes the place of the escape code in the order-(-1) table, but the main thing is that this is a dynamically updated probability distribution.  Somewhere down the line of a longer stream, we could run across a "d," so it's different from an escape code in that it could potentially be part of the alphabet later.  What we have seen in this example is a simple ppmc implementation.  Far from complete, I know...  but if I went into detail on ppmc, I'd probably fill 22 pages or so...  which is still not much, but no one would read it.
     As mentioned before, the space and processing time requirements for orders above 5 are usually not worth it.  Basic ppm implementations like ppma and ppmc take this to heart.  Regardless, ppmd, ppm*, ppmz and such will make use of higher orders to achieve better compression above all else.
     In short, ppm is a finite context adaptive model which uses arithmetic coding as its coder and tries to accurately predict the next symbol, looking up the current context in a context tree (or any other data structure, like a suffix tree or hash tables).  To predict the current symbol, it uses the highest order possible and falls back to lower orders (using escape codes) till the symbol is found.  Actually, this is not true for all the versions: ppm* chooses the lowest deterministic context, and ppmz chooses the highest deterministic context, or ANY context via Local Order Estimation, which is mostly used to avoid coding extraneous escape codes.  If the ideal order is guessed, we don't have to work our way up or down through the orders; we just use the appropriate context and work from there.  Ppmc achieves an average of around 2.4 bpb on the Calgary Corpus, and Charles Bloom's ppmz is the current state of the art with a 1.9 bpb average.  Notably, though, those averages are skewed slightly by the fact that ppm variants have unusually high compression ratios on image files.  This is more true of ppmz than ppmc, though.  Some of the known "nearly incompressible" images can actually be compressed better by ppmz than by lossy methods.  However, the corpora are created to represent an average range of files, and so they do include examples of files that cannot be compressed very well.
    The differences between finite context models lie not only in the order or the escape code method, but also in how often the probabilities are updated.  In this direction, we have three major kinds of models:

  1. Static.  A static model uses predefined probabilities, which both the coder and decoder know beforehand (and thus we don't need to transmit them); with those fixed probabilities (we never update them) we code the whole file.  This of course has the drawback that if the file doesn't match our probabilities, we are going to expand the file instead.  For example, if we have probabilities for English text and code Japanese text, the result will be bad.  Of course this is the fastest option.  But it is not actually used except in a few applications (for example, it's not used in archivers, but it IS used for coding the probabilities of the symbols in a JPEG file).
  2. Semi-static.  This model does two passes: the first gathers the probabilities of all the symbols; once these are known, it outputs the probability distribution and then starts to output the codes for every symbol.  It's still fast, and in most cases it does a good job; however, it has some inefficiencies, such as having to transmit the whole probability distribution (usually this is only applied to order-0, as going to higher orders would mean transmitting a huge number of probabilities), and the fact that it does not adapt locally.  Just an example: if we have the file "aaaaaabbbbbb", it will code both 'a' and 'b' at one bit each, even though we can do better.  This kind of model is also called semi-adaptive, but it's widely just called a static model (since the pure static model is hardly ever used).
  3. Adaptive.  It processes the file symbol by symbol, in the following way: code symbol, update model (there's a sketch of this loop right after this list).  Thus it has accurate probabilities and doesn't need to tell the decompressor what the probability distribution is before using it.  Of course, it has a drawback: it's slow.  Even though there are some schemes for speeding it up (both for huffman and arithmetic coding), we have to update the model for every symbol processed.  This model, however, has an interesting application in so-called "online compression," where a server and client both keep track of probability distributions and packets can be coded with the model acquired from the previously compressed packets.  It achieves better compression than static models, and we don't really need to transfer any probabilities.  The difficulty here is that as packets are assembled to send across the network, some encoded data could get cut off from one packet and mixed into the next.  Then we have a packet where some data uses a prior update of the probabilities, and some uses something more up to date.  The simplest solution for this is to plop in a distinctive flag that marks the beginning of a new context frame, and we simply don't apply a new update on the receiving end until we see that flag.  The other possibility is to use a fixed packet size in the uncompressed space.  Note that if we use an adaptive model on a file, then it would work best to have an initial state of probabilities stored in the header of our compressed file.  Otherwise, we'd have virtually no compression at all for a small chunk of the file.
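     And here is the "code symbol, update model" loop of item 3 in toy order-0 form (it assumes the alphabet is known up front instead of handling escapes, and it just sums entropies instead of driving a real arithmetic coder):

from math import log2

def adaptive_order0_cost(data: str, alphabet: str) -> float:
    """Total bits for a toy adaptive order-0 model: code each symbol, then update its count."""
    counts = {sym: 1 for sym in alphabet}    # start at 1 so no symbol ever has zero probability
    bits = 0.0
    for sym in data:
        bits += -log2(counts[sym] / sum(counts.values()))    # code with the current distribution...
        counts[sym] += 1                                     # ...then update the model
    return bits

# Roughly 13.55 bits for "aaaaaabbbbbb" -- and no probability table has to be sent along with it.
print(round(adaptive_order0_cost("aaaaaabbbbbb", "ab"), 2))
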
     Of course there are also differences in how a model updates its probabilities (for example, most ppm implementations update only the probabilities in the orders >= the order where the symbol was found; this is called "exclusion"), but this is something which changes with every model, and it will be analysed when those models are used.

Now, to finish off -- some resources :
http://www.cbloom.com/     Homepage of Charles Bloom, creator of LZP and ppmz.
http://www.compressconsult.com/rangecoder/ Info on the Range encoder; a high-performance variant of arithmetic coding.
ftp://ftp.elf.stuba.sk/pub/pc/pack/  ppmd/ppmz source code here.