Forecasting as compression

Abstract

Intelligent agents need to forecast future events - to understand the potential consequences of their possible actions.

Forecasting has a close link with stream compression. However, that link is not necessarily obvious. This page gives details about the extent to which forecasting is equivalent to stream compression.

Compression - the basics

The basics of the analogy between stream compression and forecasting is based on the details of how stream compression systems work. Such systems develop a model of the stream they are compressing - and then use a coding scheme to represent that symbol compactly in an output stream.

Various coding schemes exist: Huffman coding, Arithmetic coding - and so on. Many of these are close to optimal. The interesting part of stream compression systems lies in the model of the stream.

The output of the model is the probability of the next symbol being seen - as it would have been calculated before it was observed. It is that symbol's surprise value - from the perspective of the compression system.

Unsuprising symbols are then assigned short output codes, and suprising symbols are given longer ones. With an accurate data model, this results in compression of the input stream.

The link to forecasting

From this description, the link to forecasting is fairly easily seen. The compression system forms an estimate of the probability of each symbol being observed. Since it could potentially observe any symbol, it must be able to form an estimate for each possible symbol. That is a forcast forward of one step. This forecast can then be extended out into the future - by iterating the process - by imagining streams of input data, and calculating their cumulative probabilities. This process provides a complete forecast, potentially covering the entire range of future possibilities.

The analogy works the other way around too. If you have a complete forecasting system, you can use that to construct a stream compressor, based on its forecasted probabilities.

Partial forecasts

If you have a complete forecast, you can use it to compress. However, in practice, many forecasts are partial. Partial forecasts can be used to compress too, though not so well.

Partial forecasts can take various forms. They may skip out probability information. For example, Google Instant just lists the likely inputs, not their probabilities. However, the predictor - if it is behaving sensibly - usually has access to the probabilities. So, this kind of partial forecast is usually just missing out information the forecaster has access to anyway.

The other kind of partial forecast involves missing out some results. E.g. Google's auto-complete just shows you the top ten results. That doesn't reveal much about the probabilities for most of the available outcomes. Again, you could still use the revealed information to compress - for example, by assuming a universal prior distribution for the "missing" strings - but it wouldn't be very efficient.

There's a case to be made that some forecasters might want to specialize in making partial forecasts. This allows them to devote their attention to the few cases that they are interested in, and pay little attention to the rest.

Compressors can do something like that too, though. For instance, if the top symbol accounts for 98% of the probability distribution, then time may be better spent on speculatively going on to process the next symbol after that - before the symbol before it has been observed - rather than generating a more complete forecast for the current symbol - which probably would not be used anyway.

I think any asymmetry here is likely to be a fairly minor one. Often what is wanted is a general forecast - and focusing on specific paths can be dealt with by using tree pruning.

Decompression / reversibility

There are a couple of reasons for making sure a stream compressor used for forecasting purposes is reversible.

One is to allow memories to be replayed - and reevaluated at leisure.

Another is to do with exploring possible future trees. There, the far edge of a future forecast involves a branching tree of possible futures. To explore the twigs it is convienent to be able to undo observations. This saves the trouble of repeatedly backing up the tree.

This is roughly the equivalent of using a decompressor. Note though that being able to run information small distances forwards and backwards through the system is a more demanding requirement than is usually made of most decompressors.

Latency issues

Stream compressors used for real-time applications have a parameter known as latency - which measures the delay between accepting input symbols and producing output symbols. Greater latency can increase the degree of compression that is possible - by ensuring that more context is available to the compressor before it has to commit to producing any output symbols.

However, forecasting often doesn't work very much like that. A forecaster typically has to make the best forecasts that it can continuously in real time. Forecasting is a low-latency application.

A few more details about latency may be found here.

Buffering

Stream compressors can sometimes buffer data if it comes in too quickly - catching up when the situation becomes more predictabe or boring again.

Forecasters can't usually do that. If they don't have enough time to make a prediction before the data comes in, too bad: it necessarily goes with its best estimate so far.

The real time requirements forecasters usually have to work under mean that that something analogous to buffering is not normally practical.

High-level forecasts

A compressor normally just needs to predict the probabalities of the next possible symbol in the stream - so it can be encoded.

If you have a compressed model of some data, and are projecting it into the future, there are some other things you might like to know - besides what symbol comes next in the stream. Perhaps some of those can be extracted by iteratively making a longer range forecasts - but there are some other options.

If we stop and think for a moment about what happens in the human brain, it does not simply compress and store its raw data input streams in short term memory. Rather it uses a system of layered feature-extraction cells which preprocess the input streams into more managable high level constructs.

If we imagine the input stream at each level being compressed, then that provides a mechanism for answering high-level queries about the future of the input stream.

So, for example, if you are reading some english words, at a low level there is a bit map, above that edge detection, feaature detection, letter recognition, and recognition of words and word parts. Querying the high-level layers allows questions to be posed such as: S-C-R-E-? or I-HAVE-GOT-TO-??? - rather than forecasting the future at the level of low-level sensory inputs.

I think this illustrates that there are ways of making high-level predictions about the future of a stream of data in a manner that would not be required of a stream compresser tasked with just losslessly storing the raw sense data.

This particular example uses feature extraction - and then standard inductive inference. However, there may be other ways of making high level or long-term predictions involving direct access to the compressed model - and then doing things such as projecting trends, or proving invariants. If this sort of thing proves to be is useful, constraints associated with interfacing directly to the information in the compressed model may take us further away from reusing a standard stream compression component.

Occam-related issues

Another issue is whether small size is the most important issue, or whether there are other factors - such as scratch storage space usage, or processing time. I have discussed those issues in some detail in another essay: The One True Razor.

The brain

The human brain compresses its sensory inputs into memories as part of its day-to-day operation. It is also constantly forecasting what will happen next - in order to anticipate events (to give time to plan for them) and to predict the future consequences of the available possible actions.

It seems likely that it uses the same world model to perform all these actions. The compressor that produces the short-term memory stream is the same one that generates probability estimates for sense stream data used to anticipate future events and probability estimates used for calculating the consequences of possible actions.

Summary

Stream compression and forecasting are closely equivalent concepts. They are not exactly equivalent - but the differences are probably much less important than the similarilies are.

If only we knew how to build good quality, general purpose stream compressors, they would do a great job of making practical forecasts in a wide range of domains.

Tim Tyler | Contact | http://matchingpennies.com/