Forecasting as compression
Abstract
Intelligent agents need to forecast future events - to understand the
potential consequences of their possible actions.
Forecasting has a close link with stream compression. However, that link is
not necessarily obvious. This page gives details about the extent to which
forecasting is equivalent to stream compression.
Compression - the basics
The analogy between stream compression and forecasting rests on how
stream compression systems work. Such systems develop a
model of the stream they are compressing - and then use a coding scheme to
represent each symbol compactly in an output stream.
Various coding schemes exist:
Huffman coding,
Arithmetic coding - and so on.
Many of these are close to optimal. The interesting part of stream compression
systems lies in the model of the stream.
The output of the model is the probability of the next symbol being seen -
as calculated before it is observed. That probability determines the
symbol's surprise value - from the perspective of the compression system.
Unsurprising symbols are then assigned short output codes, and surprising
symbols are given longer ones. With an accurate data model, this results in
compression of the input stream.
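A minimal sketch of the relationship between probability and code length: a near-optimal entropy coder spends roughly -log2(p) bits on a symbol of probability p. The toy distribution below is purely illustrative.

```python
import math

def code_length_bits(prob):
    """Ideal code length, in bits, for a symbol of the given probability.

    A near-optimal coder (e.g. an arithmetic coder) spends about
    -log2(p) bits per symbol, so likely symbols get short codes
    and surprising symbols get long ones.
    """
    return -math.log2(prob)

# A hypothetical model's probabilities for the next symbol (toy values):
model = {"e": 0.5, "t": 0.25, "q": 0.25}
costs = {sym: code_length_bits(p) for sym, p in model.items()}
# The most probable symbol, "e", gets the shortest code.
```

Summing these ideal code lengths over a stream gives the best compressed size the model can achieve.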
The link to forecasting
From this description, the link to forecasting is fairly easily seen. The
compression system forms an estimate of the probability of each symbol being
observed. Since it could potentially observe any symbol, it
must be able to form an estimate for each possible symbol. That is a
forecast forward of one step. This forecast can then be extended out into the
future - by iterating the process - by imagining streams of input data, and
calculating their cumulative probabilities. This process provides a
complete forecast, potentially covering the entire range of future
possibilities.
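The iteration described above can be sketched as follows. The one-step model here is a fixed toy distribution standing in for the compressor's real, history-conditioned estimates; the multi-step forecast enumerates imagined input sequences and multiplies their step probabilities.

```python
from itertools import product

def one_step(history):
    """Toy one-step model: a fixed distribution over the next symbol.

    A real compressor's model would condition on the history; this
    placeholder just stands in for its probability estimates.
    """
    return {"a": 0.7, "b": 0.3}

def forecast(history, depth):
    """Extend the one-step forecast `depth` steps into the future by
    enumerating possible symbol sequences and accumulating their
    probabilities."""
    results = {}
    for seq in product("ab", repeat=depth):
        p, h = 1.0, list(history)
        for sym in seq:
            p *= one_step(h)[sym]
            h.append(sym)
        results["".join(seq)] = p
    return results

two_step = forecast([], 2)
# The probabilities over all length-2 futures sum to 1.
```

The enumeration grows exponentially with depth, which is why practical forecasters prune unlikely branches rather than expanding the whole tree.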
The analogy works the other way around too. If you have a complete
forecasting system, you can use that to construct a stream compressor, based
on its forecasted probabilities.
Partial forecasts
If you have a complete forecast, you can use it to compress.
However, in practice, many forecasts are partial. Partial
forecasts can be used to compress too, though not so well.
Partial forecasts can take various forms. They may omit
probability information. For example, Google Instant just lists
the likely inputs, not their probabilities. However, the predictor - if
it is behaving sensibly - usually has access to the probabilities. So,
this kind of partial forecast usually just leaves out
information the forecaster has access to anyway.
The other kind of partial forecast involves omitting some
results. E.g.
Google's auto-complete just shows you the top ten results. That
doesn't reveal much about the probabilities for most of the available
outcomes. Again, you could still use the revealed information to compress
- for example, by assuming a universal prior distribution for the
"missing" strings - but it wouldn't be very efficient.
There's a case to be made that some forecasters might want to
specialize in making partial forecasts. This allows them to
devote their attention to the few cases that they are interested in, and
pay little attention to the rest.
Compressors can do something like that too, though. For instance, if the
top symbol accounts for 98% of the probability distribution, then time
may be better spent speculatively processing the symbol after it -
before the current symbol has been observed - rather than
generating a more complete forecast for the current symbol, which
probably would not be used anyway.
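A tiny sketch of that speculation decision, with an arbitrary illustrative threshold:

```python
def should_speculate(dist, threshold=0.98):
    """If one symbol dominates the forecast distribution, return it:
    it may pay to assume that symbol arrives and start processing the
    step after it before the observation comes in. The 0.98 threshold
    is an arbitrary illustrative choice, not a recommendation."""
    top = max(dist, key=dist.get)
    return top if dist[top] >= threshold else None

# A dominant symbol triggers speculation; a flat distribution does not.
```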
I think any asymmetry here is likely to be a fairly minor one.
Often what is wanted is a general forecast - and focusing on specific
paths can be dealt with by using tree pruning.
Decompression / reversibility
There are a couple of reasons for making sure a stream compressor
used for forecasting purposes is reversible.
One is to allow memories to be replayed - and reevaluated at leisure.
Another is to do with exploring possible future trees. There, the far
edge of a future forecast involves a branching tree of possible futures.
To explore the twigs it is convenient to be able to undo observations.
This saves the trouble of repeatedly backing up the tree.
This is roughly the equivalent of using a decompressor. Note though that
being able to run information small distances forwards and backwards
through the system is a more demanding requirement than is usually made
of most decompressors.
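The reversibility requirement can be illustrated with a toy arithmetic coder: encoding narrows an interval once per symbol, and decoding runs the same model in reverse to recover them. The fixed three-symbol model and exact-fraction arithmetic are simplifications; a real coder conditions on context and emits bits incrementally.

```python
from fractions import Fraction

# Toy model: a fixed distribution over three symbols.
MODEL = [("a", Fraction(1, 2)), ("b", Fraction(1, 4)), ("c", Fraction(1, 4))]

def intervals():
    """Cumulative sub-intervals of [0, 1), one per symbol."""
    out, low = {}, Fraction(0)
    for sym, p in MODEL:
        out[sym] = (low, low + p)
        low += p
    return out

def encode(symbols):
    """Narrow [0, 1) once per symbol; any point inside the final
    interval (the midpoint here) identifies the whole message."""
    low, high = Fraction(0), Fraction(1)
    iv = intervals()
    for sym in symbols:
        a, b = iv[sym]
        low, high = low + (high - low) * a, low + (high - low) * b
    return (low + high) / 2

def decode(point, n):
    """Reverse the narrowing: reading the point back through the same
    model recovers the symbols, so the scheme is invertible."""
    iv = intervals()
    out = []
    for _ in range(n):
        for sym, (a, b) in iv.items():
            if a <= point < b:
                out.append(sym)
                point = (point - a) / (b - a)
                break
    return "".join(out)
```

Because encode and decode share one model, stepping observations backwards and forwards, as tree exploration requires, amounts to re-running these two functions over short spans.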
Latency issues
Stream compressors used for real-time applications have a parameter known
as latency - which measures the delay between accepting input symbols and
producing output symbols. Greater latency can increase the degree of
compression that is possible - by ensuring that more context is available
to the compressor before it has to commit to producing any output
symbols.
However, forecasting often doesn't work very much like that. A forecaster
typically has to make the best forecasts that it can continuously in real
time. Forecasting is a low-latency application.
A few more details about latency may be found
here.
Buffering
Stream compressors can sometimes buffer data if it comes
in too quickly - catching up when the situation becomes more predictable or
boring again.
Forecasters can't usually do that. If they don't have enough time to make
a prediction before the data comes in, too bad: they must go with
their best estimate so far.
The real-time requirements forecasters usually work under mean
that something analogous to buffering is not normally practical.
High-level forecasts
A compressor normally just needs to predict the probabilities of
the next possible symbol in the stream - so it can be encoded.
If you have a compressed model of some data, and are projecting it into
the future, there are some other things you might like to know - besides
what symbol comes next in the stream. Perhaps some of those can
be extracted by iteratively making longer-range forecasts - but there
are some other options.
If we stop and think for a moment about what happens in the human brain,
it does not simply compress and store its raw data input streams in short
term memory. Rather it uses a system of layered feature-extraction
cells which preprocess the input streams into more manageable high-level
constructs.
If we imagine the input stream at each level being compressed, then that
provides a mechanism for answering high-level queries about the future
of the input stream.
So, for example, if you are reading some English words, at a low level
there is a bit map, above that edge detection, feature detection, letter
recognition, and recognition of words and word parts. Querying the
high-level layers allows questions to be posed such as:
S-C-R-E-? or I-HAVE-GOT-TO-??? - rather than
forecasting the future at the level of low-level sensory inputs.
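A query like S-C-R-E-? against a high-level layer can be sketched as a lookup in a word-level model. The word list and counts below are invented for illustration; a real layer would hold learned statistics.

```python
# A toy word-level model: frequency counts such as a high-level layer
# might hold (the words and counts are made up for illustration).
WORD_COUNTS = {"scream": 40, "screen": 55, "screw": 25, "scribe": 10}

def complete(prefix):
    """Answer a high-level query like S-C-R-E-? by ranking the known
    words consistent with the prefix, most probable first."""
    matches = {w: c for w, c in WORD_COUNTS.items() if w.startswith(prefix)}
    total = sum(matches.values())
    return sorted(((w, c / total) for w, c in matches.items()),
                  key=lambda wp: -wp[1])

ranked = complete("scre")
# Among the toy counts, "screen" ranks first.
```

The point is that the query runs against word-level constructs, not against the pixel-level input stream.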
I think this illustrates that there are ways of making high-level
predictions about the future of a stream of data in a manner that
would not be required of a stream compressor tasked with just
losslessly storing the raw sense data.
This particular example uses feature extraction - and then
standard inductive inference. However, there may be other ways
of making high level or long-term predictions involving direct access to
the compressed model - and then doing things such as projecting trends,
or proving invariants. If this sort of thing proves to be
useful, constraints associated with interfacing directly to the
information in the compressed model may take us further away from reusing
a standard stream compression component.
Occam-related issues
Another question is whether small size is what matters most, or
whether there are other factors - such as scratch storage space usage, or
processing time. I have discussed those issues in some detail in another
essay: The One True Razor.
The brain
The human brain compresses its sensory inputs into memories as part of
its day-to-day operation. It is also constantly forecasting what
will happen next - in order to anticipate events (to give time to plan
for them) and to predict the future consequences of the available
possible actions.
It seems likely that it uses the same world model to perform all
these actions. The compressor that produces the short-term memory stream
is the same one that generates probability estimates for sense stream
data used to anticipate future events and probability estimates used for
calculating the consequences of possible actions.
Summary
Stream compression and forecasting are closely equivalent
concepts. They are not exactly equivalent - but the differences
are probably much less important than the similarities are.
If only we knew how to build good quality, general purpose stream
compressors, they would do a great job of making practical forecasts in a
wide range of domains.
Tim Tyler | Contact | http://matchingpennies.com/