[theora-dev] What sort of math is required?

Timothy B. Terriberry tterribe at vt.edu
Fri Dec 12 09:26:00 PST 2003

There are two example wavelet codecs in CVS, tarkin and w3d. IIRC, w3d is
the more recent one, but they both have been untouched for a long time
now. That said, there are things that may perform better than wavelets,
such as lapped transforms.

Also, one need not throw away block motion estimation when using
wavelets. The motion estimation and residual coding stages can be made
completely independent. Whether or not that's a good idea is another story.

> 	Funny you should ask. Over on the ffmpeg-devel list, we were
> recently tossing around some alternative compression methods and BWT came
> up. Possibly by itself or in combination with other techniques (e.g., BW
> transform the differentially-encoded DC coeffs).

The problem most people don't seem to realize, and that David Kuehling
demonstrated admirably, is that an entropy encoder does not exist in a
vacuum. Shannon's theorem says the best you can do is to store symbols
with bits proportional to the negative log of their probability. What it
does NOT say is how to estimate that probability.

There isn't much improvement you can get on the actual entropy encoding
side. Huffman is cheap and free. Arithmetic coding is good and slow, and
may eventually be free (original patent was 1979, IIRC; I don't know
what other patents, if any, on specific implementation techniques would
also need to lapse before it becomes usable), and there are fast
variations of arithmetic coding that do not require multiplications and
divisions at the loss of some efficiency. Pick whatever meets your
requirements.
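For reference, the Huffman construction mentioned above can be sketched in a few lines (the standard textbook algorithm; the symbol frequencies here are made up):

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free Huffman code from symbol frequencies."""
    tick = count()  # tie-breaker so the heap never compares code dicts
    heap = [(w, next(tick), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prepending one bit to each side.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tick), merged))
    return heap[0][2]

code = huffman_code({"a": 4, "b": 2, "c": 1, "d": 1})
# The most frequent symbol gets the shortest codeword.
```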

The rest of the things are just different probability models. This has
been well understood since the early 80's. LZ77, being developed before
then, does not have an explicit probability model-entropy encoder
separation. But bzip2, for example, despite being based on the BWT, uses
Huffman encoding in the back-end. The two stages are independent. The
BWT has been shown to approach theoretically optimal compression faster
than LZ77-type algorithms, but only under some rather restrictive
assumptions, such as having the data generated by a FSM (Finite State
Machine) source. Such sources model text very well, but video data has
more structure to it that can be exploited.

Current research is mostly focused on the context approach. What this
means is that by using the structure of the data (is this a sign bit, is
it a piece of a motion vector, is it a run value, etc.) and previously
decoded data one chooses a "context" in which to encode the next symbol.
Statistics for different contexts are kept entirely independent. This
method has been very successful, but the number of symbols encoded in
each context is so small that it is unlikely one can do much better than
assuming an order-0 probability model (i.e., all symbols in the context
are independent of each other).
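The scheme above might be sketched as a table of adaptive order-0 models, one per context (context names like "sign_bit" are purely illustrative; a real codec derives the context from the structure of the bitstream):

```python
from collections import defaultdict

class ContextModel:
    """One adaptive order-0 model per context; contexts share nothing."""
    def __init__(self, alphabet):
        # Laplace-smoothed counts, kept independently for each context.
        self.counts = defaultdict(lambda: {s: 1 for s in alphabet})

    def prob(self, ctx, sym):
        c = self.counts[ctx]
        return c[sym] / sum(c.values())

    def update(self, ctx, sym):
        self.counts[ctx][sym] += 1

m = ContextModel(["0", "1"])
for bit in "1111":          # four sign bits decoded so far, all 1
    m.update("sign_bit", bit)
# P(1 | sign_bit) has adapted to 5/6, while the untouched
# "run_value" context still says 1/2: the statistics are independent.
```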

The BWT is also a context-based method. The states of the unknown FSM
model are the contexts. The idea behind the BWT is that symbols that
occur in the same state are sorted to lie next to each other. The only
problem is that one does not know when one has moved from one context to
another, so there is a bitrate penalty while the model parameters adapt
to the statistics of the new context.
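The sort at the heart of the BWT is easy to demonstrate with the naive quadratic construction (illustration only; real implementations use suffix sorting, and the sentinel makes the transform invertible):

```python
def bwt(s, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep last column."""
    assert sentinel not in s
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # Symbols that occur in similar contexts end up adjacent here,
    # which is exactly what a back-end like Huffman can exploit.
    return "".join(r[-1] for r in rotations)

print(bwt("banana"))  # -> annb$aa
```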

PPM techniques model these contexts explicitly and thus avoid these
penalties, but require a much larger amount of memory because they keep
statistics for not just the actual states in the source model, but every
potential state.
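The memory blowup is easy to quantify: a naive order-k model over an alphabet of size A must be prepared to track A^k contexts (a toy calculation, not a PPM implementation):

```python
def potential_contexts(alphabet_size, order):
    """Contexts a naive order-k model must be prepared to store."""
    return alphabet_size ** order

# 256-symbol alphabet: order-2 already means 65536 context tables,
# even if the true FSM source only ever visits a handful of states.
print(potential_contexts(256, 2))  # -> 65536
```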

Finally, there are weighting schemes that avoid the requirement of
choosing a particular context, but instead weight the statistics of
several different contexts. This is useful because one seldom has an
explicit enough model of the source to be able to truly say that symbols
in different contexts are completely independent. These techniques,
however, are very, very slow, because one must generate a completely new
statistical distribution for every symbol encoded.
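A weighting scheme of this kind can be sketched as a linear mixture of two models' distributions (the fixed weights here are made up; real mixers adapt them per symbol, which is part of what makes them slow):

```python
def mix(dists, weights):
    """Blend several probability models into one distribution.

    This blended distribution must be recomputed for every symbol coded,
    which is why weighting is so much slower than picking one context.
    """
    assert abs(sum(weights) - 1.0) < 1e-12
    return {s: sum(w * d[s] for w, d in zip(weights, dists))
            for s in dists[0]}

order0 = {"0": 0.5, "1": 0.5}    # knows nothing about the data
context = {"0": 0.1, "1": 0.9}   # confident, but possibly mis-modeled
blended = mix([order0, context], [0.25, 0.75])
# blended["1"] = 0.25*0.5 + 0.75*0.9 = 0.8
```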

It might be possible to come up with a context scheme where symbols in a
given context benefitted from a more complex model than an order-0 one.
In general this requires a lot of symbols in each context, because there
is usually a bitrate penalty for every model parameter that must be
estimated (on the order of log(N) bits, where N is the number of symbols).
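The penalty can be seen directly by comparing an adaptive coder against one that knows the model parameter in advance (a sketch for a binary source using the Laplace estimator; the closed form for the sequence probability is standard):

```python
import math

def adaptive_cost_bits(n, k):
    """Bits to code a binary string of length n with k ones using the
    adaptive Laplace estimator P(1) = (ones_seen + 1) / (symbols_seen + 2).
    Closed form: the sequence probability is 1 / ((n+1) * C(n, k))."""
    return math.log2(n + 1) + math.log2(math.comb(n, k))

def ideal_cost_bits(n, k):
    """Bits if the true probability p = k/n were known in advance."""
    p = k / n
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

n, k = 1024, 256
penalty = adaptive_cost_bits(n, k) - ideal_cost_bits(n, k)
# A few bits, on the order of log2(n), to learn the one model parameter.
```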

To my knowledge, no one has done this. The question this really asks is,
"is it better to try to learn the underlying (optimal) context model or
to use domain knowledge to select a nearly-optimal model?" The first
approach is more flexible, and pays a bitrate penalty for that
flexibility. The second approach doesn't have this penalty, but it
wastes bits due to its sub-optimality. So far domain knowledge is
winning over universality, which is what one would expect when one knows
as much about the structure of the source as we do for video.

> 	It all seemed to come from the realization that all these video
> codecs that we've become familiar with all seem to rely on the same small
> clique of coding techniques.

Commercial codec technology usually lags far behind theoretical
research. With good reason... as long as it's theoretical no one
honestly knows whether or not it will be better in practice.

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'theora-dev-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.
