[theora-dev] What sort of math i required?

Fri Dec 12 09:26:00 PST 2003

There are two example wavelet codecs CVS, tarkin and w3d. IIRC, w3d is 
the more recent one, but they both have been untouched for along time 
now. That said, there are things that may perform better than wavelets, 
such as lapped transforms.

Also, one need not throw away block motion estimation when using 
wavelets. The motion estimation and residual coding stages can be made 
completely independent. Whether or not that's a good idea is another story.

> 	Funny you should ask. Over on the ffmpeg-devel list, we were
> recently tossing around some alternative compression methods and BWT came
> up. Possibly by itself or in combination with other techniques (e.g., BW
> transform the differentially-encoded DC coeffs).

The problem most people don't seem to realize, and that David Kuehling 
demonstrated admirably, is that an entropy encoder does not exist in a 
vacuum. Shannon's theorem says the best you can do is to store symbols 
with bits proportional to the negative log of their probability. What it 
does NOT say is how to estimate that probability.

There isn't much improvement you can get on the actual entropy encoding 
side. Huffman is cheap and free. Arithmetic coding is good and slow, and 
may eventually be free (original patent was 1979, IIRC; I don't know 
what other patents, if any, on specific implementation techniques would 
also need to lapse before it becomes usable), and there are fast 
variations of arithmetic coding that do not require multiplications and 
divisions at the loss of some efficiency. Pick whatever meets your 
requirements.

The rest of the things are just different probability models. This has 
been well understood since the early 80's. LZ77, being developed before 
then, does not have an explicit probability model-entropy encoder 
separation. But bzip2, for example, despite being based on the BWT, uses 
Huffman encoding in the back-end. The two stages are independent. The 
BWT has been shown to approach theoretically optimal compression faster 
than LZ77-type algorithms, but only under some rather restrictive 
assumptions, such as having the data generatd by a FSM (Finite State 
Machine) source. Such sources model text very well, but video data has 
more structure to it that can be exploited.

Current research is mostly focused on the context approach. What this 
means is that by using the structure of the data (is this a sign bit, is 
it a piece of a motion vector, is it a run value, etc.) and previously 
decoded data one chooses a "context" in which to encode the next symbol. 
Statistics for different contexts are kept entirely independent. This 
method has been very successful, but the number of symbols encoded in 
each context is so small that it is unlikely one can do much better than 
assuming an order-0 probability model (e.g., all symbols in the context 
are independent of each other).

The BWT is also a context-based method. The states of the unknown FSM 
model are the contexts. The idea behind the BWT is that symbols that 
occur in the same state are sorted to lie next to each other. The only 
problem is that one does not know when one has moved from one context to 
another, so there is a bitrate penalty while the model parameters adapt 
to the statistics of the new context.

PPM techniques model these contexts explicitly and thus avoid these 
penalties, but require a much larger amount of memory because they keep 
statistics for not just the actual states in the source model, but every 
potential state.

Finally, there are weighting schemes that avoid the requirement of 
choosing a particular context, but instead weight the statistics of 
several different contexts. This is useful because one seldom has an 
explicit enough model of the source to be able to truly say that symbols 
in different contexts are completely independent. These techniques, 
however, are very, very slow, because one must generate a completely new 
statistical distribution for every symbol encoded.

It might be possible to come up with a context scheme where symbols in a 
given context benefitted from a more complex model than an order-0 one. 
In general this requires a lot of symbols in each context, because there 
is usually a bitrate penalty for every model parameter that must be 
estimated (on the order of log(N) bits, where N is the number of symbols).

To my knowledge, no one has done this. The question this really asks is, 
"is it better to try to learn the underlying (optimal) context model or 
to use domain knowledge to select a nearly-optimal model?" The first 
approach is more flexible, and pays a bitrate penalty for that 
flexibility. The second approach doesn't have this penalty, but it 
wastes bits due to it's sub-optimality. So far domain knowledge is 
winning over universality, which is what one would expect when one knows 
a lot about the domain.

> 	It all seemed to come from the realization that all these video
> codecs that we've become familiar with all seem to rely on the same small
> clique of coding techniques.

Commerical codec technology usually lags far behind theoretical 
research. With good reason... as long as it's theoretical no one 
honestly knows whether or not it will be better in practice.

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'theora-dev-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.