[daala] Curious about progress of codec

Jean-Marc Valin jmvalin at jmvalin.ca
Mon Apr 25 04:58:52 UTC 2016


On 04/24/2016 10:02 PM, Jarek Duda wrote:
> 1) First of all, I don't understand why you insist on a flat initial
> probability for every frame, and not just I-frames, as you still need
> it for P and B frames?

As much as I would like to have the code adapt its probabilities from
frame to frame, it is just not a viable option for any application (e.g.
videoconference) that is based on unreliable transport. In that case,
losing a single packet means your stream is completely undecodable until
the next keyframe, which is unacceptable. We may add the feature as an
option for applications that do have reliable transport, but we just
can't rely on it everywhere.

> Writing the entire initial probability distribution every time doesn't
> seem reasonable.

Indeed, that's why we never write initial probabilities.

> Instead, you can use a parametric distribution (e.g. geometric) and just
> write the (quantized) parameter.

We have used parametric distributions for some things, but so far in all
cases I've been able to beat the parametric distribution by using an
adaptive one (even with flat initialization). Don't get me wrong -- I'd
much rather use something parametric with tables stored in ROM. But so
far it's always been a bit of a loss over adapting.
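
To make the comparison concrete, here is roughly what the parametric
option looks like (just a sketch, not Daala code -- the 8-bit parameter
grid and the minimum frequency below are arbitrary choices for
illustration):

#include <stdint.h>

#define CDF_BITS 15   /* probabilities sum to 1 << CDF_BITS */
#define MIN_FREQ 1    /* every symbol keeps a nonzero frequency */

/* Build a CDF for m symbols from a geometric distribution where the
   quantized parameter q/256 plays the role of theta, so only q would
   have to be signalled (or fixed in the spec). */
static void geometric_cdf(uint16_t *cdf, int m, int q) {
  int32_t left = 1 << CDF_BITS;  /* probability mass not yet assigned */
  int32_t acc = 0;
  for (int s = 0; s < m; s++) {
    /* Give symbol s a (1 - q/256) fraction of what is left, clamped so
       the remaining symbols can still get MIN_FREQ each. */
    int32_t f = s == m - 1 ? left : (left*(256 - q)) >> 8;
    if (f < MIN_FREQ) f = MIN_FREQ;
    if (f > left - MIN_FREQ*(m - 1 - s)) f = left - MIN_FREQ*(m - 1 - s);
    acc += f;
    left -= f;
    cdf[s] = (uint16_t)acc;   /* cdf[m - 1] ends up at 1 << CDF_BITS */
  }
}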

> Or don't write it down, only fix the initial probability for a given ID
> in the standard (averaged) - especially to emphasize the fact that your
> '0' symbol often has a dominating probability.

This is something we have considered, but it's a lot of work. Right now
we're only doing it for motion vectors (initializing with a high
probability of zero), but doing it for all symbols would require
recomputing the initial probabilities on every change to the bit-stream.
My estimate is that if we were to include an initial probability, we
could save on the order of 1-2% in Daala, which is something you might
want to do just before freezing a bit-stream, but not during development.
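
To give an idea of what that kind of initialization looks like (the 50%
figure below is made up for illustration, not the value we use for
motion vectors):

#include <stdint.h>

#define CDF_TOTAL (1 << 15)

/* Start an m-symbol CDF with roughly half of the probability mass on
   symbol 0 and the rest spread uniformly, instead of a flat init. */
static void init_zero_biased_cdf(uint16_t *cdf, int m) {
  int32_t rest = (CDF_TOTAL/2)/(m - 1);
  cdf[0] = (uint16_t)(CDF_TOTAL/2);
  for (int s = 1; s < m; s++) cdf[s] = (uint16_t)(cdf[s - 1] + rest);
  cdf[m - 1] = CDF_TOTAL;  /* absorb the rounding error in the last symbol */
}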

> The main purpose of your current expensive symbol-wise adaptation is to
> quickly get out of this terrible flat distribution.
> In contrast, its main purpose should be to adapt to local variations of
> the density - which should remain close to the initial density.

What we have now actually does both. The best way to think about it is
that at the beginning you just have an average of what you have seen so
far, and eventually it starts "forgetting" and behaving like what you
suggest.
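
A sketch of one way to get that kind of behavior (not necessarily the
exact update we use, just the general idea): keep counts and rescale
them once the total reaches a window, so early on you are simply
averaging and later you are forgetting:

#include <stdint.h>

#define WINDOW 256   /* once the total reaches this, counts get halved */

typedef struct {
  uint32_t count[16];   /* per-symbol frequencies (up to 16 symbols here),
                           initialized to 1 each for a flat start */
  uint32_t total;
  int nsyms;
} adapt_ctx;

/* Update the model after coding symbol s.  While total < WINDOW this is
   a plain running average; once counts start getting halved, distant
   history is exponentially forgotten. */
static void adapt_update(adapt_ctx *ctx, int s) {
  ctx->count[s] += 2;
  ctx->total += 2;
  if (ctx->total >= WINDOW) {
    ctx->total = 0;
    for (int i = 0; i < ctx->nsyms; i++) {
      ctx->count[i] -= ctx->count[i] >> 1;  /* halve, but stay nonzero */
      ctx->total += ctx->count[i];
    }
  }
}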

> 2) Good to hear that you are handling your entropy coding problem.
> If you have a previously unknown accurate multi-symbol entropy coder,
> it would be beneficial to share it with the community and compare it
> with others, e.g.:
> http://encode.ru/threads/1920-In-memory-open-source-benchmark-of-entropy-coders?p=45168&viewfull=1#post45168

I don't think what we have for power-of-two probabilities is by any
means ground-breaking or anything. We are basically scaling the
probabilities by the range, i.e. using a multiplication. When it comes
to non-powers of two, it's a bit more of a hack and currently requires
two 8-bit multiplications.
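
For the power-of-two case, the partition really is just the textbook one
(a sketch of the general technique, not our actual coder internals):

#include <stdint.h>

#define CDF_BITS 15   /* probability total is 1 << CDF_BITS */

/* Split the coder's range according to the CDF with one multiplication.
   fl and fh are the cumulative frequencies just below and just above
   the symbol being coded; renormalization/carry handling is omitted. */
static void encode_interval(uint32_t *low, uint32_t *range,
                            uint32_t fl, uint32_t fh) {
  uint32_t r = *range >> CDF_BITS;   /* scale the range by the total */
  *low += r*fl;
  *range = r*(fh - fl);
}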

> 3) Again, your count[i]/totalcount adaptation is good for quickly
> getting out of the terrible flat distribution.
> For the actual reason adaptation is usually used - adapting to local
> variations of the density - it is crucial to use exponential
> forgetting, so that local behavior has a larger influence than distant
> history - something that is much weaker in your adaptation.

We *are* using exponential forgetting right now.

> Regarding the accuracy issue of this cheap SIMD-able exponential
> adaptation:
> 
> for (int i = 1; i < m; i++) CDF[i] -= (CDF[i] - mixCDF[i]) >> rate;
> 
> there is none if mixCDF is chosen properly, so that an unused symbol
> decays to the lowest possible nonzero frequency (it can be tabulated as
> mixCDF[s,i]):
> 
> CDF[0] = 0
> CDF[m] = 2^n      // denominator, n = 14 or 15 for 16-bit rANS renormalization
> rate >= 1         // the larger, the slower the adaptation
> s = 0 .. m-1      // symbols, Pr(s) ~ (CDF[s+1] - CDF[s])/2^n
> step = 2^n - m + 2^rate - 1    // so that mixCDF[m-1] = 2^n - 1
> 
> for (int i = 1; i < m; i++) mixCDF[i] = i - (1 << rate);  // to lead to minimal: CDF[s+1] - CDF[s] = 1
> for (int i = s+1; i < m; i++) mixCDF[i] += step;          // CDF for single symbol s, jump from s+1
> 
> You can test it with this Mathematica notebook:
> https://dl.dropboxusercontent.com/u/12405967/adapt.nb

Sorry, I don't have Mathematica, I use Octave and C! In any case, I
think you have an off-by-one error in your mixCDF[] but I've been able
to work out the math and as far as I can tell, it's mathematically
equivalent(*) to what we did in
https://github.com/xiph/daala/commit/3cd606c24af
but moving all of the floor computation into mixCDF[], which saves one add.

(*) The only difference is the (1 << rate) offset, which I think should
be (1 << rate) - 1; that avoids a large floor on the first symbol
(something I did not have).
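
Spelled out, with that offset folded in, it looks like this (a sketch
only -- n = 15, the names, and the uint16_t CDF are my choices here, not
necessarily what is in the Daala code):

#include <stdint.h>

#define CDF_BITS 15   /* CDF[m] == 1 << CDF_BITS */

/* Mixing target for symbol s with the (1 << rate) - 1 offset, so that
   mixCDF[m-1] really is 2^n - 1 and unused symbols decay to the lowest
   possible nonzero frequency. */
static void make_mix_cdf(int32_t *mixCDF, int m, int s, int rate) {
  int32_t step = (1 << CDF_BITS) - m + (1 << rate) - 1;
  for (int i = 1; i < m; i++) mixCDF[i] = i - (1 << rate) + 1;
  for (int i = s + 1; i < m; i++) mixCDF[i] += step;
}

/* Shift-based adaptation toward the precomputed target (relies on an
   arithmetic right shift for negative values, like the original). */
static void adapt_cdf(uint16_t *CDF, const int32_t *mixCDF, int m, int rate) {
  for (int i = 1; i < m; i++) {
    int32_t diff = (int32_t)CDF[i] - mixCDF[i];
    CDF[i] = (uint16_t)(CDF[i] - (diff >> rate));
  }
}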

> 4) My suggestion was to allow the encoder to choose among a few rates,
> e.g. use 2 bits to choose from {3,4,5,6}.
> The encoder can cheaply test all of them (sum of log(1/p)) and choose
> the one leading to the smallest number of required bits - the
> differences are usually much larger than the 2 bits needed to choose
> the rate.

It's not that simple. You want your encoder to make RDO decisions based
on what it'll cost to code a particular symbol in a particular context.
Either you re-encode (with different RDO decisions) for each rate, or
else you have to live with suboptimal RDO. Also, in general it's nice to
be able to encode symbols as you go rather than having to buffer the
entire list of symbols and encode everything at the end. It doesn't make
a difference with rANS since you already have to do that, but the Daala
entropy coder allows you to write the bit-stream as you encode it.
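
Just to be clear about what the per-rate comparison in your suggestion
involves, it would be computed over a buffered frame, something like
this (hypothetical helper; probs[] holds the probability the adapted
model assigned to each symbol at the time it was coded, replayed once
per candidate rate):

#include <math.h>

/* Total cost, in bits, of a buffered list of symbols under one
   adaptation rate: sum of log2(1/p) over the symbols. */
static double estimate_bits(const double *probs, int nsymbols) {
  double bits = 0;
  for (int i = 0; i < nsymbols; i++) bits -= log2(probs[i]);
  return bits;
}

/* Pick the cheapest rate out of a small candidate set, e.g. {3,4,5,6}. */
static int pick_rate(const double *const probs_per_rate[], int nrates,
                     int nsymbols) {
  int best = 0;
  for (int r = 1; r < nrates; r++) {
    if (estimate_bits(probs_per_rate[r], nsymbols) <
        estimate_bits(probs_per_rate[best], nsymbols)) best = r;
  }
  return best;
}

The catch is that probs[] itself depends on the RDO decisions, which is
exactly the chicken-and-egg problem described above.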

Cheers,

	Jean-Marc

