[Theora-dev] Changing the IDCT spec

Fri Feb 11 11:28:33 PST 2005

So, in preparation for some decoder optimization work planned by Rudolf 
Marek, the subject of the size of the registers needed in the IDCT 
came up.

The current spec language ensures that the result is exactly compatible 
with the C code for VP3. This language requires that some of the 
arguments to the multiplies be 17 or 18 bits, because they need to hold 
the sum or difference of two 16-bit numbers. This necessitates using 
32-bit registers, which greatly reduces potential parallelism for SIMD 
instructions (not to mention making an implementation much more 
complicated on embedded chipsets with 16-bit registers).
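As a rough sketch of where the extra bit comes from, the following C 
fragment counts how many two's-complement bits the sum of two 16-bit 
values can need. The helper names (signed_bits_needed, sum_wide) are 
made up for illustration and are not taken from the VP3 or Theora 
sources. It only shows the 17-bit case for a single add.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of two's-complement bits (sign bit
 * included) needed to represent v exactly. */
static int signed_bits_needed(int32_t v)
{
    int bits = 1; /* the sign bit */
    /* For negative v, -(v + 1) is the magnitude that has to fit
     * below the sign bit; for non-negative v it is v itself. */
    uint32_t mag = v < 0 ? (uint32_t)(-(v + 1)) : (uint32_t)v;
    while (mag != 0) {
        bits++;
        mag >>= 1;
    }
    return bits;
}

/* Hypothetical helper: add two 16-bit values in a 32-bit register,
 * so the result cannot overflow. */
static int32_t sum_wide(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    /* Range of the sum of any two int16_t values. */
    assert(s >= -65536 && s <= 65534);
    return s;
}
```

For example, signed_bits_needed(sum_wide(32767, 32767)) is 17, while 
signed_bits_needed(32767) is 16, so the sum no longer fits in a 16-bit 
lane.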

However, upon reviewing VP3's own MMX routines, I discovered that they 
used 16-bit registers anyway. Thus, in VP3 where the code IS the spec, 
the code still doesn't match the spec.

Now, I want to emphasize, in practical terms, the differences have very 
little real effect. Given normal pixel values, the resulting DCT 
coefficients should not even come close to overflowing the registers 
during the IDCT (there are about 3 bits to spare). Even with some pretty 
severe quantization errors, that seems to be enough headroom.

However, the specification does not specify the encoder's operation; it 
specifies the decoder's. It is possible to store coefficients in the 
bitstream that would cause overflow, and we need to standardize what to 
do in such cases. When I wrote the section in the spec, I took the 
approach of "do what the code does, no matter how much it hurts 
optimization", but knowing now that the code does two different things, 
we have a choice.

However, the spec has now been included in an official release (alpha4), 
and I know several people have begun or completed independent 
implementations (e.g., Andrey Filippov's FPGA encoder, Robert 
Brautigam's Java port, for sure, and I remember some talk of a DSP stamp 
from either Aaron Colwell or the Fluendo folks). So I don't want to 
make such a significant change to the language of the spec without 
soliciting input from the people it will affect.

So, to summarize, there are two choices:
1) Truncate the result of each intermediate step in the IDCT to 16 bits, 
providing for better SIMD and 16-bit architecture optimization, but 
requiring slightly more work in a 32-bit C implementation, or
2) Keep the current language, allowing some intermediate results to grow 
to 17 or 18 bits, requiring 32-bit registers.
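
To make the two choices concrete, here is a minimal sketch of a single 
add-then-multiply step done both ways. It is not the actual IDCT from 
the spec or from VP3: the function names, the choice of one step, and 
the use of 46341 (roughly 65536*cos(pi/4)) as the constant are all 
assumptions made for illustration. The 16-bit wrap and the floor shift 
are written out by hand so the sketch does not depend on 
implementation-defined conversions or signed right shifts in C.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constant only: about 65536 * cos(pi/4). */
#define C4_EXAMPLE 46341

/* Keep only the low 16 bits of v and reinterpret them as a signed
 * value, which is what a 16-bit SIMD lane would hold. */
static int32_t wrap16(int32_t v)
{
    uint32_t u = (uint32_t)v & 0xFFFFu;
    int32_t r = (u & 0x8000u) ? (int32_t)u - 0x10000 : (int32_t)u;
    assert(r >= -32768 && r <= 32767);
    return r;
}

/* floor((c * x) / 65536), computed in 64 bits. */
static int32_t mul_shift16(int32_t c, int32_t x)
{
    int64_t p = (int64_t)c * x;
    int64_t q = p / 65536;      /* C division truncates toward zero */
    if (p % 65536 < 0)
        q--;                    /* adjust to floor for negative p */
    return (int32_t)q;
}

/* Choice 2: the sum is allowed to grow past 16 bits. */
static int32_t step_wide(int16_t a, int16_t b)
{
    return mul_shift16(C4_EXAMPLE, (int32_t)a + (int32_t)b);
}

/* Choice 1: every intermediate result is truncated to 16 bits. */
static int32_t step_trunc16(int16_t a, int16_t b)
{
    int32_t s = wrap16((int32_t)a + (int32_t)b);
    return wrap16(mul_shift16(C4_EXAMPLE, s));
}
```

With small inputs such as (100, 200) both functions return 212. With 
(30000, 30000) the sum no longer fits in 16 bits, so step_wide returns 
42426 while step_trunc16 returns -3915. Only inputs of that second kind 
are affected by the choice.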

Either choice should have no real effect on content encoded with any of 
the encoders I am aware of. Both are equally compatible with existing 
VP3 content, as different VP3 codepaths follow both approaches. If 
anything, the first approach is probably used more often since most PCs 
from the last 9 years have had some kind of MMX support.

Thoughts? Opinions?