[Theora-dev] Questions, MMX and co.

Wed Aug 25 11:40:34 PDT 2004

Thanks to all those who work on MMX (I do not count myself among them, given
the little compile-and-run I've done). However, looking at the various
implementation paths, it seems to me it would be nice to answer the following
design/development questions before competition among different variants of
the same code starts to pollute the debate. (Speed issues *always* generate
flamewars in the end... :-)

1) Should the C, MMX, MMXEXT, SSE (and possibly later on SSE3 or SSE4) 
variants of functions be:
1-A) selected at compile time (via #ifdef or compiler flags), as HoSwiYO did
for the decoder and as I did last year: one binary version for each;
1-B) all available simultaneously in the library and selected at run time
(thus probably using the (*funcPointer)(a,b) approach, as Wim did in his
encoder patch);
1-C) some more complex solution (ideas? dlopen()?)
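
To make 1-A and 1-B concrete, here is a minimal sketch in C. All names
(sad8_c, sad8_fast, sad8_static, select_sad8, USE_MMX, cpu_has_mmx) are
hypothetical and not taken from the Theora tree; the "fast" variant is plain
C standing in for an MMX routine, and CPU detection is left to the caller.

```c
#include <stddef.h>
#include <stdlib.h>

/* Reference C version: sum of absolute differences over 8 bytes. */
unsigned sad8_c(const unsigned char *a, const unsigned char *b)
{
    unsigned sum = 0;
    for (size_t i = 0; i < 8; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Stand-in for an MMX version (plain C here so the sketch builds anywhere). */
unsigned sad8_fast(const unsigned char *a, const unsigned char *b)
{
    unsigned sum = 0;
    for (size_t i = 0; i < 8; i++)
        sum += a[i] > b[i] ? (unsigned)(a[i] - b[i]) : (unsigned)(b[i] - a[i]);
    return sum;
}

typedef unsigned (*sad8_fn)(const unsigned char *, const unsigned char *);

/* 1-A: the choice is frozen at compile time, e.g. with gcc -DUSE_MMX. */
#ifdef USE_MMX
#define sad8_static sad8_fast
#else
#define sad8_static sad8_c
#endif

/* 1-B: both variants are linked in and one is picked at run time. */
sad8_fn select_sad8(int cpu_has_mmx)
{
    return cpu_has_mmx ? sad8_fast : sad8_c;
}
```

With 1-B the caller would do something like "sad8_fn sad = select_sad8(flag);"
once at init time and then call sad(a, b) through the pointer; with 1-A the
call to sad8_static(a, b) is resolved directly by the compiler and can be
inlined.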

2) Which compilers should be supported?

3) What is preferred:
3-A) inline assembly,
3-B) (x)mmintrin.h-based MMX functions (Intel compiler, GCC, maybe others)?
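
For 3-B, a minimal sketch of what intrinsics-based code can look like. It
uses the SSE2 intrinsic _mm_avg_epu8 from emmintrin.h rather than the
MMX-register forms, only because SSE2 is present on every x86-64 target; the
function name avg16 is hypothetical, and a plain C fallback keeps the sketch
building on other architectures.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Rounded per-byte average of two 16-byte blocks: dst[i] = (a[i]+b[i]+1)>>1. */
void avg16(unsigned char *dst, const unsigned char *a, const unsigned char *b)
{
#if defined(__SSE2__)
    /* Unaligned loads/stores, so the caller needs no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_avg_epu8(va, vb));
#else
    /* Portable fallback with the same rounding as the SIMD path. */
    for (size_t i = 0; i < 16; i++)
        dst[i] = (unsigned char)((a[i] + b[i] + 1) >> 1);
#endif
}
```

The compiler does the register allocation and scheduling here, which is the
main argument for 3-B over hand-written inline assembly.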

4) How to benchmark the implementation? (I'm still using the small wav+yuv 
video with this cute little girl singing but I guess something more serious 
should be done...) If possible, it should be easily accessible to everyone 
(no expensive digital equipment, no multi-gigabyte downloads) so that 
everyone could reproduce the test and compare results.

5) How to *validate* the implementation? It is probably easy to introduce
bias in the encoder when translating C to MMX. On the other hand, sometimes
the output is different and this is normal (pavgb computes (A+B+1)/2, which
is more precise than (A+B)/2). Maybe we can simply trust experimental work,
or rely on a good 4), but then...?
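
To show how often the two averaging formulas in 5) disagree, here is a small
scalar model in C. The function names are hypothetical; avg_round models the
per-byte result of the rounding average, avg_trunc the usual C expression.

```c
/* Truncating average, as plain C code would typically write it. */
unsigned char avg_trunc(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b) / 2);
}

/* Rounding average: (a + b + 1) / 2, computed without 8-bit overflow. */
unsigned char avg_round(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b + 1) / 2);
}

/* Count the (a, b) byte pairs on which the two versions disagree. */
unsigned count_mismatches(void)
{
    unsigned n = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (avg_trunc((unsigned char)a, (unsigned char)b) !=
                avg_round((unsigned char)a, (unsigned char)b))
                n++;
    return n;
}
```

The two differ by exactly one whenever A+B is odd, that is, on half of all
input pairs, so a bit-exact comparison between a C path and an MMX path will
fail even when both are "correct".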

6) Who has the definitive answer to the above questions? Or, by the way, who
rules Xiph/Theora? :-)

Rodolphe

My own 0.02:

Note that 1 and 3 affect performance: IMHO, 1-A + 3-B gives the maximal
performance gain. (But then, I'm GCC-centric, and that's the usual GCC way:
source code is available and maintenance is done at night... :-)

Also, I'd say that 2) should only include "GCC >=3.4", but that's definitely
extremely selfish... It's just that the Intel compiler is already better than
GCC (wrt. performance of generated code), so let's handicap it a little! :-)

For four :-), I'd say that maybe we should select a few tracks from common
commercial video DVDs that possibly everyone in the computer development
business has already bought (like one of the Lord of the Rings or Matrix
trilogies; I'd bet that everyone already has one of these 6) and publish a
few scripts for transcoding the selected tracks and isolating Theora video
encoding CPU time.

Concerning 5, well, I tried a framework for back-to-back testing of the C and
assembly implementations last year (i.e., execute *both* the C function and
the asm one at run time with the same parameters and compare the accuracy of
the results). But that's another level of #ifdef definitions. Well, it caught
some bugs for me, but I wonder if it was worth the effort... No opinion. I'd
rather trust 4.
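
For the record, the back-to-back idea can be sketched in a few lines of C.
This is not the framework I used, only an illustration: the names (avg_fn,
ref_avg, opt_avg, back_to_back) and the tolerance parameter are hypothetical,
and opt_avg is plain C standing in for an assembly routine.

```c
#include <stdlib.h>

typedef unsigned char (*avg_fn)(unsigned char, unsigned char);

/* Reference C implementation (truncating average). */
unsigned char ref_avg(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b) / 2);
}

/* Stand-in for the optimized routine (rounding average). */
unsigned char opt_avg(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b + 1) / 2);
}

/*
 * Run both implementations on every byte pair and return the number of
 * inputs where the results differ by more than `tolerance`.
 */
unsigned back_to_back(avg_fn ref, avg_fn opt, int tolerance)
{
    unsigned failures = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++) {
            int r = ref((unsigned char)a, (unsigned char)b);
            int o = opt((unsigned char)a, (unsigned char)b);
            if (abs(r - o) > tolerance)
                failures++;
        }
    return failures;
}
```

With tolerance 0 the harness flags every odd A+B as a failure; with
tolerance 1 it accepts the rounding difference but would still catch a real
bug, which is roughly the trade-off behind question 5.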

Concerning 6, would you trust voting machines that use punch cards, or that
have an MMX-optimized crypto engine? :-)))

On Wednesday 25 August 2004 00:37, VP3HoSwiYO wrote:
> good morning everybody.
> I have finished converting mmx decode code to gcc.
> Now this can be compiled with both vc and gcc.
>
> http://kyoto.cool.ne.jp/vp3/developers/theora-a3-MMXd.zip
>
>
> VP3HoSwiYO