[Theora-dev] Questions, MMX and co.

Thu Aug 26 10:25:57 PDT 2004

Hi,

I have some experience with optimization since I'm part of the XviD team,
so I thought I'd give you my view of things before you have to go through 
the same painful experience as we did (and do)...

On Wed, 25 Aug 2004, David Kuehling wrote:
> My two cents: if possible always select the variant at run-time.  With
> all the CPU variants currently on the market (MMX, MMXEXT, SSE, SSE2,
> 3DNOW and whatever...) it would be a terrible headache for binary Linux
> distributions to provide properly optimized packages.  Mplayer as an
> example is AFAIK completely run-time CPU detection based.  

MPlayer has a compile-time switch to enable or disable run-time CPU
detection.
But anyway, many SIMD-optimized projects check for CPU flags on the fly,
and it works very nicely using function pointers, set in an init phase or
at first call. We also benchmarked the overhead, and even for tiny
operations like 8x8 SAD it was negligible. I guess branch prediction is
working well these days.
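
To make the function-pointer approach concrete, here is a minimal sketch in
C. It is not XviD's or MPlayer's actual code: the names sad8x8_c,
sad8x8_fast, cpu_has_sse2 and sad_init are made up for illustration,
sad8x8_fast is only a plain C stand-in for a real SIMD routine, and the CPU
check assumes GCC or Clang on x86 (other compilers simply get the C version).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef unsigned (*sad8x8_fn)(const uint8_t *cur, const uint8_t *ref, int stride);

/* Plain C reference: sum of absolute differences over an 8x8 block. */
static unsigned sad8x8_c(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

/* Stand-in for an optimized variant (hypothetical). A real project would
 * put MMX/SSE2 code here; this one just walks the rows with pointers. */
static unsigned sad8x8_fast(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 8; y++, cur += stride, ref += stride)
        for (int x = 0; x < 8; x++)
            sum += (unsigned)abs(cur[x] - ref[x]);
    return sum;
}

/* Run-time CPU check. Assumes the GCC/Clang x86 builtins; elsewhere we
 * report "no SSE2" so the C version is selected. */
static int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;
#endif
}

static unsigned sad8x8_first_call(const uint8_t *cur, const uint8_t *ref, int stride);

/* Callers always go through this pointer. */
static sad8x8_fn sad8x8 = sad8x8_first_call;

/* Init phase: pick the implementation once. */
static void sad_init(void)
{
    sad8x8 = cpu_has_sse2() ? sad8x8_fast : sad8x8_c;
    assert(sad8x8 != sad8x8_first_call);
}

/* "Set at first call": the initial target resolves the pointer, then
 * forwards the call. Later calls jump straight to the chosen variant. */
static unsigned sad8x8_first_call(const uint8_t *cur, const uint8_t *ref, int stride)
{
    sad_init();
    return sad8x8(cur, ref, stride);
}
```

Callers just write sad8x8(cur, ref, stride); the only per-call cost is one
indirect call. In a threaded program the explicit init phase is the safer
of the two options, since the first-call trick writes the pointer lazily.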

My main point, though, was to vote in favour of compiler _intrinsics_!
At XviD, we chose NASM as the assembler for external ASM files, because it
has a strong macro language and is available for many platforms,
including Windows and Linux, of course. Also, intrinsics weren't developed
very far when we started.
But there are at least two drawbacks, and if we had to decide again today,
we might decide differently:

1) External ASM files usually require an additional program besides the
compiler. One extra program isn't much to ask, and we didn't think this
was a problem, but now exactly the worst case has happened: nasm is not
available for 64-bit AMD, so all our nice and compatible SIMD code isn't
working in native 64-bit mode. :-(
In between, there were problems as well, when people had different versions
of nasm installed, one of which was slightly buggy, etc. No fun!

2) External assemblers usually don't provide debugging/profiling
information (at least not both). That turned out to be a big point
for us when doing time-critical optimization in a realistic environment:
you have to know where the hotspots are, and what exactly causes them
(arithmetic or memory transfers), in order to optimize further.

Inline assembler is plain ugly to support, especially with the problem of
AT&T vs. Intel syntax. I guess everybody agrees on that.
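
For comparison, this is roughly what an 8x8 SAD looks like with intrinsics.
It is only a sketch, assuming the standard SSE2 intrinsics from
<emmintrin.h>; the function names sad8x8_ref and sad8x8_intrin are made up,
and on targets where the compiler does not define __SSE2__ the code falls
back to the plain C loop. The point is that this is ordinary C: the same
compiler builds it, and it gets the usual debug and profiling information.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Plain C reference for checking the result. */
static unsigned sad8x8_ref(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

/* Same operation written with intrinsics where SSE2 is available. */
static unsigned sad8x8_intrin(const uint8_t *cur, const uint8_t *ref, int stride)
{
    assert(stride >= 8);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 8; y++) {
        /* Load one 8-pixel row into the low half; the high half is zero. */
        __m128i a = _mm_loadl_epi64((const __m128i *)(cur + y * stride));
        __m128i b = _mm_loadl_epi64((const __m128i *)(ref + y * stride));
        /* PSADBW: sum of |a[i] - b[i]| over the low 8 bytes lands in the
         * low 64-bit lane; the high lane sums zeros and stays 0. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(a, b));
    }
    /* Worst case is 64 * 255 = 16320, so the low 32 bits are enough. */
    return (unsigned)_mm_cvtsi128_si32(acc);
#else
    return sad8x8_ref(cur, ref, stride);
#endif
}
```

There is no register allocation to do by hand and no AT&T/Intel syntax
question; the price is that the generated code depends on how well the
compiler handles the intrinsics.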

> I thought that at least the (I)DCT should be bit-perfectly equal to the
> reference encoder.  Else you will have terrible artifacts if people
> encode movies with large keyframe distances (I already encoded
> Theora-movies with keyframes spaced 512 frames apart).

(i)DCT is a bad choice for a bit-exact implementation. AFAIK that's
exactly the reason why MS WMV9 and H.264 switched to 16-bit integer
transforms. You cannot ask for bit-exact decoding of a DCT, because SIMD,
floating-point and fixed-point implementations typically do differ a
little, and there simply are too many possible input values to avoid all
errors. I don't know if a bit-exact e.g. MMX iDCT would even be possible,
but it certainly wouldn't be fast.
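
A small experiment shows the kind of difference I mean. This is a sketch,
not any codec's real iDCT: it is a direct 1-D, 8-point orthonormal inverse
DCT, once in double precision and once with the basis rounded to 12
fractional bits. The function names, the 12-bit scale, the input range of
-255..255 and the little random generator are all arbitrary choices for
illustration. With those choices the two outputs can differ by at most 1,
because the accumulated coefficient rounding error stays below 0.25 before
the final rounding.

```c
#include <assert.h>
#include <stdint.h>

/* cos(m*pi/16) for m = 0..8, hardcoded so no math library is needed. */
static const double COS16[9] = {
    1.0, 0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.1950903220161283, 0.0
};

/* Basis value c(k) * cos((2n+1)*k*pi/16), c(0) = sqrt(1/8), c(k>0) = 1/2. */
static double basis(int k, int n)
{
    int m = ((2 * n + 1) * k) % 32;                   /* period is 32 */
    if (m > 16) m = 32 - m;                           /* cos(2pi - t) = cos(t) */
    double c = (m > 8) ? -COS16[16 - m] : COS16[m];   /* cos(pi - t) = -cos(t) */
    return (k == 0 ? 0.35355339059327373 : 0.5) * c;
}

/* floor(v + 0.5), written without libm. */
static long round_half_up(double v)
{
    double t = v + 0.5;
    long i = (long)t;                 /* truncates toward zero */
    return ((double)i > t) ? i - 1 : i;
}

/* "Reference" version: double-precision accumulation, rounded at the end. */
static void idct8_float(const int in[8], int out[8])
{
    for (int n = 0; n < 8; n++) {
        double acc = 0.0;
        for (int k = 0; k < 8; k++)
            acc += basis(k, n) * in[k];
        out[n] = (int)round_half_up(acc);
    }
}

/* Fixed-point version: basis scaled by 4096 and rounded to integers.
 * (A real implementation would precompute this table once.) */
static void idct8_fixed(const int in[8], int out[8])
{
    for (int n = 0; n < 8; n++) {
        long acc = 0;
        for (int k = 0; k < 8; k++)
            acc += round_half_up(basis(k, n) * 4096.0) * in[k];
        acc += 2048;                  /* rounding offset, half of 4096 */
        /* Floor division, avoiding a right shift of a negative value. */
        out[n] = (int)(acc >= 0 ? acc / 4096 : -((-acc + 4095) / 4096));
    }
}

/* Feeds pseudo-random coefficients in [-255, 255] to both versions.
 * Returns the largest output difference; *mismatches counts the samples
 * where the two versions disagree. */
static int compare_idcts(int blocks, int *mismatches)
{
    uint32_t seed = 12345u;
    int max_diff = 0;
    assert(blocks > 0);
    *mismatches = 0;
    for (int b = 0; b < blocks; b++) {
        int in[8], f[8], x[8];
        for (int k = 0; k < 8; k++) {
            seed = seed * 1664525u + 1013904223u;     /* simple LCG */
            in[k] = (int)((seed >> 16) % 511u) - 255;
        }
        idct8_float(in, f);
        idct8_fixed(in, x);
        for (int n = 0; n < 8; n++) {
            int d = f[n] > x[n] ? f[n] - x[n] : x[n] - f[n];
            if (d) (*mismatches)++;
            if (d > max_diff) max_diff = d;
        }
    }
    return max_diff;
}
```

Calling compare_idcts(1000, &count) reports how often the two versions
disagree and by how much. Any disagreement, however small, is already
enough to break bit-exactness against a reference decoder.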

There are many sources on how a "good" iDCT is supposed to look and how
much it "should" differ from the reference one. ffmpeg, for example, has a
whole bunch of iDCTs implemented, and except for the libmpeg2 one
(which shouldn't be used anymore), they all "look the same", even for long
keyframe distances.

On the encoder side, you typically don't care much about which forward
transform is used or how accurate it is, since transform differences there
just lead to lower or higher efficiency, not to error drift at the decoder
side, and how efficient an encoder is, is up to whoever creates it.

chl