[Theora-dev] Questions, MMX and co.
chl at math.uni-bonn.de
Thu Aug 26 10:25:57 PDT 2004
I have some experience with optimization since I'm part of the XviD team,
so I thought I'd give you my view of things before you have to go through
the same painful experience as we did (and do)...
On Wed, 25 Aug 2004, David Kuehling wrote:
> My two cents: if possible always select the variant at run-time. With
> all the CPU variants currently on the market (MMX, MMXEXT, SSE, SSE2,
> 3DNOW and whatever...) it would be a terrible headache for binary Linux
> distributions to provide properly optimized packages. Mplayer as an
> example is AFAIK completely run-time CPU detection based.
MPlayer has a compile-time switch to enable or disable run-time CPU
detection.
But anyway, many SIMD-optimized projects check for CPU flags on the fly,
and it works very nicely using function pointers, set in an init phase or
at first call. We also benchmarked the overhead, and even for tiny
operations like an 8x8 SAD it was negligible. I guess branch prediction is
working well these days.
But my main point was to vote in favour of compiler _intrinsics_!
At XviD, we chose NASM as the assembler for external ASM files, because it
has a strong macro language and is available for many platforms,
including Windows and Linux, of course. Also, intrinsics weren't very far
along when we started.
But there are at least two drawbacks, and if we had to decide again today,
we might decide differently:
1) External ASM files usually require an additional program besides the
compiler. One extra program isn't much to ask, and we didn't think this
was a problem, but now exactly the worst case has happened: NASM is not
available for 64-bit AMD, so none of our nice and portable SIMD code
works in native 64-bit mode. :-(
In between, there were problems as well, when people had different
versions of NASM installed, one of which was slightly buggy. No fun!
2) External assemblers usually don't provide debugging/profiling
information (at least not both). That turned out to be a big point
for us when doing time-critical optimization in a realistic environment:
you have to know where the hotspots are, and what exactly causes them
(arithmetic or memory transfers), in order to optimize further.
Inline assembler is plain ugly to support, especially given the
AT&T/Intel syntax problem. I guess everybody agrees on that.
> I thought that at least the (I)DCT should be bit-perfectly equal to the
> reference encoder. Else you will have terrible artifacts if people
> encode movies with large keyframe distances (I already encoded
> Theora-movies with keyframes spaced 512 frames apart).
The (i)DCT is a bad choice for a bit-exact implementation. AFAIK that's
exactly the reason why MS WMV9 and H.264 switched to 16-bit integer
transforms. You cannot ask for bit-exact decoding of a DCT, because SIMD,
floating-point, and fixed-point implementations typically differ a
little, and there are simply too many possible input values to avoid all
errors. I don't know whether a bit-exact MMX iDCT would even be possible,
but it certainly wouldn't be fast.
There are many sources on how a "good" iDCT is supposed to look and how
much it "should" differ from the reference one. ffmpeg, for example, has a
whole bunch of iDCTs implemented, and except for the libmpeg2 one
(which shouldn't be used anymore), they all "look the same", even over
long keyframe distances.
On the encoder side, you typically don't care which transform is used or
how good it is, since transform differences just lead to lower or higher
coding efficiency, not to error drift on the decoder side, and how
efficient an encoder is is up to whoever creates it.