[theora-dev] Benchmarks Inline-ASM vs. Intrinsics

Wed Feb 11 05:47:35 PST 2009

Hi folks, FYI:

I've finally run some benchmarks comparing inline-assembler against 
intrinsic-based mmx code.

So far I've only applied the changes to the fragment reconstruction 
functions, as the IDCT and loopfilter have not been ported yet. 
Nevertheless, here are some numbers:

As a baseline I'll take the current version from the trunk with all 
inline assembler functions enabled. Lower values mean lower performance.

    All functions with inline-asm:           100%
    inter_mmx replaced by C-function:         93%
    no mmx at all:                            60%
    all oc_frag functions intrinsic based:    98%

As you can see, the current bugfix for mozilla costs only a 7% 
performance hit. Imho that's something we could live with. The 
intrinsic-based approach is nearly as good as the handwritten code, and 
it compiles with gcc as well as VS.net (haven't tried it under Linux 
yet, but will do so...). The gcc-generated code is even a tad better 
than the VS.net one.

There is, btw., a difference between VS.net whole-program optimization 
and simple per-translation-unit optimization, but the performance 
difference is so small that it's nearly lost in the measurement noise. 
Moving the mmx intrinsic functions into the mmxstate.c file and 
declaring them as static inline made a bigger difference (still 
negligible).
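
For anyone who hasn't seen the intrinsic style, here is a rough sketch of 
what an intrinsic-based intra fragment reconstruction could look like. 
This is not the code I benchmarked: the function name, the signature and 
the 8x8 row-major residue layout are assumptions made up for 
illustration. It only shows the general idea (add a bias of 128 to each 
residue, then clamp to 0..255 via the saturating pack), written as a 
static inline function so the compiler can inline it within one 
translation unit.

```c
#include <mmintrin.h>  /* MMX intrinsics (gcc, and 32-bit VS builds) */
#include <stdint.h>

/* Hypothetical sketch, not the actual libtheora function.
 * Assumes an 8x8 fragment with residues stored row-major. */
static inline void frag_recon_intra_mmx_sketch(uint8_t *dst, int stride,
                                               const int16_t *residue)
{
    const __m64 bias = _mm_set1_pi16(128);
    int row;

    for (row = 0; row < 8; row++) {
        /* Load the eight 16-bit residues of this row as two vectors. */
        __m64 lo = *(const __m64 *)(residue + row * 8);
        __m64 hi = *(const __m64 *)(residue + row * 8 + 4);

        /* Add the bias with signed saturation. */
        lo = _mm_adds_pi16(lo, bias);
        hi = _mm_adds_pi16(hi, bias);

        /* Pack to eight bytes; unsigned saturation clamps to 0..255. */
        *(__m64 *)(dst + row * stride) = _mm_packs_pu16(lo, hi);
    }

    /* Leave MMX state so later x87 floating-point code works. */
    _mm_empty();
}
```

The same source should build with both compilers, which is the whole 
point of the exercise; how close the generated code gets to the 
handwritten assembler is up to the compiler.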

Cheers,
  Nils