[theora-dev] VC2005 MMX patch.

Wed Dec 26 09:55:06 PST 2007

Hi Timothy,
> I notice your patch does not use the port to intrinsics you said you
> did, except for a few small bits in mmxstate.c (and thus, everything
> else will not support x86-64). Did you test the speed of the intrinsics
> version against your hand-rolled version? What were the results?
>   
The intrinsic versions are roughly 1.5 times slower than the assembler 
version. Visual C generates lots of unnecessary register moves and is 
not very clever at scheduling loads early. This is not much of a problem 
for large, high-bit-rate files, since these are dominated by cache 
misses, but it starts to make a difference for small videos.

The main problem is that in the fragment reconstructions (for example) 
we often need constants of zero. Even if I preload an MMX register with 
zero, MSVC will create a temporary copy, although the contents of this 
register are never destroyed. The extra instruction is something we 
could live with, but the increased register pressure causes slow code to 
be generated.
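
To illustrate the kind of code in question, here is a minimal sketch of a 
four-pixel reconstruction step written with MMX intrinsics, where a zeroed 
register is used both to widen bytes to words and as the upper half of the 
pack. The names recon4_c, recon4_mmx and recon4 are hypothetical and not 
taken from libtheora, and the sketch does not reproduce the MSVC code 
generation described above; it only shows where the zero constant enters. 
On targets without MMX it falls back to the scalar version.

```c
#include <stdint.h>
#include <string.h>

#if defined(__MMX__) || defined(_M_IX86)
#include <mmintrin.h>
#define HAVE_MMX_SKETCH 1
#endif

/* Scalar reference: dst = clamp(src + residue, 0, 255) for four pixels. */
void recon4_c(uint8_t dst[4], const uint8_t src[4], const int16_t res[4])
{
  int i;
  for (i = 0; i < 4; i++) {
    int v = src[i] + res[i];
    dst[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
  }
}

#ifdef HAVE_MMX_SKETCH
/* Hypothetical intrinsic version of the same step. */
void recon4_mmx(uint8_t dst[4], const uint8_t src[4], const int16_t res[4])
{
  int32_t in, out;
  __m64 zero, p, r;
  memcpy(&in, src, 4);              /* four predictor bytes */
  zero = _mm_setzero_si64();        /* the zero constant */
  p = _mm_cvtsi32_si64(in);
  r = _mm_set_pi16(res[3], res[2], res[1], res[0]);
  p = _mm_unpacklo_pi8(p, zero);    /* widen bytes to 16-bit words */
  p = _mm_adds_pi16(p, r);          /* add residue, signed saturation */
  p = _mm_packs_pu16(p, zero);      /* pack back, clamp to 0..255 */
  out = _mm_cvtsi64_si32(p);
  _mm_empty();                      /* leave MMX state before other code */
  memcpy(dst, &out, 4);
}
#endif

/* Dispatch: use the MMX sketch when available, else the scalar code. */
void recon4(uint8_t dst[4], const uint8_t src[4], const int16_t res[4])
{
#ifdef HAVE_MMX_SKETCH
  recon4_mmx(dst, src, res);
#else
  recon4_c(dst, src, res);
#endif
}
```

The variable zero is read twice and never written, so ideally it would 
stay in one register for the whole function.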

I used the intrinsics in mmxstate.c because there the compiler did a 
good job of interleaving/inlining the MMX parts into the ordinary integer 
instruction flow. Here it is a real benefit.

Regarding x86-64: The MSVC help file tells me that the MMX intrinsics 
with 64-bit operands won't be supported by the 64-bit compiler, so 
additional work is required anyway (for whatever reason they decided 
to do that).

> I also notice you made lots of minor changes, which will make it more
> difficult to keep the code in sync with the gcc version. 
I only made logical changes in the files that are for Visual Studio. The 
few changes in the other files were required to get theora building 
with the USE_ASM macro.

If this becomes a problem I'd be happy to merge it into the gcc tree.

> I'd like to
> keep things as consistent as possible. E.g., what's the rationale for
> expanding out all of the macros for the IDCT, other than, "it was easier
> that way"? Does MSVC really not unroll loops with inline asm in them for
> you? 
You hit the nail on the head:

MSVC neither inlines any function that contains raw assembler, nor 
does it allow macro expansion inside assembler. The IDCT source for GCC 
uses macro expansion a lot, so a real port was not feasible. I therefore 
converted an object file dump instead. I know - it's ugly. If someone 
has a better idea how to do it, let me know.

> I'm also confused by your bit-twiddling average:
>
>               average = (a & b) + (((a ^ b) & 0xfe) >> 1);
>
> What on earth is the purpose of the AND if you're just going to shift
> off the lower bit anyway?
>   
Simply because the code above is for a single byte. It extends well to a 
full machine word, and in that case you need the AND to prevent 
the LSB of byte1 from shifting into the MSB of byte0. Of course, this 
only makes sense if you process more than one byte at a time.

The 32-bit version of the above function should make it clear (hint 
hint: this would be an easy improvement for oc_frag_recon_inter2_c):

ogg_uint32_t pavgub4 (ogg_uint32_t a, ogg_uint32_t b)
{
  return (a & b) + (((a ^ b) & 0xfefefefe) >> 1);
}

I'll measure how it performs against the "add and 
shift" version from the gcc sources, just to be sure that it is faster.
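
As a correctness check to go with that measurement, the following sketch 
compares the packed average against a byte-at-a-time "add and shift" 
reference and shows what goes wrong without the mask. It uses uint32_t as 
a stand-in for ogg_uint32_t; pavgub4_nomask, avg4_bytewise and 
pavgub4_self_check are hypothetical helper names introduced only for this 
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Packed average of four unsigned bytes, rounding down in each byte. */
uint32_t pavgub4(uint32_t a, uint32_t b)
{
  return (a & b) + (((a ^ b) & 0xfefefefeu) >> 1);
}

/* Same expression without the mask: bit 0 of each byte is shifted
   into bit 7 of the byte below it. */
uint32_t pavgub4_nomask(uint32_t a, uint32_t b)
{
  return (a & b) + ((a ^ b) >> 1);
}

/* Reference: "add and shift", one byte at a time. */
uint32_t avg4_bytewise(uint32_t a, uint32_t b)
{
  uint32_t r = 0;
  int i;
  for (i = 0; i < 4; i++) {
    uint32_t x = (a >> (8 * i)) & 0xff;
    uint32_t y = (b >> (8 * i)) & 0xff;
    r |= ((x + y) >> 1) << (8 * i);
  }
  return r;
}

/* Checks every byte pair, with different values in the four lanes. */
void pavgub4_self_check(void)
{
  uint32_t x, y;
  for (x = 0; x < 256; x++) {
    for (y = 0; y < 256; y++) {
      uint32_t a = x | (y << 8) | ((255 - x) << 16) | ((255 - y) << 24);
      uint32_t b = y | ((255 - x) << 8) | (x << 16) | ((255 - y) << 24);
      assert(pavgub4(a, b) == avg4_bytewise(a, b));
    }
  }
}
```

For example, with a = 0x00000100 and b = 0 the masked version returns 0, 
while the unmasked one returns 0x00000080, because the low bit of byte1 
has leaked into the top bit of byte0. The identity behind the trick is 
a + b = 2*(a & b) + (a ^ b), applied to each byte separately.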

-----

Here are some benchmark results btw:

1.3 GHz Athlon, profiled with AMD CodeAnalyst. I used dumpvid to decode 
a large, high-resolution, high-quality Ogg file and sent the output to 
dev/null.

Samples in libtheora.dll:

With MMX: 74660
Without MMX: 126909

Overall performance gain: ~1.7x

Top ten cycle-eaters for the MMX build:

oc_frag_recon_inter2_mmx                    7435             
oc_dec_frags_recon_mcu_plane                6856             
loop_filter_h4                              5177             
oc_state_frag_recon_mmx                     4485             
oc_dec_ac_coeff_unpack                      4189             
oggpackB_look                               3136             
oc_huff_token_decode                        2855             
oc_dec_coded_flags_unpack                   2799             
oc_frag_pred_dc                             2735             
oc_state_frag_copy_mmx                      2606

Top ten cycle-eaters for the Non-MMX build:

oc_frag_recon_inter2_c                      26971            
idct8                                       11982            
loop_filter_h                               7973             
loop_filter_v                               7745             
oc_frag_recon_inter_c                       6821             
oc_dec_frags_recon_mcu_plane                6485             
oc_state_frag_recon_c                       5886             
oc_dec_ac_coeff_unpack                      4154             
oggpackB_look                               3268             
idct8_4                                     2971 

My guess is that the P4 architecture will benefit even more from the MMX 
port.

Nils