[theora-dev] VC2005 MMX patch.
Nils Pipenbrinck
n.pipenbrinck at cubic.org
Wed Dec 26 09:55:06 PST 2007
Hi Timothy,
> I notice your patch does not use the port to intrinsics you said you
> did, except for a few small bits in mmxstate.c (and thus, everything
> else will not support x86-64). Did you test the speed of the intrinsics
> version against your hand-rolled version? What were the results?
>
The intrinsic versions are roughly 1.5 times slower than the assembler
version. Visual C generates lots of unnecessary register moves and is
not very clever at scheduling loads early. This is not much of a problem
for large and high bit rate files since these are dominated by cache
misses, but it starts to make a difference for small sized videos.
The main problem is that in the fragment reconstructions (for example)
we often need a constant zero. Even if I preload an MMX register with
zero, MSVC creates a temporary copy, although the contents of this
register are never destroyed. The extra instruction is something we
could live with, but the increased register pressure causes slow code to
be generated.
I used the intrinsics in mmxstate.c because the compiler did a good
job of interleaving/inlining the MMX parts into the ordinary integer
instruction flow. Here they are a real benefit.
Regarding x86-64: The MSVC help file tells me that the MMX intrinsics
with 64 bit operands won't be supported by the 64 bit compiler, so
additional work is required anyway (for whatever reason they decided
to do that).
> I also notice you made lots of minor changes, which will make it more
> difficult to keep the code in sync with the gcc version.
I only made logical changes in the files that are for Visual Studio. The
few changes in the other files were required to get theora to build
with the USE_ASM macro.
If this becomes a problem I'd be happy to merge it into the gcc tree.
> I'd like to
> keep things as consistent as possible. E.g., what's the rational for
> expanding out all of the macros for the IDCT, other than, "it was easier
> that way"? Does MSVC really not unroll loops with inline asm in them for
> you?
You hit the nail on the head:
MSVC neither inlines any function that contains raw assembler nor
allows macro expansion inside assembler. The IDCT source from GCC
uses macro expansion a lot, so a real port was not feasible. I therefore
converted an object file dump instead. I know - it's ugly. If someone has
a better idea how to do it, let me know.
> I'm also confused by your bit-twiddling average:
>
> average = (a & b) + (((a ^ b) & 0xfe) >> 1);
>
> What on earth is the purpose of the AND if you're just going to shift
> off the lower bit anyway?
>
Simply because the code above is for a single byte. It extends well to a
full machine word, and in that case you need the AND to prevent
the LSB of byte1 from shifting into the MSB of byte0. This only makes
sense if you process more than one byte at a time, of course.
The 32 bit version of the above function should make it clear (hint
hint: this would be an easy improvement for oc_frag_recon_inter2_c):
ogg_uint32_t pavgub4 (ogg_uint32_t a, ogg_uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xfefefefe) >> 1);
}
I'll measure how it performs against the "add and shift" version from
the gcc sources, just to be sure that it is faster.
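To illustrate the point about the mask, here is a minimal sketch that puts the packed average next to a variant without the AND and a plain per-byte reference. The names pavgub4_nomask, avg4_bytewise and avg_row8 are hypothetical helpers made up for this example, and the typedef only stands in for the real ogg_uint32_t from the Ogg headers. Note that the formula rounds down, while the MMX/SSE PAVGB instruction rounds up, so this is not a drop-in emulation of that instruction.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for ogg_uint32_t from the Ogg headers (assumption for this sketch). */
typedef uint32_t ogg_uint32_t;

/* Packed average from the message: four byte lanes at once, rounding down. */
static ogg_uint32_t pavgub4(ogg_uint32_t a, ogg_uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xfefefefeu) >> 1);
}

/* Same formula without the mask (hypothetical, for comparison only):
   the low bit of each lane's XOR is shifted into the top bit of the
   lane below it. */
static ogg_uint32_t pavgub4_nomask(ogg_uint32_t a, ogg_uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}

/* Reference (hypothetical): average each byte lane separately. */
static ogg_uint32_t avg4_bytewise(ogg_uint32_t a, ogg_uint32_t b)
{
    ogg_uint32_t r = 0;
    int i;
    for (i = 0; i < 32; i += 8) {
        ogg_uint32_t x = (a >> i) & 0xffu;
        ogg_uint32_t y = (b >> i) & 0xffu;
        r |= ((x + y) >> 1) << i;
    }
    return r;
}

/* Hypothetical helper: average two rows of 8 pixels, 4 bytes per step.
   memcpy is used for the loads and stores so that alignment does not
   matter; the lanes are independent, so byte order does not matter either. */
static void avg_row8(unsigned char *dst,
                     const unsigned char *s1,
                     const unsigned char *s2)
{
    int i;
    for (i = 0; i < 8; i += 4) {
        ogg_uint32_t a, b, r;
        memcpy(&a, s1 + i, 4);
        memcpy(&b, s2 + i, 4);
        r = pavgub4(a, b);
        memcpy(dst + i, &r, 4);
    }
}
```

With a = 0x00000100 and b = 0 the masked version returns 0, as the per-byte reference does, while the unmasked one returns 0x00000080: the low bit of byte1 has ended up in the MSB of byte0.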
-----
Here are some benchmark results btw:
1.3 GHz Athlon. Profiling with AMD CodeAnalyst. I used dumpvid to decode
a large, high resolution, high quality ogg and sent the output to /dev/null
Samples in libtheora.dll:
With MMX: 74660
Without MMX: 126909
Overall performance gain: ~1.7x
Top ten cycle-eaters for the MMX build:
oc_frag_recon_inter2_mmx        7435
oc_dec_frags_recon_mcu_plane    6856
loop_filter_h4                  5177
oc_state_frag_recon_mmx         4485
oc_dec_ac_coeff_unpack          4189
oggpackB_look                   3136
oc_huff_token_decode            2855
oc_dec_coded_flags_unpack       2799
oc_frag_pred_dc                 2735
oc_state_frag_copy_mmx          2606
Top ten cycle-eaters for the Non-MMX build:
oc_frag_recon_inter2_c         26971
idct8                          11982
loop_filter_h                   7973
loop_filter_v                   7745
oc_frag_recon_inter_c           6821
oc_dec_frags_recon_mcu_plane    6485
oc_state_frag_recon_c           5886
oc_dec_ac_coeff_unpack          4154
oggpackB_look                   3268
idct8_4                         2971
My guess is that the P4 architecture will benefit even more from the MMX
port.
Nils