[theora-dev] VC2005 MMX patch.
n.pipenbrinck at cubic.org
Wed Dec 26 09:55:06 PST 2007
> I notice your patch does not use the port to intrinsics you said you
> did, except for a few small bits in mmxstate.c (and thus, everything
> else will not support x86-64). Did you test the speed of the intrinsics
> version against your hand-rolled version? What were the results?
The intrinsic versions are roughly 1.5 times slower than the assembler
version. Visual C generates lots of unnecessary register moves and is
not very clever at scheduling loads early. This is not much of a problem
for large and high bit rate files since these are dominated by cache
misses, but it starts to make a difference for small sized videos.
The main problem is that in the fragment reconstructions (for example)
we often need constants of zero. Even if I preload an MMX register with
zero, MSVC will create a temporary copy, although the contents of this
register will never be destroyed. The extra instruction is something we
could live with, but the increased register pressure causes slow code to
I used the intrinsics in the mmxstate.c because the compiler did a good
job at interleaving/inlining the MMX parts into the ordinary integer
instruction flow. Here it is a real benefit.
Regarding x86-64: The MSVC helpfile tells me that the MMX intrinsics
with 64 bit operands won't be supported for the 64 bit compiler, so
additional work is required anyways (for whatever reason they decided
to do that).
> I also notice you made lots of minor changes, which will make it more
> difficult to keep the code in sync with the gcc version.
I only made logical changes in the files that are for Visual Studio. The
few changes in the other files were required to get theora building
with the USE_ASM macro.
If this becomes a problem I'd be happy to merge it into the gcc tree.
> I'd like to
> keep things as consistent as possible. E.g., what's the rationale for
> expanding out all of the macros for the IDCT, other than, "it was easier
> that way"? Does MSVC really not unroll loops with inline asm in them for
You hit the nail on the head:
MSVC neither inlines any function that contains raw assembler nor
allows macro expansion inside assembler. The IDCT source from GCC
uses macro expansion a lot, so a real port was not feasible. I therefore
did an object file dump conversion. I know - it's ugly. If someone has a
better idea how to do it, let me know.
> I'm also confused by your bit-twiddling average:
> average = (a & b) + (((a ^ b) & 0xfe) >> 1);
> What on earth is the purpose of the AND if you're just going to shift
> off the lower bit anyway?
Simply because the code above is for a single byte. It extends well to a
full machine word, and in that case you need the AND to prevent the LSB
of byte1 from shifting into the MSB of byte0. This only makes sense if
you process more than one byte at a time, of course.
The 32 bit version of the above function should make it clear (hint
hint: this would be an easy improvement for oc_frag_recon_inter2_c):

ogg_uint32_t pavgub4 (ogg_uint32_t a, ogg_uint32_t b)
{
  return (a & b) + (((a ^ b) & 0xfefefefe) >> 1);
}
I'll measure how it performs against the "add and shift" version from
the gcc sources, just to be sure that it is faster.
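To illustrate why the mask matters, here is a minimal sketch in portable C. It assumes uint32_t as a stand-in for ogg_uint32_t; pavgub4_unmasked and pavgub4_reference are hypothetical helpers written only for comparison and are not part of libtheora. The trick rests on the per-byte identity a + b = 2*(a & b) + (a ^ b), so the rounded-down average is (a & b) + ((a ^ b) >> 1), provided the shift cannot carry a bit across a byte boundary.

```c
#include <stdint.h>

/* Rounded-down average of four packed unsigned bytes.
   Clearing bit 0 of every byte before the shift keeps the LSB of one
   byte from landing in the MSB of the byte below it. */
uint32_t pavgub4(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xfefefefeU) >> 1);
}

/* Hypothetical variant without the mask, shown only to expose the bug:
   bits leak from each byte into its lower neighbour. */
uint32_t pavgub4_unmasked(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}

/* Hypothetical byte-at-a-time reference used to check the packed version. */
uint32_t pavgub4_reference(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    int i;
    for (i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xff;
        uint32_t y = (b >> (8 * i)) & 0xff;
        r |= ((x + y) >> 1) << (8 * i);
    }
    return r;
}
```

With a = 0x00000100 and b = 0, byte1 holds 1 and byte0 holds 0, so every per-byte average is 0. The masked version returns 0, while the unmasked one returns 0x80 because the low bit of byte1 was shifted into byte0.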
Here are some benchmark results btw:
1.3 GHz Athlon, profiled with AMD CodeAnalyst. I used dumpvid to decode
a large, high-resolution, high-quality ogg and sent the output to dev/null.
Samples in libtheora.dll:
With MMX: 74660
Without MMX: 126909
Overall performance gain: ~1.7x (126909 / 74660)
Top ten cycle-eaters for the MMX build:
Top ten cycle-eaters for the Non-MMX build:
My guess is that the P4 architecture will benefit even more from the MMX