[theora-dev] Patch: fragment reconstruction MMX for GCC
Timothy B. Terriberry
tterribe at email.unc.edu
Sun Dec 30 18:45:12 PST 2007
Nils Pipenbrinck wrote:
> All routines perform much better now. Inter2 alone got a speedup of
> factor 5 on Pentium-M. Athlon CPU's execute roughly 3 times faster.
> Hadn't had the chance to benchmark core2 though. It would be nice to
> hear if the code compiles on 64bit intel.
Awesome. I've committed your code, with some modifications, in r14336. In
my tests it produces results identical to the old code on both x86-64 and
x86-32.
There were two primary problems with the code as it stood. The first was
specific to x86-64: you have to cast the strides to long so that they
are placed in 64-bit registers instead of 32-bit registers; otherwise you
can't use them as index registers alongside 64-bit pointers.
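As a minimal sketch of that cast (not the committed code): the helper name load_row_below is hypothetical, a plain byte load stands in for the MMX loads in the real routines, and an LP64 target such as Linux is assumed, where long is as wide as a pointer. The stride is widened before it reaches the asm block, so gcc puts it in a register that can be paired with the pointer:

```c
#include <assert.h>

/* Hypothetical helper, not part of libtheora: returns the byte that lies
   `stride` bytes away from p, using a base+index addressing mode. */
static unsigned char load_row_below(const unsigned char *p, int stride) {
  unsigned char v;
  assert(p != 0);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __asm__ __volatile__(
    /* With the (long) cast, %[s] is a 64-bit register on x86-64, so it
       can serve as the index next to the 64-bit pointer in %[p]. */
    "movb (%[p],%[s]), %[v]\n\t"
    : [v] "=q" (v)                         /* byte-addressable register */
    : [p] "r" (p), [s] "r" ((long)stride)  /* sign-extends negative strides */
    : "memory"
  );
#else
  v = p[stride];  /* portable fallback for non-x86 targets */
#endif
  return v;
}
```

Without the cast, gcc would be free to hand the asm a 32-bit register name for the stride on x86-64, which the assembler cannot combine with a 64-bit base register.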
The second was specific to x86-32: when -fPIC is used and
-fomit-frame-pointer is not, x86-32 gets just _five_ general-purpose
registers (%eax, %ecx, %edx, %esi, and %edi). All of your routines used
six. This is the cause of the oft-reported problem that the encoder asm
will not compile in debug mode.
Fortunately, it's relatively easy to eliminate a register from each
routine. oc_frag_recon_intra_mmx can get away with one fewer offset, and
letting gcc handle the looping in oc_frag_recon_inter_mmx allows it
to unroll the loop when -funroll-loops is enabled, eliminating the need
for a counter register. Without -funroll-loops, it will handle the
register spill itself. On x86-64, there's obviously an extra register
available, so it's not a problem.
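A rough sketch of the "let gcc handle the looping" idea, under stated assumptions: copy_rows4 is a hypothetical function, not the libtheora routine, and it uses ordinary 32-bit moves instead of MMX to stay short. The asm block processes a single row and names no counter; the loop itself is plain C, so the compiler decides whether to unroll it or spill the counter:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical example, not the libtheora code: copies 4 bytes per row. */
static void copy_rows4(unsigned char *dst, const unsigned char *src,
                       int stride, int rows) {
  int i;
  assert(rows >= 0);
  /* The loop lives in C, so the asm needs no counter register of its own. */
  for (i = 0; i < rows; i++) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int tmp;
    __asm__ __volatile__(
      "movl (%[src]), %[tmp]\n\t"
      "movl %[tmp], (%[dst])\n\t"
      : [tmp] "=&r" (tmp)                 /* early-clobber scratch register */
      : [src] "r" (src), [dst] "r" (dst)
      : "memory"
    );
#else
    memcpy(dst, src, 4);  /* portable fallback for non-x86 targets */
#endif
    src += stride;
    dst += stride;
  }
}
```

Here the asm statement asks for three registers (two pointers and one scratch), and the pointer stepping happens outside it, where gcc is free to keep the values in registers or on the stack.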
I also eliminated the "safe" version of oc_frag_recon_inter2_mmx that
handled the case where the strides differ, because it never occurs, and I
don't foresee a situation where we'd want it to.
Also note that with -fomit-frame-pointer, gcc requires an extra register
if you use any "m" arguments, because it can't track how %esp changes
inside your asm block, so it can't generate a reference that is
guaranteed to work if you start mucking around with the stack. That's
what led to the errors Ralph reported. Eliminating the "safe" version
solved the problem, but getting down to 5 registers would've solved it also.