[Theora-dev] MMX/SSE optimisations and GCC 3.4 builtins - launching
the debate
Rodolphe Ortalo
rodolphe.ortalo at free.fr
Tue Aug 24 14:28:03 PDT 2004
I've always wondered whether the recent MMX/SSE-related builtins in GCC (and
also in the Intel compiler, if I understood correctly) were worth using.
Wim's work seemed like a good occasion to give them a try, so I picked the
sad8x8__mmxext() function (the first one defined in dsp_mmxext.c) to
experiment with. Here are the results.
First of all, the compiled programs are apparently correct: I've run them
successfully (though I don't have many test videos here), and their
performance seems to be on par with the original hand-written MMX version.
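(In case anyone wants to reproduce the timing comparison, a crude harness
along these lines is enough. The buffer contents, strides and iteration count
are arbitrary choices of mine, and it assumes the function is built without
"static" for the test; nothing below comes from the Theora tree.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned int ogg_uint32_t;   /* stand-in for the ogg typedef */

/* The function under test, built without "static" for this harness. */
ogg_uint32_t sad8x8__mmxext (unsigned char *ptr1, ogg_uint32_t stride1,
                             unsigned char *ptr2, ogg_uint32_t stride2);

int main (void)
{
  unsigned char blk1[8*8], blk2[8*8];
  ogg_uint32_t sum = 0;
  long i;
  clock_t t0, t1;

  for (i = 0; i < 64; i++) {          /* fill two 8x8 blocks */
    blk1[i] = rand() & 0xff;
    blk2[i] = rand() & 0xff;
  }

  t0 = clock();
  for (i = 0; i < 10000000; i++)      /* time many calls */
    sum += sad8x8__mmxext(blk1, 8, blk2, 8);
  t1 = clock();

  printf("sum=%u time=%.2fs\n", sum,
         (double)(t1 - t0) / CLOCKS_PER_SEC);
  return 0;
}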
Here are both variants:
1) Original (hand-written) assembly version (from Wim's patch):
static ogg_uint32_t sad8x8__mmxext (unsigned char *ptr1, ogg_uint32_t stride1,
                                    unsigned char *ptr2, ogg_uint32_t stride2)
{
  ogg_uint32_t DiffVal;

  __asm__ __volatile__ (
    "  .balign 16            \n\t"
    "  pxor %%mm7, %%mm7     \n\t" /* mm7 contains the result */
    ".rept 7                 \n\t"
    "  movq (%1), %%mm0      \n\t" /* take 8 bytes */
    "  movq (%2), %%mm1      \n\t"
    "  psadbw %%mm1, %%mm0   \n\t"
    "  add %3, %1            \n\t" /* Inc pointer into the new data */
    "  paddw %%mm0, %%mm7    \n\t" /* accumulate difference... */
    "  add %4, %2            \n\t" /* Inc pointer into ref data */
    ".endr                   \n\t"
    "  movq (%1), %%mm0      \n\t" /* take 8 bytes */
    "  movq (%2), %%mm1      \n\t"
    "  psadbw %%mm1, %%mm0   \n\t"
    "  paddw %%mm0, %%mm7    \n\t" /* accumulate difference... */
    "  movd %%mm7, %0        \n\t"
    : "=r" (DiffVal),
      "+r" (ptr1),
      "+r" (ptr2)
    : "r" (stride1),
      "r" (stride2)
    : "memory"
  );
  return DiffVal;
}
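For reference, the generic C routine that both optimized versions replace
does essentially the following. (This is my own paraphrase of the sad8x8
code in dsp.c, written from memory, not a verbatim copy.)

static ogg_uint32_t sad8x8__c (unsigned char *ptr1, ogg_uint32_t stride1,
                               unsigned char *ptr2, ogg_uint32_t stride2)
{
  ogg_uint32_t i, j, sad = 0;
  for (i = 0; i < 8; i++) {
    for (j = 0; j < 8; j++) {
      int d = (int)ptr1[j] - (int)ptr2[j];   /* byte difference */
      sad += (d < 0) ? -d : d;               /* accumulate |d| */
    }
    ptr1 += stride1;   /* step to the next row of each block */
    ptr2 += stride2;
  }
  return sad;
}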
2) New version (using C and GCC builtins):
#include <xmmintrin.h>  /* MMX/SSE intrinsics: _mm_sad_pu8() etc. */

static ogg_uint32_t sad8x8__mmxext (unsigned char *ptr1, ogg_uint32_t stride1,
                                    unsigned char *ptr2, ogg_uint32_t stride2)
{
  ogg_uint32_t i;
  __m64 acc = _mm_setr_pi16(0, 0, 0, 0);  /* accumulator, cleared */

  for (i = 8; i; i--) {
    /* Sum of absolute differences of 8 bytes per row, aka psadbw. */
    __m64 tmp = _mm_sad_pu8(*((__m64*)ptr1), *((__m64*)ptr2));
    acc = _mm_add_pi16(tmp, acc);           /* aka paddw */
    /* Step to next row of block. */
    ptr1 += stride1;
    ptr2 += stride2;
  }
  return _mm_cvtsi64_si32(acc);
}
Obviously, this is much less intrusive with respect to the original C code
(found in dsp.c).
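One caveat that applies to both variants (I assume it is handled at a higher
level in the dsp code, since neither listing does it): MMX registers alias
the x87 floating-point stack, so an emms must be executed after the last MMX
instruction and before any later x87 code. With the builtins it is a
one-liner; the function could end like this instead:

  ogg_uint32_t result = _mm_cvtsi64_si32(acc);
  _mm_empty();   /* the intrinsic form of emms: clear MMX state */
  return result;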
3) Assembly generated by GCC 3.4 (with -O6 -mmmx -msse) obtained with the -S
flag:
	.file	"dsp_mmxext.c"
	.text
	.p2align 4,,15
	.type	sad8x8__mmxext, @function
sad8x8__mmxext:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	subl	$8, %esp
	movl	8(%ebp), %edx
	movl	12(%ebp), %edi
	movl	16(%ebp), %eax
	movl	20(%ebp), %esi
	pxor	%mm1, %mm1
	movl	$8, %ecx
	.p2align 4,,15
.L19:
	movq	(%edx), %mm0
	psadbw	(%eax), %mm0
	paddw	%mm1, %mm0
	addl	%edi, %edx
	addl	%esi, %eax
	decl	%ecx
	movq	%mm0, %mm1
	jne	.L19
	movq	%mm0, -16(%ebp)
	movl	-16(%ebp), %eax
	addl	$8, %esp
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
[...]
The resulting assembly looks correct to me, and the "near C" code of 2) seems
much more maintainable than the inline assembly of 1). Note, though, that the
compiler kept the loop rolled (compare the jne .L19 loop above with the
.rept 7 full unroll in the hand-written version). I also wonder whether loop
unrolling and register scheduling could not be done more intelligently by the
compiler in some cases (but that's really an open question for me).
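If unrolling is what we want, one option is simply to ask GCC for it instead
of hand-coding .rept, e.g. with the standard -funroll-loops flag (whether it
actually unrolls this particular loop profitably is something I have not
measured):

  gcc -O3 -mmmx -msse -funroll-loops -S dsp_mmxext.c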
But, of course, I don't claim to master all the bells and whistles of inline
assembly (in fact, I've written very little assembly in the last decade...).
Opinions?
Rodolphe