[Theora-dev] MMX/SSE optimisations and GCC 3.4 builtins - launching the debate

Rodolphe Ortalo rodolphe.ortalo at free.fr
Tue Aug 24 14:28:03 PDT 2004


I've always wondered if the recent MMX&SSE-related builtins of GCC (and also 
the Intel Compiler if I understood correctly) were worth using.
I guess Wim's work was a good occasion to give them a try. I selected the 
sad8x8__mmxext() function (the first one defined in dsp_mmxext.c) as a 
candidate to experiment with. Here are the results.

First of all, the compiled programs appear to be correct, and I've run them 
successfully (though I don't have many test videos here). Their performance 
seems to be on par with the original hand-written MMX version.

Here are both variants:

1) Original (hand-written) assembly version (from Wim's patch):
static ogg_uint32_t sad8x8__mmxext (unsigned char *ptr1, ogg_uint32_t stride1,
		       	    unsigned char *ptr2, ogg_uint32_t stride2)
{
  ogg_uint32_t  DiffVal;

  __asm__ __volatile__ (
    "  .balign 16                   \n\t"
    "  pxor %%mm7, %%mm7            \n\t" 	/* mm7 contains the result */

    ".rept 7                        \n\t"
    "  movq (%1), %%mm0             \n\t"	/* take 8 bytes */
    "  movq (%2), %%mm1             \n\t"
    "  psadbw %%mm1, %%mm0          \n\t"
    "  add %3, %1                   \n\t"	/* Inc pointer into the new data */
    "  paddw %%mm0, %%mm7           \n\t"	/* accumulate difference... */
    "  add %4, %2                   \n\t"	/* Inc pointer into ref data */
    ".endr                          \n\t"

    "  movq (%1), %%mm0             \n\t"	/* take 8 bytes */
    "  movq (%2), %%mm1             \n\t"
    "  psadbw %%mm1, %%mm0          \n\t"
    "  paddw %%mm0, %%mm7           \n\t"	/* accumulate difference... */
    "  movd %%mm7, %0               \n\t"

     : "=r" (DiffVal),
       "+r" (ptr1), 
       "+r" (ptr2) 
     : "r" (stride1),
       "r" (stride2)
     : "memory"
  );

  return DiffVal;
}


2) New version (using C and GCC builtins):
#include <xmmintrin.h>  /* _mm_sad_pu8 (SSE); also pulls in the MMX intrinsics */

static ogg_uint32_t sad8x8__mmxext (unsigned char *ptr1, ogg_uint32_t stride1,
		       	    unsigned char *ptr2, ogg_uint32_t stride2)
{
  ogg_uint32_t  i;
  __m64 acc; /* accumulator */

  acc = _mm_setr_pi16(0,0,0,0);
  for (i=8; i; i--) {
    __m64 tmp;
    tmp = _mm_sad_pu8(*((__m64*)ptr1),*((__m64*)ptr2)); /* aka psadbw */
    acc = _mm_add_pi16(tmp,acc); /* aka paddw */

    /* Step to next row of block. */
    ptr1 += stride1;
    ptr2 += stride2;
  }
  return _mm_cvtsi64_si32(acc); /* low 32 bits of the accumulator, aka movd */
}

Obviously this is much less intrusive with respect to the original C code 
(found in dsp.c).
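
For checking either variant against known values, a plain scalar version is 
handy. The following is only a minimal sketch and not the actual code from 
dsp.c: the name sad8x8_scalar and the use of plain unsigned int instead of 
ogg_uint32_t are assumptions, made so that it compiles on its own.

```c
/* Hypothetical scalar reference (illustrative names, not the dsp.c code):
   sum of absolute differences over an 8x8 block of bytes. */
static unsigned int sad8x8_scalar (const unsigned char *ptr1, unsigned int stride1,
                                   const unsigned char *ptr2, unsigned int stride2)
{
  unsigned int sum = 0;
  int row, col;

  for (row = 0; row < 8; row++) {
    for (col = 0; col < 8; col++) {
      /* Absolute difference of one pair of pixels. */
      int d = (int)ptr1[col] - (int)ptr2[col];
      sum += (unsigned int)(d < 0 ? -d : d);
    }
    /* Step to next row of block. */
    ptr1 += stride1;
    ptr2 += stride2;
  }
  /* The result is at most 64 * 255 = 16320, so it fits in 16 bits. */
  return sum;
}
```

Since the total never exceeds 16320, accumulating the per-row psadbw results 
with a 16-bit paddw, as both variants do, should not overflow.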

3) Assembly generated by GCC 3.4 (with -O6 -mmmx -msse) obtained with the -S 
flag:
        .file   "dsp_mmxext.c"
        .text
        .p2align 4,,15
        .type   sad8x8__mmxext, @function
sad8x8__mmxext:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %edi
        pushl   %esi
        subl    $8, %esp
        movl    8(%ebp), %edx
        movl    12(%ebp), %edi
        movl    16(%ebp), %eax
        movl    20(%ebp), %esi
        pxor    %mm1, %mm1
        movl    $8, %ecx
        .p2align 4,,15
.L19:
        movq    (%edx), %mm0
        psadbw  (%eax), %mm0
        paddw   %mm1, %mm0
        addl    %edi, %edx
        addl    %esi, %eax
        decl    %ecx
        movq    %mm0, %mm1
        jne     .L19
        movq    %mm0, -16(%ebp)
        movl    -16(%ebp), %eax
        addl    $8, %esp
        popl    %esi
        popl    %edi
        popl    %ebp
        ret
[...]

The resulting assembly looks correct to me, and it seems to me that the 
"near C" code of 2) is much more maintainable than the inline assembly of 1).
Also, I wonder whether the compiler could handle loop unrolling and register 
scheduling more intelligently in some cases (but that's really an open 
question for me).
But, of course, I haven't mastered all the bells and whistles of inline 
assembly (in fact, I've written very little assembly in the last decade...).

Opinions?

Rodolphe

