[Theora-dev] MMX/SSE optimisations and GCC 3.4 builtins - launching the debate

Rodolphe Ortalo rodolphe.ortalo at free.fr
Tue Aug 24 14:28:03 PDT 2004

I've always wondered whether the recent MMX- and SSE-related builtins of GCC 
(and also of the Intel compiler, if I understood correctly) were worth using.
I guess Wim's work was a good occasion to give them a try. I selected the 
sad8x8__mmxext() function (first one defined in dsp_mmxext.c) as a candidate 
to experiment with. Here are the results.

First of all, the compiled programs appear to be correct, and I've run them 
successfully (though I do not have many test videos here). Their performance 
seems to be on par with the original MMX-optimized version.

Here are both variants:

1) Original (hand-written) assembly version (from Wim's patch):
static ogg_uint32_t sad8x8__mmxext (unsigned char *ptr1, ogg_uint32_t stride1,
                                    unsigned char *ptr2, ogg_uint32_t stride2)
{
  ogg_uint32_t  DiffVal;

  __asm__ __volatile__ (
    "  .balign 16                   \n\t"
    "  pxor %%mm7, %%mm7            \n\t"	/* mm7 contains the result */

    ".rept 7                        \n\t"
    "  movq (%1), %%mm0             \n\t"	/* take 8 bytes */
    "  movq (%2), %%mm1             \n\t"
    "  psadbw %%mm1, %%mm0          \n\t"
    "  add %3, %1                   \n\t"	/* Inc pointer into the new data */
    "  paddw %%mm0, %%mm7           \n\t"	/* accumulate difference... */
    "  add %4, %2                   \n\t"	/* Inc pointer into ref data */
    ".endr                          \n\t"

    "  movq (%1), %%mm0             \n\t"	/* take 8 bytes */
    "  movq (%2), %%mm1             \n\t"
    "  psadbw %%mm1, %%mm0          \n\t"
    "  paddw %%mm0, %%mm7           \n\t"	/* accumulate difference... */
    "  movd %%mm7, %0               \n\t"

     : "=r" (DiffVal),
       "+r" (ptr1), 
       "+r" (ptr2) 
     : "r" (stride1),
       "r" (stride2)
     : "memory"
  );

  return DiffVal;
}

2) New version (using C and GCC builtins):
#include <xmmintrin.h>

static ogg_uint32_t sad8x8__mmxext (unsigned char *ptr1, ogg_uint32_t stride1,
                                    unsigned char *ptr2, ogg_uint32_t stride2)
{
  ogg_uint32_t  i;
  __m64 acc; /* accumulator */

  acc = _mm_setr_pi16(0,0,0,0);
  for (i=8; i; i--) {
    __m64 tmp;
    tmp = _mm_sad_pu8(*((__m64*)ptr1),*((__m64*)ptr2)); /* aka psadbw */
    acc = _mm_add_pi16(tmp,acc); /* aka paddw */

    /* Step to next row of block. */
    ptr1 += stride1;
    ptr2 += stride2;
  }
  return _mm_cvtsi64_si32(acc);
}

Obviously this is much less intrusive with respect to the original C code 
(found in dsp.c).
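
For comparison, the quantity both variants compute can be written as a plain 
scalar loop. This is only a minimal sketch, assuming the usual 
sum-of-absolute-differences definition. It is not a copy of the code in dsp.c, 
and the names sad8x8_ref and sad8x8_ref_selfcheck are made up for the 
illustration. The self-check also shows why a 16-bit paddw accumulator is 
enough: the worst case is 64 * 255 = 16320, which fits in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar 8x8 sum of absolute differences.
   Hypothetical name; a sketch, not the code from dsp.c. */
static uint32_t sad8x8_ref(const unsigned char *ptr1, uint32_t stride1,
                           const unsigned char *ptr2, uint32_t stride2)
{
  uint32_t sad = 0;
  int row, col;

  for (row = 0; row < 8; row++) {
    for (col = 0; col < 8; col++) {
      int d = (int)ptr1[col] - (int)ptr2[col];
      sad += (uint32_t)(d < 0 ? -d : d);
    }
    if (row < 7) {
      /* Step to next row of block (not needed after the last row). */
      ptr1 += stride1;
      ptr2 += stride2;
    }
  }
  return sad;
}

/* Self-check on made-up data; returns the worst-case SAD of an 8x8 block. */
uint32_t sad8x8_ref_selfcheck(void)
{
  unsigned char a[64], b[64], lo[64], hi[64];
  int i;

  for (i = 0; i < 64; i++) {
    a[i]  = (unsigned char)i;
    b[i]  = (unsigned char)(63 - i);
    lo[i] = 0;
    hi[i] = 255;
  }

  assert(sad8x8_ref(a, 8, a, 8) == 0);       /* identical blocks */
  assert(sad8x8_ref(a, 8, b, 8) == 2048);    /* sum of |2i - 63|, i = 0..63 */
  assert(sad8x8_ref(lo, 8, hi, 8) == 16320); /* 64 * 255, below 65536 */

  return sad8x8_ref(lo, 8, hi, 8);
}
```

Running the same buffers through sad8x8__mmxext() and comparing against this 
loop would be a cheap way to check either optimized variant without test 
videos.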

3) Assembly generated by GCC 3.4 (with -O6 -mmmx -msse), obtained with the -S 
option:
        .file   "dsp_mmxext.c"
        .p2align 4,,15
        .type   sad8x8__mmxext, @function
sad8x8__mmxext:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %edi
        pushl   %esi
        subl    $8, %esp
        movl    8(%ebp), %edx
        movl    12(%ebp), %edi
        movl    16(%ebp), %eax
        movl    20(%ebp), %esi
        pxor    %mm1, %mm1
        movl    $8, %ecx
        .p2align 4,,15
.L19:
        movq    (%edx), %mm0
        psadbw  (%eax), %mm0
        paddw   %mm1, %mm0
        addl    %edi, %edx
        addl    %esi, %eax
        decl    %ecx
        movq    %mm0, %mm1
        jne     .L19
        movq    %mm0, -16(%ebp)
        movl    -16(%ebp), %eax
        addl    $8, %esp
        popl    %esi
        popl    %edi
        popl    %ebp
        ret

The resulting assembly looks correct to me, and it seems to me that the 
"near C" code of 2) is much more maintainable than the inline assembly of 1).
Also, I wonder whether loop unrolling and register scheduling could not be 
done more intelligently by the compiler in some cases (but that's really an 
open question for me).
But, of course, I don't master all the bells and whistles of inline assembly 
(in fact, I've written very little assembly in the last decade...).
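
On the unrolling question, one way to try it while staying in C is to spell 
out the eight rows with a small macro (mirroring the ".rept 7" of the 
hand-written version) and leave register allocation and scheduling to the 
compiler. The following is only a sketch under several assumptions: the names 
load8, SAD_ADD, SAD_ROW, sad8x8_unrolled and sad8x8_unrolled_selfcheck are 
hypothetical, the loads go through memcpy rather than a pointer cast, the 
_mm_empty() call (emms) is my addition, and a scalar fallback is used when 
MMX/SSE are not enabled. I have not measured whether it beats the loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__MMX__) && defined(__SSE__)
#include <xmmintrin.h>

/* Load 8 bytes without assuming any alignment of p. */
static __m64 load8(const unsigned char *p)
{
  __m64 v;
  memcpy(&v, p, sizeof v);
  return v;
}

/* psadbw on one row, then paddw into the accumulator. */
#define SAD_ADD(acc, p1, p2) \
  ((acc) = _mm_add_pi16(_mm_sad_pu8(load8(p1), load8(p2)), (acc)))

/* Same, followed by a step to the next row of both blocks. */
#define SAD_ROW(acc, p1, s1, p2, s2) \
  do { SAD_ADD(acc, p1, p2); (p1) += (s1); (p2) += (s2); } while (0)
#endif

uint32_t sad8x8_unrolled(const unsigned char *ptr1, uint32_t stride1,
                         const unsigned char *ptr2, uint32_t stride2)
{
#if defined(__MMX__) && defined(__SSE__)
  __m64 acc = _mm_setzero_si64();
  uint32_t result;

  /* Seven rows with pointer stepping, like ".rept 7" ... ".endr". */
  SAD_ROW(acc, ptr1, stride1, ptr2, stride2);
  SAD_ROW(acc, ptr1, stride1, ptr2, stride2);
  SAD_ROW(acc, ptr1, stride1, ptr2, stride2);
  SAD_ROW(acc, ptr1, stride1, ptr2, stride2);
  SAD_ROW(acc, ptr1, stride1, ptr2, stride2);
  SAD_ROW(acc, ptr1, stride1, ptr2, stride2);
  SAD_ROW(acc, ptr1, stride1, ptr2, stride2);
  /* Last row: no stepping needed. */
  SAD_ADD(acc, ptr1, ptr2);

  result = (uint32_t)_mm_cvtsi64_si32(acc);
  _mm_empty(); /* emms: assumption, leaves the x87 state usable afterwards */
  return result;
#else
  /* Scalar fallback so the sketch still builds without MMX/SSE. */
  uint32_t sad = 0;
  int row, col;

  for (row = 0; row < 8; row++) {
    for (col = 0; col < 8; col++) {
      int d = (int)ptr1[row * stride1 + col] - (int)ptr2[row * stride2 + col];
      sad += (uint32_t)(d < 0 ? -d : d);
    }
  }
  return sad;
#endif
}

/* Self-check on made-up data: 8x8 blocks stored with a stride of 16. */
uint32_t sad8x8_unrolled_selfcheck(void)
{
  unsigned char a[128], b[128];
  int row, col;

  memset(a, 0xAA, sizeof a); /* padding bytes must not affect the result */
  memset(b, 0x55, sizeof b);
  for (row = 0; row < 8; row++) {
    for (col = 0; col < 8; col++) {
      int i = row * 8 + col;
      a[row * 16 + col] = (unsigned char)i;
      b[row * 16 + col] = (unsigned char)(63 - i);
    }
  }

  assert(sad8x8_unrolled(a, 16, a, 16) == 0); /* identical blocks */
  return sad8x8_unrolled(a, 16, b, 16);       /* sum of |2i - 63|, i = 0..63 */
}
```

Comparing the -S output of this against the loop version would show whether 
the compiler schedules the eight rows any differently from the ".rept" block.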


