[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

John Ridges jridges at masque.com
Thu Jun 6 17:07:35 PDT 2013


Hi JM,

At line 221 in celt_lpc.c (the celt_iir function), I think you really 
want the RESTORE_STACK statement to be before the #endif instead of 
after it. Also, I couldn't help noticing that your SSE code for 
xcorr_kernel reads more than "len" elements of "_x". I don't know if 
that's really a problem when running the codec, but a tool like valgrind 
will complain if the extra reads touch uninitialized memory. Here's a 
version I wrote a few days ago that doesn't have that problem; you're 
welcome to use it:

#include <xmmintrin.h>

static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len)
{
     int j;
     __m128 xsum1 = _mm_loadu_ps(sum);
     __m128 xsum2 = _mm_setzero_ps();

     /* Main loop: four x samples per iteration, never reading past x[len-1]. */
     for (j = 0; j < len-3; j += 4) {
         const __m128 x0 = _mm_loadu_ps(x+j);   /* x[j..j+3]   */
         const __m128 y0 = _mm_loadu_ps(y+j);   /* y[j..j+3]   */
         const __m128 y3 = _mm_loadu_ps(y+j+3); /* y[j+3..j+6] */

         /* x[j] * y[j..j+3] */
         xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),y0));
         /* x[j+1] * y[j+1..j+4] */
         xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),_mm_shuffle_ps(y0,y3,0x49)));
         /* x[j+2] * y[j+2..j+5] */
         xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),_mm_shuffle_ps(y0,y3,0x9e)));
         /* x[j+3] * y[j+3..j+6] */
         xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3));
     }
     /* Tail: up to three leftover x samples, handled one at a time. */
     if (j < len) {
         xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
         if (++j < len) {
             xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
             if (++j < len) {
                 xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
             }
         }
     }
     _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2));
}
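
For checking a SIMD version like the one above, a plain-C reference is 
handy. This is only a minimal sketch, assuming a floating-point build 
where opus_val16 and opus_val32 are both float; the name 
xcorr_kernel_ref is hypothetical and not something in the opus tree. It 
touches exactly x[0..len-1] and y[0..len+2], which is the access pattern 
the SSE code above is meant to match:

```c
#include <stddef.h>

/* Scalar reference sketch (hypothetical name, not part of opus).
 * Assumes a float build: float stands in for opus_val16/opus_val32.
 * Accumulates sum[k] += x[j] * y[j+k] for k = 0..3 and j = 0..len-1,
 * so it reads len elements of x and len+3 elements of y. */
static void xcorr_kernel_ref(const float *x, const float *y, float sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            sum[k] += x[j] * y[j + k];
        }
    }
}
```

Running both kernels on the same input and comparing the four sums 
(allowing a small tolerance, since the SSE version adds the terms in a 
different order) should show up any indexing mistake quickly.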

Also, here's a version of xcorr_kernel for fixed-point ARM NEON (sorry, I 
don't have a floating-point version; I only use fixed-point opus on 
ARM):

#include <arm_neon.h>

static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len)
{
     int j;
     int32x4_t xsum1 = vld1q_s32(sum);
     int32x4_t xsum2 = vdupq_n_s32(0);

     /* Two x samples per iteration: each one is broadcast and multiplied
        by four consecutive y samples, widening to 32 bits. */
     for (j = 0; j < len-1; j += 2) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++));
         xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++));
     }
     /* Leftover sample when len is odd. */
     if (j < len) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y));
     }
     vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}
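
The fixed-point arithmetic the NEON code relies on can be spelled out 
in scalar C as well. Again just a sketch under assumptions: int16_t and 
int32_t stand in for the fixed-point opus_val16 and opus_val32, and 
xcorr_kernel_fixed_ref is a hypothetical name. Each product is widened 
to 32 bits before it is accumulated, which is what vmlal_s16 does per 
lane:

```c
#include <stdint.h>

/* Fixed-point scalar reference sketch (hypothetical name, not part of opus).
 * Assumes int16_t/int32_t stand in for opus_val16/opus_val32.
 * Each 16x16 product is widened to 32 bits before being accumulated.
 * Overflow of the 32-bit accumulator is not handled here. */
static void xcorr_kernel_fixed_ref(const int16_t *x, const int16_t *y, int32_t sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            sum[k] += (int32_t)x[j] * (int32_t)y[j + k];
        }
    }
}
```

Because the arithmetic is integer, the NEON output should match this 
bit for bit as long as the accumulators do not overflow.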


Cheers,
John Ridges



