[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
John Ridges
jridges at masque.com
Thu Jun 6 17:07:35 PDT 2013
Hi JM,
At line 221 in celt_lpc.c (the celt_iir function) I think you really
want the RESTORE_STACK statement to be before the #endif instead of
after it. Also, I couldn't help noticing that your SSE code for
xcorr_kernel reads more than "len" elements of "_x". I don't know if
that's really a problem when running the codec, but a tool like valgrind
will have a fit if it's accessing uninitialized memory. Here's a version
I wrote a few days ago that doesn't suffer from that problem, which
you're welcome to use:
#include <xmmintrin.h>
static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 *y,
                                opus_val32 sum[4], int len)
{
    int j;
    /* Two accumulators; they are added together at the end. */
    __m128 xsum1 = _mm_loadu_ps(sum);
    __m128 xsum2 = _mm_setzero_ps();
    /* Main loop: four elements of x per iteration. */
    for (j = 0; j < len-3; j += 4) {
        const __m128 x0 = _mm_loadu_ps(x+j);
        const __m128 y0 = _mm_loadu_ps(y+j);
        const __m128 y3 = _mm_loadu_ps(y+j+3);
        /* x[j] * y[j..j+3] */
        xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),y0));
        /* x[j+1] * y[j+1..j+4] */
        xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),
                                            _mm_shuffle_ps(y0,y3,0x49)));
        /* x[j+2] * y[j+2..j+5] */
        xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),
                                            _mm_shuffle_ps(y0,y3,0x9e)));
        /* x[j+3] * y[j+3..j+6] */
        xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3));
    }
    /* Tail: up to three remaining elements of x, loaded one at a time
       so that x is never read past x[len-1]. */
    if (j < len) {
        xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
        if (++j < len) {
            xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
            if (++j < len) {
                xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
            }
        }
    }
    _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2));
}
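For comparison, here is a minimal plain-C sketch of what the kernel above appears to compute: four correlation sums at lags 0 to 3. It is only a reference for checking a vectorized version, not the Opus implementation. The name xcorr_kernel_ref is hypothetical, and the typedefs assume a floating-point build in which opus_val16 and opus_val32 are both float.

```c
#include <stddef.h>

/* Assumption: floating-point build, both types are float. */
typedef float opus_val16;
typedef float opus_val32;

/* Hypothetical scalar reference (not part of Opus):
 *   sum[k] += x[j] * y[j+k]   for k = 0..3, j = 0..len-1
 * It reads exactly len elements of x and len+3 elements of y. */
static void xcorr_kernel_ref(const opus_val16 *x, const opus_val16 *y,
                             opus_val32 sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            sum[k] += x[j] * y[j + k];
        }
    }
}
```

Running both versions on the same input with "len" not a multiple of 4 exercises the tail handling. Floating-point results may differ slightly between the two because the SSE version accumulates in a different order.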
Also, here's a version of xcorr_kernel for fixed-point ARM NEON (sorry, I
don't have a floating-point version; I only use fixed-point Opus on
ARM):
#include <arm_neon.h>
static inline void xcorr_kernel(const opus_val16 *x, const opus_val16 *y,
                                opus_val32 sum[4], int len)
{
    int j;
    int32x4_t xsum1 = vld1q_s32(sum);
    int32x4_t xsum2 = vdupq_n_s32(0);
    for (j = 0; j < len-1; j += 2) {
        xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++));
        xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++));
    }
    if (j < len) {
        xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y));
    }
    vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}
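The fixed-point case can be checked the same way. The sketch below is a scalar equivalent of the NEON routine, assuming opus_val16 is a 16-bit integer and opus_val32 a 32-bit integer; the name xcorr_kernel_fixed_ref is hypothetical. Each product is widened to 32 bits before it is accumulated, which is what vmlal_s16 does per lane.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the fixed-point kernel (not part of
 * Opus). Assumes opus_val16 == int16_t and opus_val32 == int32_t. */
static void xcorr_kernel_fixed_ref(const int16_t *x, const int16_t *y,
                                   int32_t sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            /* 16x16 -> 32-bit multiply, then accumulate into 32 bits. */
            sum[k] += (int32_t)x[j] * (int32_t)y[j + k];
        }
    }
}
```

Integer addition is exact as long as the sums do not overflow, so unlike the float case the NEON and scalar results should match bit for bit on the same input.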
Cheers,
John Ridges