[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
Jean-Marc Valin
jmvalin at jmvalin.ca
Thu Jun 6 20:22:45 PDT 2013
Hi John,
Thanks for the two fixes. They're in git now. Your SSE version seems to
also be slightly faster than mine -- probably due the the partial sums.
As for the NEON code, it would be good to compare the performance with
the code Aurélien Zanelli posted at
http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
Cheers,
Jean-Marc
On 06/06/2013 08:07 PM, John Ridges wrote:
> Hi JM,
>
> At line 221 in celt_lpc.c (the celt_iir function) I think you really
> want the RESTORE_STACK statement to be before the #endif instead of
> after it. Also, I couldn't help notice that your SSE code for
> xcorr_kernel reads more than "len" elements of "_x". I don't know if
> that's really a problem when running the codec, but a tool like valgrind
> will have a fit if it's accessing uninitialized memory. Here's a version
> I wrote a few days ago you're welcome to use that doesn't suffer from
> that problem:
>
> static inline void xcorr_kernel(const opus_val16 *x, const opus_val16
> *y, opus_val32 sum[4], int len)
> {
> int j;
> __m128 xsum1 = _mm_loadu_ps(sum);
> __m128 xsum2 = _mm_setzero_ps();
>
> for (j = 0; j < len-3; j += 4) {
> const __m128 x0 = _mm_loadu_ps(x+j);
> const __m128 y0 = _mm_loadu_ps(y+j);
> const __m128 y3 = _mm_loadu_ps(y+j+3);
>
> xsum1 =
> _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),y0));
> xsum2 =
> _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),_mm_shuffle_ps(y0,y3,0x49)));
> xsum1 =
> _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),_mm_shuffle_ps(y0,y3,0x9e)));
> xsum2 =
> _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3));
> }
> if (j < len) {
> xsum1 =
> _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
> if (++j < len) {
> xsum2 =
> _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
> if (++j < len) {
> xsum1 =
> _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
> }
> }
> }
> _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2));
> }
>
> Also, here's a version of xcorr_kernel for fixed-point ARM NEON (sorry I
> don't have a floating-point version, but I only use fixed-point opus in
> ARM):
>
> #include <arm_neon.h>
>
> static inline void xcorr_kernel(const opus_val16 *x, const opus_val16
> *y, opus_val32 sum[4], int len)
> {
> int j;
> int32x4_t xsum1 = vld1q_s32(sum);
> int32x4_t xsum2 = vdupq_n_s32(0);
>
> for (j = 0; j < len-1; j += 2) {
> xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x++),vld1_s16(y++));
> xsum2 = vmlal_s16(xsum2,vdup_n_s16(*x++),vld1_s16(y++));
> }
> if (j < len) {
> xsum1 = vmlal_s16(xsum1,vdup_n_s16(*x),vld1_s16(y));
> }
> vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
> }
>
>
> Cheers,
> John Ridges
>
>
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
>
More information about the opus
mailing list