[opus] [RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Viswanath Puttagunta viswanath.puttagunta at linaro.org
Mon Dec 1 09:13:22 PST 2014


Hello Timothy,

Appreciate the thorough review. I have a few questions in-line before I
re-spin the patch.

On 28 November 2014 at 15:52, Timothy B. Terriberry <tterribe at xiph.org> wrote:
> Review comments inline.
>
>> +if OPUS_ARM_NEON_INTR
>> +noinst_LTLIBRARIES = libarmneon.la
>> +libarmneon_la_SOURCES = $(CELT_SOURCES_ARM_NEON_INTR)
>> +libarmneon_la_CPPFLAGS = $(OPUS_ARM_NEON_INTR_CPPFLAGS) -I$(top_srcdir)/include
>> +endif
>
> I don't think these should be in a separate library. It brings with it
> lots of complications (to name one: wouldn't the .pc files need to be
> updated?). Please use the same mechanism that the SSE intrinsics use to
> add CFLAGS to the compilation of specific object files, e.g.,

Sorry, I don't know what a .pc file is. No strong opinion on my side
either; I was merely following the guidance from
http://www.gnu.org/software/automake/manual/html_node/Per_002dObject-Flags.html
which specifically advocates against per-object flags.

I can follow your feedback if you still think per-object flags are the
right thing to do for libopus. Please let me know either way.

>
> $(SSE_OBJ): CFLAGS += -msse4.1
>
>> +void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
>> +    const opus_val16 *, opus_val32 *, int , int, int) = {
>> +  celt_pitch_xcorr_c,                /* ARMv4 */
>> +  celt_pitch_xcorr_c,                        /* EDSP */
>> +  celt_pitch_xcorr_c,                        /* Media */
>> +#if defined(OPUS_ARM_NEON_INTR)
>> +  celt_pitch_xcorr_float_neon                 /* Neon */
>
> Please do not use tabs in source code (this applies here and everywhere
> below). Even with the tabs expanded in context, the comments here do not
> line up properly.
Will do. Thanks.

>
>> +static void xcorr_kernel_neon_float(float *x, float *y, float sum[4], int len) {
>
> x and y should be const.
>
>> +     float32x4_t YY[5];
>> +     float32x4_t XX[4];
>> +     float32x2_t XX_2;
>> +     float32x4_t SUMM[4];
>> +     float *xi = x;
>> +     float *yi = y;
>> +     int cd = len/4;
>> +     int cr = len%4;
>
> len is signed, so / and % are NOT equivalent to the corresponding >> and
> & (they are much slower).
>
>> +     int j;
>> +
>> +     celt_assert(len>=3);
>> +
>> +     /* Initialize sums to 0 */
>> +     SUMM[0] = vdupq_n_f32(0);
>> +     SUMM[1] = vdupq_n_f32(0);
>> +     SUMM[2] = vdupq_n_f32(0);
>> +     SUMM[3] = vdupq_n_f32(0);
>> +
>> +     YY[0] = vld1q_f32(yi);
>> +
>> +     /* Each loop consumes 8 floats in y vector
>> +      * and 4 floats in x vector
>> +      */
>> +     for (j = 0; j < cd; j++) {
>> +             yi += 4;
>> +             YY[4] = vld1q_f32(yi);
>
> If len == 4, then in the first iteration you will have loaded 8 y
> values, but only 7 are guaranteed to be available (e.g., the C code only
> references y[0] up to y[len-1+3]). You need to end this loop early and
> fall back to another approach. See comments in celt_pitch_xcorr_arm.s
> for details and an example (there are other useful comments there that
> could shave another cycle or two from this inner loop).
I analyzed the implementation in celt_pitch_xcorr_arm.s and will redo my
implementation to follow the same algorithm; it seems more elegant.
This also covers the rest of your feedback on celt_neon_intr.c.
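To make sure I understood the suggestions, here is a rough sketch of the
direction I plan to take in the re-spin. It is illustrative only, modeled
on the structure in celt_pitch_xcorr_arm.s: a single x load feeding
vmlaq_lane_f32(), and the main loop ending early so y is never read past
y[len-1+3]. I use one accumulator here for brevity; the actual kernel
would likely keep several accumulators to hide the vmla latency.

#include <arm_neon.h>

/* Sketch only, not the final patch. */
static void xcorr_kernel_neon_float(const float *x, const float *y,
                                    float sum[4], int len)
{
   float32x4_t YY0, YY4, XX;
   float32x2_t XX_2;
   float32x4_t SUMM = vdupq_n_f32(0);
   int j = 0;

   /* Caller guarantees len > 0 and that y[0..len+2] is readable. */
   YY0 = vld1q_f32(y);

   /* Main loop: needs y[j+4..j+7], which is only safe while
    * j+7 <= len+2, i.e. while more than 4 x values remain. */
   for (; j < len-4; j += 4) {
      YY4 = vld1q_f32(y+j+4);
      XX  = vld1q_f32(x+j);
      SUMM = vmlaq_lane_f32(SUMM, YY0,                    vget_low_f32(XX),  0);
      SUMM = vmlaq_lane_f32(SUMM, vextq_f32(YY0, YY4, 1), vget_low_f32(XX),  1);
      SUMM = vmlaq_lane_f32(SUMM, vextq_f32(YY0, YY4, 2), vget_high_f32(XX), 0);
      SUMM = vmlaq_lane_f32(SUMM, vextq_f32(YY0, YY4, 3), vget_high_f32(XX), 1);
      YY0 = YY4;
   }

   /* Remaining 1..4 x values: y[j..j+3] is always in bounds here. */
   for (; j < len; j++) {
      YY0  = vld1q_f32(y+j);     /* redundant on the first pass */
      XX_2 = vld1_dup_f32(x+j);  /* dup load, no lane insert */
      SUMM = vmlaq_lane_f32(SUMM, YY0, XX_2, 0);
   }

   vst1q_f32(sum, SUMM);
}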
>
>> +             YY[1] = vextq_f32(YY[0], YY[4], 1);
>> +             YY[2] = vextq_f32(YY[0], YY[4], 2);
>> +             YY[3] = vextq_f32(YY[0], YY[4], 3);
>> +
>> +             XX[0] = vld1q_dup_f32(xi++);
>> +             XX[1] = vld1q_dup_f32(xi++);
>> +             XX[2] = vld1q_dup_f32(xi++);
>> +             XX[3] = vld1q_dup_f32(xi++);
>
> Don't do this. Do a single load and use vmlaq_lane_f32() to multiply by
> each value. That should cut at least 5 cycles out of this loop.
>
>> +
>> +             SUMM[0] = vmlaq_f32(SUMM[0], XX[0], YY[0]);
>> +             SUMM[1] = vmlaq_f32(SUMM[1], XX[1], YY[1]);
>> +             SUMM[2] = vmlaq_f32(SUMM[2], XX[2], YY[2]);
>> +             SUMM[3] = vmlaq_f32(SUMM[3], XX[3], YY[3]);
>> +             YY[0] = YY[4];
>> +     }
>> +
>> +     /* Handle remaining values max iterations = 3 */
>> +     for (j = 0; j < cr; j++) {
>> +             YY[0] = vld1q_f32(yi++);
>
> This load is always redundant in the first iteration, which is a bit
> unfortunate.
>
>> +             XX_2 = vld1_lane_f32(xi++, XX_2, 0);
>
> Don't load a single lane when you don't need the value(s) in the other
> lane(s). Use vld1_dup_f32() instead. It's faster and breaks dependencies.
Will keep this in mind. Thanks.
>
>> +             SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
>> +     }
>> +
>> +     SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
>> +     SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
>> +     SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
>> +
>> +     vst1q_f32(sum, SUMM[0]);
>> +}
>> +
>> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
>> +                     opus_val32 *xcorr, int len, int max_pitch, int arch) {
>
> arch is unused. There's no reason to pass it here. If we're here, we
> know what the arch is.
>
>> +     int i, j;
>> +
>> +     celt_assert(max_pitch > 0);
>> +     celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0);
>> +
>> +     for (i = 0; i < (max_pitch-3); i += 4) {
>> +             xcorr_kernel_neon_float((float *)_x, (float *)_y+i,
>> +                                     (float *)xcorr+i, len);
>> +     }
>> +
>> +     /* In case max_pitch isn't multiple of 4, do unrolled version */
>> +     for (; i < max_pitch; i++) {
>> +             float sum = 0;
>> +             float *yi = _y+i;
>> +             for (j = 0; j < len; j++)
>> +                     sum += _x[i]*yi[i];
>> +             xcorr[i] = sum;
>> +     }
>
> This loop can still be largely vectorized. Any reason not to do so?
I should have marked this as TBD in the comments. I will try to get this
done in the re-spin.
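For reference, a rough sketch of what I have in mind for that tail (at
most 3 leftover pitch lags). Sketch only, not final code; it also fixes
the indexing, which should use j rather than i in the scalar loop:

/* The remaining pitch lags can still run the inner sum over len four
 * floats at a time, with a small scalar tail. */
for (; i < max_pitch; i++) {
   const float *xi = (const float *)_x;
   const float *yi = (const float *)_y + i;
   float32x4_t acc = vdupq_n_f32(0);
   float32x2_t accp;
   float sum;
   for (j = 0; j + 3 < len; j += 4)
      acc = vmlaq_f32(acc, vld1q_f32(xi+j), vld1q_f32(yi+j));
   /* Horizontal add of the four partial sums. */
   accp = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
   sum  = vget_lane_f32(vpadd_f32(accp, accp), 0);
   for (; j < len; j++)
      sum += xi[j]*yi[j];
   xcorr[i] = sum;
}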
>
>> +}
>> diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h
>> index a07f8ac..f5adc48 100644
>> --- a/celt/arm/pitch_arm.h
>> +++ b/celt/arm/pitch_arm.h
>> @@ -52,6 +52,19 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y,
>>     ((void)(arch),PRESUME_NEON(celt_pitch_xcorr)(_x, _y, xcorr, len, max_pitch))
>>   #  endif
>>
>> -# endif
>> +#else /* Start !FIXED_POINT */
>> +/* Float case */
>> +#if defined(OPUS_ARM_NEON_INTR)
>> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
>> +                             opus_val32 *xcorr, int len, int max_pitch, int arch);
>> +#endif
>> +
>> +#if !defined(OPUS_HAVE_RTCD) && defined(OPUS_PRESUME_ARM_NEON_INTR)
>> +#define OVERRIDE_PITCH_XCORR (1)
>> +#define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>> +     ((void)(arch),celt_pitch_xcorr_float_neon(_x, _y, xcorr, \
>> +                                             len, max_pitch, arch))
>
> Again, don't pass arch.
I really did not want to pass "arch", but I couldn't figure out how to
avoid it, since the signature of celt_pitch_xcorr_c() already takes arch
as a parameter:

celt_pitch_xcorr_c(const opus_val16 *_x, const opus_val16 *_y,
       opus_val32 *xcorr, int len, int max_pitch, int arch)

Also, xcorr_kernel_c(), which is called by celt_pitch_xcorr_c(), could
be optimized in isolation for some other CPU architecture.

I don't really understand how the fixed-point code even works: doesn't
it face the same problem? Don't you get a compile error because the
function signatures don't match?

I tried to be as non-disruptive as possible here, but an alternative
thought: I'm not sure why arch is passed on every call to
celt_pitch_xcorr() and we do a table lookup for the appropriate function
each time.

Instead, couldn't we detect the arch we are running on once, during
(say) opus_custom_decoder_init() or some other appropriate function that
must always be called anyway, and assign the function pointers to the
optimized or C functions at that time?
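A rough sketch of what I mean (illustrative only; celt_rtcd_init() is a
hypothetical hook, not an existing libopus function, and the pointer type
matches the current fixed-point table, i.e. without the extra arch
argument):

/* Resolve the implementation once at init time and cache it, so later
 * calls need neither an arch argument nor a table lookup per call. */
static void (*pitch_xcorr_ptr)(const opus_val16 *, const opus_val16 *,
                               opus_val32 *, int, int);

void celt_rtcd_init(void)          /* hypothetical one-time init hook */
{
   int arch = opus_select_arch();  /* existing run-time CPU detection */
   pitch_xcorr_ptr = CELT_PITCH_XCORR_IMPL[arch&OPUS_ARCHMASK];
}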

>
>>   (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
>> -      const opus_val16 *, opus_val32 *, int, int);
>> +      const opus_val16 *, opus_val32 *, int, int
>> +#if !defined(FIXED_POINT)
>> +     ,int
>> +#endif
>> +);
>
> Which gets rid of this ugliness.
>
>> +#if defined(FIXED_POINT)
>> +#  define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>> +  ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \
>> +        xcorr, len, max_pitch)
>> +#else
>>   #  define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>>     ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \
>> -        xcorr, len, max_pitch))
>> +        xcorr, len, max_pitch, arch))
>> +#endif
>> +
>
> And this.
>
>> diff --git a/configure.ac b/configure.ac
>> index 9b2f51f..09657b6 100644
>> --- a/configure.ac
>> +++ b/configure.ac
>> @@ -198,12 +198,11 @@ cpu_arm=no
>>
>>   AS_IF([test x"${enable_asm}" = x"yes"],[
>>       inline_optimization="No ASM for your platform, please send patches"
>> +    OPUS_ARM_NEON_INTR_CPPFLAGS=
>>       case $host_cpu in
>>         arm*)
>> -        dnl Currently we only have asm for fixed-point
>> -        AS_IF([test "$enable_float" != "yes"],[
>>               cpu_arm=yes
>> -            AC_DEFINE([OPUS_ARM_ASM], [],  [Make use of ARM asm optimization])
>> +            AC_DEFINE([OPUS_ARM_ASM], [],  [Make use of ARM asm/intrinsic optimization])
>
> Not sure I'm in love with conflating intrinsics with inline assembly.
