[opus] [RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Tue Dec 9 12:18:15 PST 2014

On 8 December 2014 at 21:12, Timothy B. Terriberry <tterribe at xiph.org> wrote:
>
> Viswanath Puttagunta wrote:
> > +   SUMM = vdupq_n_f32(0);
>
> It kills me that there's no intrinsic for VMOV.F32 d0, #0 (or at least I
> couldn't find one), so this takes two instructions instead of one.
Vish>> Yes, I rechecked the Neon Programmer's guide from ARM.. don't
see a better option here.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0018a/index.html
I will investigate more and discuss with Linaro toolchain engineers
off topic later.

>
> > +   /* Consume 4 elements in x vector and 8 elements in y
> > +    * vector. However, the 8'th element in y never really gets
> > +    * touched in this loop. So, if len == 4, then we only
> > +    * must access y[0] to y[6]. y[7] must not be accessed
> > +    * hence make sure len>4 and not len>=4
> > +    */
> > +   while (len > 4) {
>
> This can be an if() instead of a while (though it looks like gcc is
> smart enough to figure that out).
>
> > +   /* Just unroll the rest of the loop */
>
> If you're not going to special case the last 2+1+1 samples, is there a
> measurable performance difference compared to simply looping?

Vish>> I got the idea of unrolling this from your earlier comment
"This load is always redundant in the first iteration, which is a bit
unfortunate."
Now, that I unrolled it and tested it to make sure it works, is there
a specific reason you think this is will be any slower than a simple
loop and want to go back? I know it saves a load as compared to a
single loop. It would be hard to prove that it is measurably better
than the simple loop. But I would sincerely prefer it this way.. It is
fairly straight forward code in my opinion.

>
> > +   yi++;
> > +   switch(len) {
> > +   case 4:
> > +      XX_2 = vld1_dup_f32(xi++);
> > +      SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> > +      YY[0] = vld1q_f32(yi++);
> > +   case 3:
> > +      XX_2 = vld1_dup_f32(xi++);
> > +      SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> > +      YY[0] = vld1q_f32(yi++);
> > +   case 2:
> > +      XX_2 = vld1_dup_f32(xi++);
> > +      SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> > +      YY[0] = vld1q_f32(yi++);
> > +   case 1:
> > +      XX_2 = vld1_dup_f32(xi++);
> > +      SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> > +   }
> > +
> > +   vst1q_f32(sum, SUMM);
> > +}
> > +
> > +/*
> > + * Function: xcorr3to1_kernel_neon_float
> > + * ---------------------------------
> > + * Computes single correlation values and stores in *sum
> > + */
> > +void xcorr3to1_kernel_neon_float(const float *x, const float *y,
> > +               float *sum, int len) {
>
> I had to think quite a bit about what "3to1" meant (since it is
> describing the context of the caller, not what the actual function
> does). I'd follow the naming convention in the existing
> celt_pitch_xcorr_arm.s, and use "process1", personally.
Vish>> My bad, I initially implemented the loop inside this function, but
later on changed the functionality. I will follow your feedback.
>
> > +   int i;
> > +   float32x4_t XX[4];
> > +   float32x4_t YY[4];
> > +   float32x4_t SUMM;
> > +   float32x2_t ZERO;
> > +   float32x2x2_t tv;
> > +   float sumi;
> > +   float *xi = x;
> > +   float *yi = y;
> > +
> > +   ZERO = vdup_n_f32(0);
> > +   SUMM = vdupq_n_f32(0);
> > +
> > +   /* Work on 16 values per cycle */
>
> s/cycle/iteration/ (here and below). In performance-critical code when
> you say cycle I think machine cycle, and NEON definitely can't process
> 16 floats in one of those.
Vish >> Got it. Thanks.
>
> > +   while (len >= 16) {
>
> > +   /* Accumulate results into single float */
> > +   tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
> > +   tv = vtrn_f32(tv.val[0], ZERO);
> > +   tv.val[0] = vadd_f32(tv.val[0], tv.val[1]);
> > +
> > +   vst1_lane_f32(&sumi, tv.val[0], 0);
>
> Accessing tv.val[0] and tv.val[1] directly seems to send these values
> through the stack, e.g.,
>
>    f4:   f3ba7085        vtrn.32 d7, d5
>    f8:   ed0b7b0f        vstr    d7, [fp, #-60]
>    fc:   ed0b5b0d        vstr    d5, [fp, #-52]
> ...
>   114:   ed1b6b09        vldr    d6, [fp, #-36]
>   118:   ed1b7b0b        vldr    d7, [fp, #-44]
>   11c:   f2077d06        vadd.f32        d7, d7, d6
>   120:   f483780f        vst1.32 {d7[0]}, [r3]
>
> Can't you just use
>    float32x2_t tv;
>    tv = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
>    tv = vpadd_f32(tv, tv);
>
> (you can get rid of ZERO, then, too)
Vish>> Thanks for the tip. I missed this vpadd_f32 instruction. I will use this.
    I always thought vtrn was a contortionist way to do what I wanted.
    Although using tv.val[x] shouldn't have to use the stack... the
compiler should be smarter than that.
    I will investigate about this later.. but will use vpadd_f32 for now.

>
> > +   for (i = 0; i < len; i++)
> > +      sumi += xi[i] * yi[i];
> > +   *sum = sumi;
>
> This bounces things through the stack into vfp registers, which is a
> huge stall. I'd continue to use NEON here with
> vld1_dup_f32()/vmla_f32()/etc.
Vish>> OK, will do.
>
> > +}
>
> > +#if ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \
> > +  || defined(OPUS_ARM_NEON_INTR))
> >   #include "arm/arm_celt_map.c"
> >   #endif
> >
> > +
>
> Unrelated whitespace change.
Vish>> OK, will do.
>
> > -    inline_optimization="No ASM for your platform, please send patches"
> > +    inline_optimization="No in-line ASM for your platform, please send patches"
>
> "inline" is one word.
Vish>> OK, will do.
>
> > +      AC_MSG_CHECKING(if compiler supports arm neon intrinsics)
>
> Capitalize ARM and NEON, please.
Vish>> OK, will do.
>
> > +      save_CFLAGS="$CFLAGS"; CFLAGS="-mfpu=neon $CFLAGS"
> > +      AC_COMPILE_IFELSE(
>
> Can we use AC_LINK_IFELSE? We had a problem where sometimes if SSE/AVX
> was not available, but the headers existed, the #include would succeed
> but no functions would get defined. My arm_neon.h seems to be written
> better than that, but I'd like to guard against other implementations
> having similar problems (and keep things consistent with the SSE tests).
>
> > +         [AC_LANG_PROGRAM([[#include <arm_neon.h>]], [])],
>
> You also need to include a call to an actual NEON intrinsic here. If the
> function is not defined, even a call to it here will compile (with an
> implicit declaration warning), but linking will fail.
Vish>> OK, will do.
>
> > +         [
> > +            OPUS_ARM_NEON_INTR=1
> > +            AC_MSG_RESULT([yes])
> > +         ],[
> > +            OPUS_ARM_NEON_INR=0
>
> OPUS_ARM_NEON_INTR (you're missing a 'T')
Vish>> Good catch. Thanks.
>
> > +            AC_MSG_RESULT([no])
> > +         ])
> > +      CFLAGS="$save_CFLAGS"
> > +      #Now we know if compiler supports ARM neon intrinsics or not
> > +
> > +      #Currently we only have intrinsic optimization for floating point
> > +      AS_IF([test x"$enable_float" = x"yes"],
> > +      [
> > +         AS_IF([test x"$OPUS_ARM_NEON_INTR" = x"1"],
> > +         [
> > +            OPUS_ARM_NEON_INTR_CPPFLAGS="-mfpu=neon -O3"
>
> I don't think you should change the optimization level here.
Vish >> OK, will add flags in Makefile.am
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus