[opus] [RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Timothy B. Terriberry tterribe at xiph.org
Mon Dec 8 19:12:50 PST 2014


Viswanath Puttagunta wrote:
> +   SUMM = vdupq_n_f32(0);

It kills me that there's no intrinsic for VMOV.F32 d0, #0 (or at least I 
couldn't find one), so this takes two instructions instead of one.

> +   /* Consume 4 elements in x vector and 8 elements in y
> +    * vector. However, the 8'th element in y never really gets
> +    * touched in this loop. So, if len == 4, then we only
> +    * must access y[0] to y[6]. y[7] must not be accessed
> +    * hence make sure len>4 and not len>=4
> +    */
> +   while (len > 4) {

This can be an if() instead of a while (though it looks like gcc is 
smart enough to figure that out).

> +   /* Just unroll the rest of the loop */

If you're not going to special case the last 2+1+1 samples, is there a 
measurable performance difference compared to simply looping?
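
Something like this, for comparison (an untested sketch reusing the patch's
variables, and ignoring the surrounding pointer bookkeeping). The conditional
reload mirrors the switch's case 1, which skips the final y load so we still
never read past the end of y:

   while (len > 0) {
      XX_2 = vld1_dup_f32(xi++);
      SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
      if (--len > 0)
         YY[0] = vld1q_f32(yi++);
   }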

> +   yi++;
> +   switch(len) {
> +   case 4:
> +      XX_2 = vld1_dup_f32(xi++);
> +      SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +      YY[0] = vld1q_f32(yi++);
> +   case 3:
> +      XX_2 = vld1_dup_f32(xi++);
> +      SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +      YY[0] = vld1q_f32(yi++);
> +   case 2:
> +      XX_2 = vld1_dup_f32(xi++);
> +      SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +      YY[0] = vld1q_f32(yi++);
> +   case 1:
> +      XX_2 = vld1_dup_f32(xi++);
> +      SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> +   }
> +
> +   vst1q_f32(sum, SUMM);
> +}
> +
> +/*
> + * Function: xcorr3to1_kernel_neon_float
> + * ---------------------------------
> + * Computes single correlation values and stores in *sum
> + */
> +void xcorr3to1_kernel_neon_float(const float *x, const float *y,
> +               float *sum, int len) {

I had to think quite a bit about what "3to1" meant (since it is 
describing the context of the caller, not what the actual function 
does). I'd follow the naming convention in the existing 
celt_pitch_xcorr_arm.s, and use "process1", personally.

> +   int i;
> +   float32x4_t XX[4];
> +   float32x4_t YY[4];
> +   float32x4_t SUMM;
> +   float32x2_t ZERO;
> +   float32x2x2_t tv;
> +   float sumi;
> +   float *xi = x;
> +   float *yi = y;
> +
> +   ZERO = vdup_n_f32(0);
> +   SUMM = vdupq_n_f32(0);
> +
> +   /* Work on 16 values per cycle */

s/cycle/iteration/ (here and below). In performance-critical code, when you 
say "cycle" I think machine cycle, and NEON definitely can't process 16 
floats in one of those.

> +   while (len >= 16) {

> +   /* Accumulate results into single float */
> +   tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
> +   tv = vtrn_f32(tv.val[0], ZERO);
> +   tv.val[0] = vadd_f32(tv.val[0], tv.val[1]);
> +
> +   vst1_lane_f32(&sumi, tv.val[0], 0);

Accessing tv.val[0] and tv.val[1] directly seems to send these values 
through the stack, e.g.,

   f4:   f3ba7085        vtrn.32 d7, d5
   f8:   ed0b7b0f        vstr    d7, [fp, #-60]
   fc:   ed0b5b0d        vstr    d5, [fp, #-52]
...
  114:   ed1b6b09        vldr    d6, [fp, #-36]
  118:   ed1b7b0b        vldr    d7, [fp, #-44]
  11c:   f2077d06        vadd.f32        d7, d7, d6
  120:   f483780f        vst1.32 {d7[0]}, [r3]

Can't you just use
   float32x2_t tv;
   tv = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
   tv = vpadd_f32(tv, tv);

(you can get rid of ZERO, then, too)

> +   for (i = 0; i < len; i++)
> +      sumi += xi[i] * yi[i];
> +   *sum = sumi;

This bounces things through the stack into VFP registers, which is a 
huge stall. I'd continue to use NEON here with 
vld1_dup_f32()/vmla_f32()/etc.
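
Roughly like this (an untested sketch, assuming the float32x2_t tv from the 
suggestion above; XX_2/YY_2 are new float32x2_t temporaries). After the 
vpadd both lanes of tv hold the partial sum, so accumulating the duplicated 
lanes into it is harmless because only lane 0 is stored, and sumi goes away 
entirely:

   for (i = 0; i < len; i++) {
      XX_2 = vld1_dup_f32(xi + i);
      YY_2 = vld1_dup_f32(yi + i);
      tv = vmla_f32(tv, XX_2, YY_2);
   }
   vst1_lane_f32(sum, tv, 0);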

> +}

> +#if ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \
> +  || defined(OPUS_ARM_NEON_INTR))
>   #include "arm/arm_celt_map.c"
>   #endif
>
> +

Unrelated whitespace change.

> -    inline_optimization="No ASM for your platform, please send patches"
> +    inline_optimization="No in-line ASM for your platform, please send patches"

"inline" is one word.

> +      AC_MSG_CHECKING(if compiler supports arm neon intrinsics)

Capitalize ARM and NEON, please.

> +      save_CFLAGS="$CFLAGS"; CFLAGS="-mfpu=neon $CFLAGS"
> +      AC_COMPILE_IFELSE(

Can we use AC_LINK_IFELSE? We had a problem where sometimes, if SSE/AVX 
was not available but the headers existed, the #include would succeed 
but no functions would get defined. My arm_neon.h seems to be written 
better than that, but I'd like to guard against other implementations 
having similar problems (and keep things consistent with the SSE tests).

> +         [AC_LANG_PROGRAM([[#include <arm_neon.h>]], [])],

You also need to include a call to an actual NEON intrinsic here. If the 
intrinsic is not defined, even a call to it will still compile (with an 
implicit-declaration warning), but linking will fail.
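
For example (an untested sketch; the variable name is arbitrary and any NEON 
intrinsic will do), putting something like this in the body argument of 
AC_LANG_PROGRAM is enough:

   float32x4_t zero = vdupq_n_f32(0);
   (void)zero;

Combined with AC_LINK_IFELSE, the check then fails when arm_neon.h exists 
but the intrinsic isn't actually defined.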

> +         [
> +            OPUS_ARM_NEON_INTR=1
> +            AC_MSG_RESULT([yes])
> +         ],[
> +            OPUS_ARM_NEON_INR=0

OPUS_ARM_NEON_INTR (you're missing a 'T')

> +            AC_MSG_RESULT([no])
> +         ])
> +      CFLAGS="$save_CFLAGS"
> +      #Now we know if compiler supports ARM neon intrinsics or not
> +
> +      #Currently we only have intrinsic optimization for floating point
> +      AS_IF([test x"$enable_float" = x"yes"],
> +      [
> +         AS_IF([test x"$OPUS_ARM_NEON_INTR" = x"1"],
> +         [
> +            OPUS_ARM_NEON_INTR_CPPFLAGS="-mfpu=neon -O3"

I don't think you should change the optimization level here.

