[opus] [RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Viswanath Puttagunta
viswanath.puttagunta at linaro.org
Tue Dec 9 12:18:15 PST 2014
On 8 December 2014 at 21:12, Timothy B. Terriberry <tterribe at xiph.org> wrote:
>
> Viswanath Puttagunta wrote:
> > + SUMM = vdupq_n_f32(0);
>
> It kills me that there's no intrinsic for VMOV.F32 d0, #0 (or at least I
> couldn't find one), so this takes two instructions instead of one.
Vish>> Yes, I rechecked the Neon Programmer's guide from ARM.. don't
see a better option here.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0018a/index.html
I will investigate more and discuss with Linaro toolchain engineers
off topic later.
>
> > + /* Consume 4 elements in x vector and 8 elements in y
> > + * vector. However, the 8'th element in y never really gets
> > + * touched in this loop. So, if len == 4, then we only
> > + * must access y[0] to y[6]. y[7] must not be accessed
> > + * hence make sure len>4 and not len>=4
> > + */
> > + while (len > 4) {
>
> This can be an if() instead of a while (though it looks like gcc is
> smart enough to figure that out).
>
> > + /* Just unroll the rest of the loop */
>
> If you're not going to special case the last 2+1+1 samples, is there a
> measurable performance difference compared to simply looping?
Vish>> I got the idea of unrolling this from your earlier comment
"This load is always redundant in the first iteration, which is a bit
unfortunate."
Now, that I unrolled it and tested it to make sure it works, is there
a specific reason you think this is will be any slower than a simple
loop and want to go back? I know it saves a load as compared to a
single loop. It would be hard to prove that it is measurably better
than the simple loop. But I would sincerely prefer it this way.. It is
fairly straight forward code in my opinion.
>
> > + yi++;
> > + switch(len) {
> > + case 4:
> > + XX_2 = vld1_dup_f32(xi++);
> > + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> > + YY[0] = vld1q_f32(yi++);
> > + case 3:
> > + XX_2 = vld1_dup_f32(xi++);
> > + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> > + YY[0] = vld1q_f32(yi++);
> > + case 2:
> > + XX_2 = vld1_dup_f32(xi++);
> > + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> > + YY[0] = vld1q_f32(yi++);
> > + case 1:
> > + XX_2 = vld1_dup_f32(xi++);
> > + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> > + }
> > +
> > + vst1q_f32(sum, SUMM);
> > +}
> > +
> > +/*
> > + * Function: xcorr3to1_kernel_neon_float
> > + * ---------------------------------
> > + * Computes single correlation values and stores in *sum
> > + */
> > +void xcorr3to1_kernel_neon_float(const float *x, const float *y,
> > + float *sum, int len) {
>
> I had to think quite a bit about what "3to1" meant (since it is
> describing the context of the caller, not what the actual function
> does). I'd follow the naming convention in the existing
> celt_pitch_xcorr_arm.s, and use "process1", personally.
Vish>> My bad, I initially implemented the loop inside this function, but
later on changed the functionality. I will follow your feedback.
>
> > + int i;
> > + float32x4_t XX[4];
> > + float32x4_t YY[4];
> > + float32x4_t SUMM;
> > + float32x2_t ZERO;
> > + float32x2x2_t tv;
> > + float sumi;
> > + float *xi = x;
> > + float *yi = y;
> > +
> > + ZERO = vdup_n_f32(0);
> > + SUMM = vdupq_n_f32(0);
> > +
> > + /* Work on 16 values per cycle */
>
> s/cycle/iteration/ (here and below). In performance-critical code when
> you say cycle I think machine cycle, and NEON definitely can't process
> 16 floats in one of those.
Vish >> Got it. Thanks.
>
> > + while (len >= 16) {
>
> > + /* Accumulate results into single float */
> > + tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
> > + tv = vtrn_f32(tv.val[0], ZERO);
> > + tv.val[0] = vadd_f32(tv.val[0], tv.val[1]);
> > +
> > + vst1_lane_f32(&sumi, tv.val[0], 0);
>
> Accessing tv.val[0] and tv.val[1] directly seems to send these values
> through the stack, e.g.,
>
> f4: f3ba7085 vtrn.32 d7, d5
> f8: ed0b7b0f vstr d7, [fp, #-60]
> fc: ed0b5b0d vstr d5, [fp, #-52]
> ...
> 114: ed1b6b09 vldr d6, [fp, #-36]
> 118: ed1b7b0b vldr d7, [fp, #-44]
> 11c: f2077d06 vadd.f32 d7, d7, d6
> 120: f483780f vst1.32 {d7[0]}, [r3]
>
> Can't you just use
> float32x2_t tv;
> tv = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
> tv = vpadd_f32(tv, tv);
>
> (you can get rid of ZERO, then, too)
Vish>> Thanks for the tip. I missed this vpadd_f32 instruction. I will use this.
I always thought vtrn was a contortionist way to do what I wanted.
Although using tv.val[x] shouldn't have to use the stack... the
compiler should be smarter than that.
I will investigate about this later.. but will use vpadd_f32 for now.
>
> > + for (i = 0; i < len; i++)
> > + sumi += xi[i] * yi[i];
> > + *sum = sumi;
>
> This bounces things through the stack into vfp registers, which is a
> huge stall. I'd continue to use NEON here with
> vld1_dup_f32()/vmla_f32()/etc.
Vish>> OK, will do.
>
> > +}
>
> > +#if ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \
> > + || defined(OPUS_ARM_NEON_INTR))
> > #include "arm/arm_celt_map.c"
> > #endif
> >
> > +
>
> Unrelated whitespace change.
Vish>> OK, will do.
>
> > - inline_optimization="No ASM for your platform, please send patches"
> > + inline_optimization="No in-line ASM for your platform, please send patches"
>
> "inline" is one word.
Vish>> OK, will do.
>
> > + AC_MSG_CHECKING(if compiler supports arm neon intrinsics)
>
> Capitalize ARM and NEON, please.
Vish>> OK, will do.
>
> > + save_CFLAGS="$CFLAGS"; CFLAGS="-mfpu=neon $CFLAGS"
> > + AC_COMPILE_IFELSE(
>
> Can we use AC_LINK_IFELSE? We had a problem where sometimes if SSE/AVX
> was not available, but the headers existed, the #include would succeed
> but no functions would get defined. My arm_neon.h seems to be written
> better than that, but I'd like to guard against other implementations
> having similar problems (and keep things consistent with the SSE tests).
>
> > + [AC_LANG_PROGRAM([[#include <arm_neon.h>]], [])],
>
> You also need to include a call to an actual NEON intrinsic here. If the
> function is not defined, even a call to it here will compile (with an
> implicit declaration warning), but linking will fail.
Vish>> OK, will do.
>
> > + [
> > + OPUS_ARM_NEON_INTR=1
> > + AC_MSG_RESULT([yes])
> > + ],[
> > + OPUS_ARM_NEON_INR=0
>
> OPUS_ARM_NEON_INTR (you're missing a 'T')
Vish>> Good catch. Thanks.
>
> > + AC_MSG_RESULT([no])
> > + ])
> > + CFLAGS="$save_CFLAGS"
> > + #Now we know if compiler supports ARM neon intrinsics or not
> > +
> > + #Currently we only have intrinsic optimization for floating point
> > + AS_IF([test x"$enable_float" = x"yes"],
> > + [
> > + AS_IF([test x"$OPUS_ARM_NEON_INTR" = x"1"],
> > + [
> > + OPUS_ARM_NEON_INTR_CPPFLAGS="-mfpu=neon -O3"
>
> I don't think you should change the optimization level here.
Vish >> OK, will add flags in Makefile.am
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus
More information about the opus
mailing list