[opus] [RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Timothy B. Terriberry
tterribe at xiph.org
Mon Dec 8 19:12:50 PST 2014
Viswanath Puttagunta wrote:
> + SUMM = vdupq_n_f32(0);
It kills me that there's no intrinsic for VMOV.F32 d0, #0 (or at least I
couldn't find one), so this takes two instructions instead of one.
> + /* Consume 4 elements in x vector and 8 elements in y
> + * vector. However, the 8'th element in y never really gets
> + * touched in this loop. So, if len == 4, then we only
> + * must access y[0] to y[6]. y[7] must not be accessed
> + * hence make sure len>4 and not len>=4
> + */
> + while (len > 4) {
This can be an if() instead of a while (though it looks like gcc is
smart enough to figure that out).
> + /* Just unroll the rest of the loop */
If you're not going to special case the last 2+1+1 samples, is there a
measurable performance difference compared to simply looping?
> + yi++;
> + switch(len) {
> + case 4:
> + XX_2 = vld1_dup_f32(xi++);
> + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> + YY[0] = vld1q_f32(yi++);
> + case 3:
> + XX_2 = vld1_dup_f32(xi++);
> + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> + YY[0] = vld1q_f32(yi++);
> + case 2:
> + XX_2 = vld1_dup_f32(xi++);
> + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> + YY[0] = vld1q_f32(yi++);
> + case 1:
> + XX_2 = vld1_dup_f32(xi++);
> + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> + }
> +
> + vst1q_f32(sum, SUMM);
> +}
> +
> +/*
> + * Function: xcorr3to1_kernel_neon_float
> + * ---------------------------------
> + * Computes single correlation values and stores in *sum
> + */
> +void xcorr3to1_kernel_neon_float(const float *x, const float *y,
> + float *sum, int len) {
I had to think quite a bit about what "3to1" meant (since it is
describing the context of the caller, not what the actual function
does). I'd follow the naming convention in the existing
celt_pitch_xcorr_arm.s, and use "process1", personally.
> + int i;
> + float32x4_t XX[4];
> + float32x4_t YY[4];
> + float32x4_t SUMM;
> + float32x2_t ZERO;
> + float32x2x2_t tv;
> + float sumi;
> + float *xi = x;
> + float *yi = y;
> +
> + ZERO = vdup_n_f32(0);
> + SUMM = vdupq_n_f32(0);
> +
> + /* Work on 16 values per cycle */
s/cycle/iteration/ (here and below). In performance-critical code when
you say cycle I think machine cycle, and NEON definitely can't process
16 floats in one of those.
> + while (len >= 16) {
> + /* Accumulate results into single float */
> + tv.val[0] = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
> + tv = vtrn_f32(tv.val[0], ZERO);
> + tv.val[0] = vadd_f32(tv.val[0], tv.val[1]);
> +
> + vst1_lane_f32(&sumi, tv.val[0], 0);
Accessing tv.val[0] and tv.val[1] directly seems to send these values
through the stack, e.g.,
f4: f3ba7085 vtrn.32 d7, d5
f8: ed0b7b0f vstr d7, [fp, #-60]
fc: ed0b5b0d vstr d5, [fp, #-52]
...
114: ed1b6b09 vldr d6, [fp, #-36]
118: ed1b7b0b vldr d7, [fp, #-44]
11c: f2077d06 vadd.f32 d7, d7, d6
120: f483780f vst1.32 {d7[0]}, [r3]
Can't you just use
float32x2_t tv;
tv = vadd_f32(vget_low_f32(SUMM), vget_high_f32(SUMM));
tv = vpadd_f32(tv, tv);
(you can get rid of ZERO, then, too)
> + for (i = 0; i < len; i++)
> + sumi += xi[i] * yi[i];
> + *sum = sumi;
This bounces things through the stack into vfp registers, which is a
huge stall. I'd continue to use NEON here with
vld1_dup_f32()/vmla_f32()/etc.
> +}
> +#if ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \
> + || defined(OPUS_ARM_NEON_INTR))
> #include "arm/arm_celt_map.c"
> #endif
>
> +
Unrelated whitespace change.
> - inline_optimization="No ASM for your platform, please send patches"
> + inline_optimization="No in-line ASM for your platform, please send patches"
"inline" is one word.
> + AC_MSG_CHECKING(if compiler supports arm neon intrinsics)
Capitalize ARM and NEON, please.
> + save_CFLAGS="$CFLAGS"; CFLAGS="-mfpu=neon $CFLAGS"
> + AC_COMPILE_IFELSE(
Can we use AC_LINK_IFELSE? We had a problem where sometimes if SSE/AVX
was not available, but the headers existed, the #include would succeed
but no functions would get defined. My arm_neon.h seems to be written
better than that, but I'd like to guard against other implementations
having similar problems (and keep things consistent with the SSE tests).
> + [AC_LANG_PROGRAM([[#include <arm_neon.h>]], [])],
You also need to include a call to an actual NEON intrinsic here. If the
function is not defined, even a call to it here will compile (with an
implicit declaration warning), but linking will fail.
> + [
> + OPUS_ARM_NEON_INTR=1
> + AC_MSG_RESULT([yes])
> + ],[
> + OPUS_ARM_NEON_INR=0
OPUS_ARM_NEON_INTR (you're missing a 'T')
> + AC_MSG_RESULT([no])
> + ])
> + CFLAGS="$save_CFLAGS"
> + #Now we know if compiler supports ARM neon intrinsics or not
> +
> + #Currently we only have intrinsic optimization for floating point
> + AS_IF([test x"$enable_float" = x"yes"],
> + [
> + AS_IF([test x"$OPUS_ARM_NEON_INTR" = x"1"],
> + [
> + OPUS_ARM_NEON_INTR_CPPFLAGS="-mfpu=neon -O3"
I don't think you should change the optimization level here.
More information about the opus
mailing list