[opus] [RFC V3 7/8] armv7, armv8: Optimize fixed point fft using NE10 library

Timothy B. Terriberry tterriberry at mozilla.com
Fri Oct 16 05:36:20 PDT 2015

Phil Wang <Phil.Wang at arm.com> wrote:
> Sorry for late reply. I have upstreamed the patch to fix the regression here:
> https://github.com/projectNe10/Ne10/commit/ee5d856cd9cb8c4a15ace567df4239f4e788d043

Great! Thanks, Phil. In case you have not seen, the float and fixed-point
FFT and MDCT patches have landed in libopus master for ARMv7. I split out
the aarch64 parts in the interest of making progress.

BTW, at some point (with gcc 4.7, I believe), I was having compiler  
issues with NE10, and needed the following patch:

--- a/modules/dsp/NE10_fft_generic_float32.neonintrinsic.cpp
+++ b/modules/dsp/NE10_fft_generic_float32.neonintrinsic.cpp
@@ -62,18 +62,18 @@ typedef float32x4_t   REAL;
          vst2q_f32 ((ne10_float32_t*) (PTR), OUT); \
      } while (0)

  static inline void NE10_LOAD_TW_AND_MUL (CPLX &scratch_in,
          const ne10_fft_cpx_float32_t *ptr_in)
      CPLX scratch_tw;
      float32x2_t d2_tmp = vld1_f32 ((ne10_float32_t *)ptr_in);
-    scratch_tw.val[0] = NE10_REAL_DUP_NEON_F32 (d2_tmp[0]);
-    scratch_tw.val[1] = NE10_REAL_DUP_NEON_F32 (d2_tmp[1]);
+    scratch_tw.val[0] = vdupq_lane_f32 (d2_tmp, 0);
+    scratch_tw.val[1] = vdupq_lane_f32 (d2_tmp, 1);
      NE10_CPX_MUL_NEON_F32 (scratch_in, scratch_in, scratch_tw);

  static inline REAL NE10_S_MUL_NEON_F32 (const REAL vec,
          const ne10_float32_t scalar)
      REAL scalar_neon = NE10_REAL_DUP_NEON_F32 (scalar);
      REAL result = scalar_neon * vec;
diff --git a/modules/dsp/NE10_fft_generic_int32.neonintrinsic.h  
index 0ded4a3..6561ae3 100644
--- a/modules/dsp/NE10_fft_generic_int32.neonintrinsic.h
+++ b/modules/dsp/NE10_fft_generic_int32.neonintrinsic.h
@@ -150,18 +150,18 @@ static inline void NE10_CPX_MUL_NEON_S32 (CPLX &result, const CPLX A, const CPLX
  template<int RADIX>
  inline void NE10_LOAD_TW_AND_MUL (CPLX scratch_in[RADIX],
          const ne10_fft_cpx_int32_t *ptr_in,
          const ne10_int32_t step)
      CPLX scratch_tw;
      int32x2_t d2_tmp = vld1_s32 ((ne10_int32_t *)(ptr_in + (RADIX - 2) * step));

-    scratch_tw.val[0] = NE10_REAL_DUP_NEON_S32 (d2_tmp[0]);
-    scratch_tw.val[1] = NE10_REAL_DUP_NEON_S32 (d2_tmp[1]);
+    scratch_tw.val[0] = vdupq_lane_s32 (d2_tmp, 0);
+    scratch_tw.val[1] = vdupq_lane_s32 (d2_tmp, 1);
      NE10_CPX_MUL_NEON_S32 (scratch_in[RADIX - 1], scratch_in[RADIX - 1], scratch_tw);

      NE10_LOAD_TW_AND_MUL<RADIX - 1> (scratch_in, ptr_in, step);

  inline void NE10_LOAD_TW_AND_MUL<1> (CPLX [1],
          const ne10_fft_cpx_int32_t *,

I'm sure that's not the right solution, since I didn't really
understand what the NE10_REAL_DUP_NEON_* macros were trying to
accomplish (or even why they needed to be macros). With gcc 4.8.2
(what I wound up using for final testing on ARMv7), I don't believe
these changes were necessary. However, by my naive reading of the code,
it seemed like the versions currently in NE10 will bounce values through
ARM registers (which would, of course, be very slow). I didn't check the
generated asm to confirm, though. You may want to take a look.
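
For reference, the difference between the two forms can be sketched in
isolation. This is only an illustration, not NE10 code. It assumes that
NE10_REAL_DUP_NEON_F32 expands to something like vdupq_n_f32 (the macro
definition is not shown in the patch above), it uses vget_lane_f32 as a
stand-in for the d2_tmp[0] subscript, and the function names are
hypothetical. On hosts without NEON it falls back to plain scalar loops
so that it still compiles.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Pre-patch shape: pull each lane out as a scalar, then broadcast it.
 * vget_lane_f32 stands in for the d2_tmp[0] subscript, and vdupq_n_f32
 * is an ASSUMED expansion of NE10_REAL_DUP_NEON_F32. */
static void broadcast_via_scalar(const float tw[2], float re[4], float im[4])
{
    float32x2_t d2_tmp = vld1_f32(tw);
    vst1q_f32(re, vdupq_n_f32(vget_lane_f32(d2_tmp, 0)));
    vst1q_f32(im, vdupq_n_f32(vget_lane_f32(d2_tmp, 1)));
}

/* Patched shape: broadcast directly from the vector lane. */
static void broadcast_via_lane(const float tw[2], float re[4], float im[4])
{
    float32x2_t d2_tmp = vld1_f32(tw);
    vst1q_f32(re, vdupq_lane_f32(d2_tmp, 0));
    vst1q_f32(im, vdupq_lane_f32(d2_tmp, 1));
}
#else
/* Scalar stand-ins so the sketch builds on hosts without NEON. */
static void broadcast_via_scalar(const float tw[2], float re[4], float im[4])
{
    size_t i;
    for (i = 0; i < 4; i++) {
        re[i] = tw[0];
        im[i] = tw[1];
    }
}

static void broadcast_via_lane(const float tw[2], float re[4], float im[4])
{
    size_t i;
    for (i = 0; i < 4; i++) {
        re[i] = tw[0];
        im[i] = tw[1];
    }
}
#endif

/* Returns 1 when both variants fill all four lanes with tw[0] (real)
 * and tw[1] (imaginary), and 0 otherwise. */
static int broadcasts_agree(const float tw[2])
{
    float re_a[4], im_a[4], re_b[4], im_b[4];
    size_t i;

    broadcast_via_scalar(tw, re_a, im_a);
    broadcast_via_lane(tw, re_b, im_b);
    for (i = 0; i < 4; i++) {
        if (re_a[i] != re_b[i] || im_a[i] != im_b[i])
            return 0;
        if (re_a[i] != tw[0] || im_a[i] != tw[1])
            return 0;
    }
    return 1;
}
```

Both variants compute the same values, so a check like
broadcasts_agree() only confirms equivalence. Whether the scalar form
really goes through ARM registers depends on the compiler and flags, and
would have to be confirmed from the generated asm.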
