[opus] [RFC V3 7/8] armv7, armv8: Optimize fixed point fft using NE10 library
Timothy B. Terriberry
tterriberry at mozilla.com
Fri Oct 16 05:36:20 PDT 2015
Phil Wang <Phil.Wang at arm.com> wrote:
> Sorry for late reply. I have upstreamed the patch to fix the regression here:
>
> https://github.com/projectNe10/Ne10/commit/ee5d856cd9cb8c4a15ace567df4239f4e788d043
Great! Thanks Phil. In case you have not seen, the float and
fixed-point FFT and MDCT patches have landed in libopus master for
ARMv7. I split out the aarch64 parts in the interest of making progress.
BTW, at some point (with gcc 4.7, I believe), I was having compiler
issues with NE10, and needed the following patch:
--- a/modules/dsp/NE10_fft_generic_float32.neonintrinsic.cpp
+++ b/modules/dsp/NE10_fft_generic_float32.neonintrinsic.cpp
@@ -62,18 +62,18 @@ typedef float32x4_t REAL;
vst2q_f32 ((ne10_float32_t*) (PTR), OUT); \
} while (0)
static inline void NE10_LOAD_TW_AND_MUL (CPLX &scratch_in,
const ne10_fft_cpx_float32_t *ptr_in)
{
CPLX scratch_tw;
float32x2_t d2_tmp = vld1_f32 ((ne10_float32_t *)ptr_in);
- scratch_tw.val[0] = NE10_REAL_DUP_NEON_F32 (d2_tmp[0]);
- scratch_tw.val[1] = NE10_REAL_DUP_NEON_F32 (d2_tmp[1]);
+ scratch_tw.val[0] = vdupq_lane_f32 (d2_tmp, 0);
+ scratch_tw.val[1] = vdupq_lane_f32 (d2_tmp, 1);
NE10_CPX_MUL_NEON_F32 (scratch_in, scratch_in, scratch_tw);
}
static inline REAL NE10_S_MUL_NEON_F32 (const REAL vec,
const ne10_float32_t scalar)
{
REAL scalar_neon = NE10_REAL_DUP_NEON_F32 (scalar);
REAL result = scalar_neon * vec;
diff --git a/modules/dsp/NE10_fft_generic_int32.neonintrinsic.h b/modules/dsp/NE10_fft_generic_int32.neonintrinsic.h
index 0ded4a3..6561ae3 100644
--- a/modules/dsp/NE10_fft_generic_int32.neonintrinsic.h
+++ b/modules/dsp/NE10_fft_generic_int32.neonintrinsic.h
@@ -150,18 +150,18 @@ static inline void NE10_CPX_MUL_NEON_S32 (CPLX &result, const CPLX A, const CPLX
template<int RADIX>
inline void NE10_LOAD_TW_AND_MUL (CPLX scratch_in[RADIX],
const ne10_fft_cpx_int32_t *ptr_in,
const ne10_int32_t step)
{
CPLX scratch_tw;
int32x2_t d2_tmp = vld1_s32 ((ne10_int32_t *)(ptr_in + (RADIX - 2) * step));
- scratch_tw.val[0] = NE10_REAL_DUP_NEON_S32 (d2_tmp[0]);
- scratch_tw.val[1] = NE10_REAL_DUP_NEON_S32 (d2_tmp[1]);
+ scratch_tw.val[0] = vdupq_lane_s32 (d2_tmp, 0);
+ scratch_tw.val[1] = vdupq_lane_s32 (d2_tmp, 1);
NE10_CPX_MUL_NEON_S32 (scratch_in[RADIX - 1], scratch_in[RADIX - 1], scratch_tw);
NE10_LOAD_TW_AND_MUL<RADIX - 1> (scratch_in, ptr_in, step);
}
template<>
inline void NE10_LOAD_TW_AND_MUL<1> (CPLX [1],
const ne10_fft_cpx_int32_t *,
I'm sure that's not the right solution, since I didn't really
understand what the NE10_REAL_DUP_NEON_* macros were trying to
accomplish (or even why they needed to be macros). With gcc 4.8.2
(which I wound up using for final testing on ARMv7), I don't believe
these changes were necessary. However, by my naive reading of the code,
it seemed like the versions currently in NE10 will bounce values
through ARM core registers (which would, of course, be very slow). I
didn't check the generated asm to confirm, though. You may want to
take a look.
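To make the concern concrete, here is a minimal sketch of the two ways to broadcast one lane of a twiddle factor. It is not NE10 code: the helper names broadcast_twiddle_lane and broadcast_twiddle_scalar are hypothetical, and I am assuming that the NE10_REAL_DUP_NEON_F32 macro amounts to a vdupq_n_f32-style broadcast of a scalar, which the patch above does not show. Whether a given compiler actually routes the scalar form through a core register would still need to be checked in the generated asm. The non-NEON branch is only a stand-in so the sketch builds on other targets.

```cpp
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

// Hypothetical helper (not part of NE10): broadcast lane LANE of a
// two-float twiddle (re, im) to all four elements of out, using the
// lane form that the patch switches to.
template<int LANE>
inline void broadcast_twiddle_lane (const float tw[2], float out[4])
{
    static_assert (LANE == 0 || LANE == 1, "a float32x2_t has two lanes");
#if SKETCH_HAVE_NEON
    float32x2_t d2_tmp = vld1_f32 (tw);
    // The source operand is a NEON register lane, so no scalar value
    // is named anywhere in the expression.
    vst1q_f32 (out, vdupq_lane_f32 (d2_tmp, LANE));
#else
    // Portable stand-in with the same result.
    for (int i = 0; i < 4; ++i) out[i] = tw[LANE];
#endif
}

// Hypothetical helper (not part of NE10): same result, but the lane is
// first extracted as a scalar and then broadcast. This is my assumption
// about what the pre-patch lines amount to.
template<int LANE>
inline void broadcast_twiddle_scalar (const float tw[2], float out[4])
{
    static_assert (LANE == 0 || LANE == 1, "a float32x2_t has two lanes");
#if SKETCH_HAVE_NEON
    float32x2_t d2_tmp = vld1_f32 (tw);
    // A scalar temporary exists here; a compiler may or may not keep
    // it inside the NEON register file.
    float scalar = vget_lane_f32 (d2_tmp, LANE);
    vst1q_f32 (out, vdupq_n_f32 (scalar));
#else
    for (int i = 0; i < 4; ++i) out[i] = tw[LANE];
#endif
}
```

Both helpers produce identical output, so any difference would show up only in the instructions the compiler emits for each form.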