[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
Timothy B. Terriberry
tterribe at xiph.org
Sat Dec 19 19:07:04 PST 2015
Jonathan Lennox wrote:
> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32)
> +{
> +    int32x4_t coef0 = vld1q_s32(coef32);
> +    int32x4_t coef1 = vld1q_s32(coef32 + 4);
> +    int32x4_t coef2 = vld1q_s32(coef32 + 8);
> +    int32x4_t coef3 = vld1q_s32(coef32 + 12);
> +
> +    int32x4_t a0 = vld1q_s32(buf32 - 15);
> +    int32x4_t a1 = vld1q_s32(buf32 - 11);
> +    int32x4_t a2 = vld1q_s32(buf32 - 7);
> +    int32x4_t a3 = vld1q_s32(buf32 - 3);
> +
> +    int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
> +    int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
> +    int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
> +    int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1), vget_high_s32(coef1));
> +    int64x2_t b4 = vmlal_s32(b3, vget_low_s32(a2), vget_low_s32(coef2));
> +    int64x2_t b5 = vmlal_s32(b4, vget_high_s32(a2), vget_high_s32(coef2));
> +    int64x2_t b6 = vmlal_s32(b5, vget_low_s32(a3), vget_low_s32(coef3));
> +    int64x2_t b7 = vmlal_s32(b6, vget_high_s32(a3), vget_high_s32(coef3));
> +
> +    int64x1_t c = vadd_s64(vget_low_s64(b7), vget_high_s64(b7));
> +    int64x1_t cS = vshr_n_s64(c, 16);
> +    int32x2_t d = vreinterpret_s32_s64(cS);
> +    opus_int32 out = vget_lane_s32(d, 0);
> +    return out;
> +}
So, this is not bit-exact, in a portion of the code where I am personally
wary of the problems that might cause, since (like most speech codecs)
we can use slightly unstable filters. If there were a big speed advantage
it might be worth the testing needed to make sure nothing diverges
significantly here (and it's _probably_ fine), but I think you can
actually do this faster while remaining bit-exact.
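To be concrete about where the mismatch comes from: the C code shifts each
product down by 16 bits as it accumulates, while the version above sums the
full 64-bit products and shifts once at the end, so the low-order bits can
differ. A toy two-term example (the values are invented, not taken from the
codec):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Two made-up 64-bit products from a multiply-accumulate chain. */
    int64_t p0 = 0x18000, p1 = 0x18000;
    /* Shift each term before adding, as the SMLAWB-based C code does: 1 + 1 = 2. */
    int32_t per_term = (int32_t)(p0 >> 16) + (int32_t)(p1 >> 16);
    /* Add the full products and shift once, as the NEON code above does: 3. */
    int32_t shift_last = (int32_t)((p0 + p1) >> 16);
    printf("%d vs %d\n", per_term, shift_last);
    return 0;
}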
If you shift up the contents of coef32 by 15 bits (which you can do,
since you are already transforming them specially for this platform),
you can use vqdmulhq_s32() to emulate SMULWB. You then have to do the
addition in a separate instruction, but because you can keep all of the
results in 32-bit, you get double the parallelism and only need half as
many multiplies (which have much higher latency than addition). Overall
it should be faster, and match the C code exactly.
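Untested, but roughly what I have in mind (this assumes coef32 now holds the
reversed coefficients already shifted up by 15 bits, so that vqdmulhq_s32(),
which computes (2*a*b)>>32, gives the same truncated (a*b)>>16 as SMULWB for
every term; the names otherwise mirror your patch):

#include <arm_neon.h>
/* opus_int32 comes from the SILK headers, as in your version. */

opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32)
{
    int32x4_t coef0 = vld1q_s32(coef32);
    int32x4_t coef1 = vld1q_s32(coef32 + 4);
    int32x4_t coef2 = vld1q_s32(coef32 + 8);
    int32x4_t coef3 = vld1q_s32(coef32 + 12);

    int32x4_t a0 = vld1q_s32(buf32 - 15);
    int32x4_t a1 = vld1q_s32(buf32 - 11);
    int32x4_t a2 = vld1q_s32(buf32 - 7);
    int32x4_t a3 = vld1q_s32(buf32 - 3);

    /* Four 4-wide multiplies instead of eight 2-wide multiply-accumulates.
       The saturation in vqdmulhq_s32() can only trigger for
       INT32_MIN*INT32_MIN, which a 16-bit coefficient shifted up by 15 bits
       can never produce. */
    int32x4_t b0 = vqdmulhq_s32(a0, coef0);
    int32x4_t b1 = vqdmulhq_s32(a1, coef1);
    int32x4_t b2 = vqdmulhq_s32(a2, coef2);
    int32x4_t b3 = vqdmulhq_s32(a3, coef3);

    /* The additions that vmlal_s32() used to do, now done separately and
       kept in 32 bits, like the C accumulation. */
    int32x4_t c0 = vaddq_s32(b0, b1);
    int32x4_t c1 = vaddq_s32(b2, b3);
    int32x4_t d  = vaddq_s32(c0, c1);

    /* Horizontal sum of the four lanes. */
    int32x2_t e = vadd_s32(vget_low_s32(d), vget_high_s32(d));
    int32x2_t f = vpadd_s32(e, e);

    return vget_lane_s32(f, 0);
}

The extra shift on the coefficients can happen in the same place where you
already reverse them, so it costs nothing in the inner loop.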
> +#define optional_coef_reversal(out, in, order) do { if (arch == 3) { optional_coef_reversal_neon(out, in, order); } } while (0)
> +
> +#endif
> +
> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32);
> +
> +#if OPUS_ARM_PRESUME_NEON_INTR
> +#undef silk_noise_shape_quantizer_short_prediction
> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) ((void)arch,silk_noise_shape_quantizer_short_prediction_neon(in, coefRev))
> +
> +#elif OPUS_HAVE_RTCD
> +
> +/* silk_noise_shape_quantizer_short_prediction implementations take different parameters based on arch
> + (coef vs. coefRev) so can't use the usual IMPL table implementation */
> +#undef silk_noise_shape_quantizer_short_prediction
> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) (arch == 3 ? silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : silk_noise_shape_quantizer_short_prediction_c(in, coef, order))
I'm also not wild about these hard-coded 3's. Right now the knowledge of
which arch maps to which number is confined to arm_celt_map.c, which does
not use the indices directly (it only sorts its table entries by them), so
we never got named constants for them. But if we ever have to re-organize
which arch configurations we support, these numbers might change, and
random 3's scattered across the codebase are going to be hard to track
down and update.
(also, I realize libopus doesn't have a line-length restriction, but a
few newlines in here might be a mercy to those of us who work in
80-column terminals)
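Something along these lines (the constant name is made up; call it whatever
you like, as long as it is defined next to wherever the arch indices are
actually assigned) would take care of both complaints:

/* Hypothetical name for the NEON arch index; define it alongside the
   table in arm_celt_map.c so there is a single place to update. */
#define SILK_ARM_ARCH_NEON 3

#undef silk_noise_shape_quantizer_short_prediction
#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) \
    ((arch) == SILK_ARM_ARCH_NEON ? \
     silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : \
     silk_noise_shape_quantizer_short_prediction_c(in, coef, order))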