[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

Mon Dec 21 07:37:54 PST 2015

> On Dec 19, 2015, at 10:07 PM, Timothy B. Terriberry <tterribe at xiph.org> wrote:
> 
> Jonathan Lennox wrote:
>> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32)
>> +{
>> +    int32x4_t coef0 = vld1q_s32(coef32);
>> +    int32x4_t coef1 = vld1q_s32(coef32 + 4);
>> +    int32x4_t coef2 = vld1q_s32(coef32 + 8);
>> +    int32x4_t coef3 = vld1q_s32(coef32 + 12);
>> +
>> +    int32x4_t a0 = vld1q_s32(buf32 - 15);
>> +    int32x4_t a1 = vld1q_s32(buf32 - 11);
>> +    int32x4_t a2 = vld1q_s32(buf32 - 7);
>> +    int32x4_t a3 = vld1q_s32(buf32 - 3);
>> +
>> +    int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
>> +    int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
>> +    int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
>> +    int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1), vget_high_s32(coef1));
>> +    int64x2_t b4 = vmlal_s32(b3, vget_low_s32(a2), vget_low_s32(coef2));
>> +    int64x2_t b5 = vmlal_s32(b4, vget_high_s32(a2), vget_high_s32(coef2));
>> +    int64x2_t b6 = vmlal_s32(b5, vget_low_s32(a3), vget_low_s32(coef3));
>> +    int64x2_t b7 = vmlal_s32(b6, vget_high_s32(a3), vget_high_s32(coef3));
>> +
>> +    int64x1_t c = vadd_s64(vget_low_s64(b7), vget_high_s64(b7));
>> +    int64x1_t cS = vshr_n_s64(c, 16);
>> +    int32x2_t d = vreinterpret_s32_s64(cS);
>> +    opus_int32 out = vget_lane_s32(d, 0);
>> +    return out;
>> +}
> 
> So, this is not bit-exact in a portion of the code where I am personally 
> wary of the problems that might cause, since (like most speech codecs) 
> we can use slightly unstable filters. If there was a big speed advantage 
> it might be worth the testing to make sure nothing diverges here 
> significantly (and it's _probably_ fine), but I think you can actually 
> do this faster while remaining bitexact.
> 
> If you shift up the contents of coef32 by 15 bits (which you can do, 
> since you are already transforming them specially for this platform), 
> you can use vqdmulhq_s32() to emulate SMULWB. You then have to do the 
> addition in a separate instruction, but because you can keep all of the 
> results in 32-bit, you get double the parallelism and only need half as 
> many multiplies (which have much higher latency than addition). Overall 
> it should be faster, and match the C code exactly.

Okay — I’ll consult with the guy who wrote the code originally, and try it out.

> 
>> +#define optional_coef_reversal(out, in, order) do { if (arch == 3) { optional_coef_reversal_neon(out, in, order); } } while (0)
>> +
>> +#endif
>> +
>> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32);
>> +
>> +#if OPUS_ARM_PRESUME_NEON_INTR
>> +#undef silk_noise_shape_quantizer_short_prediction
>> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch)  ((void)arch,silk_noise_shape_quantizer_short_prediction_neon(in, coefRev))
>> +
>> +#elif OPUS_HAVE_RTCD
>> +
>> +/* silk_noise_shape_quantizer_short_prediction implementations take different parameters based on arch
>> +   (coef vs. coefRev) so can't use the usual IMPL table implementation */
>> +#undef silk_noise_shape_quantizer_short_prediction
>> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch)  (arch == 3 ? silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : silk_noise_shape_quantizer_short_prediction_c(in, coef, order))
> 
> I'm also not wild about these hard-coded 3's. Right now what arch maps 
> to what number is confined to arm_celt_map.c, which does not use the 
> indices directly (only sorts its table entries by them). So we never got 
> named constants for them. But if we have to re-organize what arch 
> configurations we support, these might change. Random 3's scattered 
> across the codebase are going to be hard to track down and update.

I see your point — what do you suggest instead?   As the comment mentions, this can’t use the usual IMPL table implementation, because the parameters are different (due to the transformed coefficients.)

Would something like an OPUS_ARCH_ARM_NEON #define in celt/arm/armcpu.h be okay?  It’d be a bit confusing with OPUS_CPU_ARM_NEON in armcpu.c, but I could probably organize it to be sensible.

> (also, I realize libopus doesn't have a line-length restriction, but a 
> few newlines in here might be a mercy to those of us who work in 
> 80-column terminals)

Okay.