[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
Timothy B. Terriberry
tterribe at xiph.org
Sat Dec 19 19:07:04 PST 2015
Jonathan Lennox wrote:
> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32)
> +{
> +    int32x4_t coef0 = vld1q_s32(coef32);
> +    int32x4_t coef1 = vld1q_s32(coef32 + 4);
> +    int32x4_t coef2 = vld1q_s32(coef32 + 8);
> +    int32x4_t coef3 = vld1q_s32(coef32 + 12);
> +
> +    int32x4_t a0 = vld1q_s32(buf32 - 15);
> +    int32x4_t a1 = vld1q_s32(buf32 - 11);
> +    int32x4_t a2 = vld1q_s32(buf32 - 7);
> +    int32x4_t a3 = vld1q_s32(buf32 - 3);
> +
> +    int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
> +    int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
> +    int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
> +    int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1), vget_high_s32(coef1));
> +    int64x2_t b4 = vmlal_s32(b3, vget_low_s32(a2), vget_low_s32(coef2));
> +    int64x2_t b5 = vmlal_s32(b4, vget_high_s32(a2), vget_high_s32(coef2));
> +    int64x2_t b6 = vmlal_s32(b5, vget_low_s32(a3), vget_low_s32(coef3));
> +    int64x2_t b7 = vmlal_s32(b6, vget_high_s32(a3), vget_high_s32(coef3));
> +
> +    int64x1_t c = vadd_s64(vget_low_s64(b7), vget_high_s64(b7));
> +    int64x1_t cS = vshr_n_s64(c, 16);
> +    int32x2_t d = vreinterpret_s32_s64(cS);
> +    opus_int32 out = vget_lane_s32(d, 0);
> +    return out;
> +}
So, this is not bit-exact, in a portion of the code where I am personally
wary of the problems that might cause, since (like most speech codecs)
we can use slightly unstable filters. If there were a big speed advantage
it might be worth the testing needed to make sure nothing diverges
significantly here (and it's _probably_ fine), but I think you can
actually do this faster while remaining bit-exact.
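To be concrete about where the mismatch comes from: the C code shifts each
product down by 16 bits as it accumulates, while the version above sums the
full 64-bit products and shifts once at the end, so the low-order bits can
differ. A toy two-term example (the values are invented, not taken from the
codec):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Two made-up 64-bit products from a multiply-accumulate chain. */
    int64_t p0 = 0x18000, p1 = 0x18000;
    /* Shift each term before adding, as the SMLAWB-based C code does: 1 + 1 = 2. */
    int32_t per_term = (int32_t)(p0 >> 16) + (int32_t)(p1 >> 16);
    /* Add the full products and shift once, as the NEON code above does: 3. */
    int32_t shift_last = (int32_t)((p0 + p1) >> 16);
    printf("%d vs %d\n", per_term, shift_last);
    return 0;
}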
If you shift up the contents of coef32 by 15 bits (which you can do,
since you are already transforming them specially for this platform),
you can use vqdmulhq_s32() to emulate SMULWB. You then have to do the
addition in a separate instruction, but because you can keep all of the
results in 32-bit, you get double the parallelism and only need half as
many multiplies (which have much higher latency than addition). Overall
it should be faster, and match the C code exactly.
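Untested, but roughly what I have in mind (this assumes coef32 now holds the
reversed coefficients already shifted up by 15 bits, so that vqdmulhq_s32(),
which computes (2*a*b)>>32, gives the same truncated (a*b)>>16 as SMULWB for
every term; the names otherwise mirror your patch):

#include <arm_neon.h>
/* opus_int32 comes from the SILK headers, as in your version. */

opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32)
{
    int32x4_t coef0 = vld1q_s32(coef32);
    int32x4_t coef1 = vld1q_s32(coef32 + 4);
    int32x4_t coef2 = vld1q_s32(coef32 + 8);
    int32x4_t coef3 = vld1q_s32(coef32 + 12);

    int32x4_t a0 = vld1q_s32(buf32 - 15);
    int32x4_t a1 = vld1q_s32(buf32 - 11);
    int32x4_t a2 = vld1q_s32(buf32 - 7);
    int32x4_t a3 = vld1q_s32(buf32 - 3);

    /* Four 4-wide multiplies instead of eight 2-wide multiply-accumulates.
       The saturation in vqdmulhq_s32() can only trigger for
       INT32_MIN*INT32_MIN, which a 16-bit coefficient shifted up by 15 bits
       can never produce. */
    int32x4_t b0 = vqdmulhq_s32(a0, coef0);
    int32x4_t b1 = vqdmulhq_s32(a1, coef1);
    int32x4_t b2 = vqdmulhq_s32(a2, coef2);
    int32x4_t b3 = vqdmulhq_s32(a3, coef3);

    /* The additions that vmlal_s32() used to do, now done separately and
       kept in 32 bits, like the C accumulation. */
    int32x4_t c0 = vaddq_s32(b0, b1);
    int32x4_t c1 = vaddq_s32(b2, b3);
    int32x4_t d  = vaddq_s32(c0, c1);

    /* Horizontal sum of the four lanes. */
    int32x2_t e = vadd_s32(vget_low_s32(d), vget_high_s32(d));
    int32x2_t f = vpadd_s32(e, e);

    return vget_lane_s32(f, 0);
}

The extra shift on the coefficients can happen in the same place where you
already reverse them, so it costs nothing in the inner loop.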
> +#define optional_coef_reversal(out, in, order) do { if (arch == 3) { optional_coef_reversal_neon(out, in, order); } } while (0)
> +
> +#endif
> +
> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32);
> +
> +#if OPUS_ARM_PRESUME_NEON_INTR
> +#undef silk_noise_shape_quantizer_short_prediction
> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) ((void)arch,silk_noise_shape_quantizer_short_prediction_neon(in, coefRev))
> +
> +#elif OPUS_HAVE_RTCD
> +
> +/* silk_noise_shape_quantizer_short_prediction implementations take different parameters based on arch
> + (coef vs. coefRev) so can't use the usual IMPL table implementation */
> +#undef silk_noise_shape_quantizer_short_prediction
> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) (arch == 3 ? silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : silk_noise_shape_quantizer_short_prediction_c(in, coef, order))
I'm also not wild about these hard-coded 3's. Right now the knowledge of
which arch maps to which number is confined to arm_celt_map.c, which does
not use the indices directly (it only sorts its table entries by them), so
we never got named constants for them. But if we ever have to re-organize
which arch configurations we support, these numbers might change, and
random 3's scattered across the codebase are going to be hard to track
down and update.
(also, I realize libopus doesn't have a line-length restriction, but a
few newlines in here might be a mercy to those of us who work in
80-column terminals)
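Something along these lines (the constant name is made up; call it whatever
you like, as long as it is defined next to wherever the arch indices are
actually assigned) would take care of both complaints:

/* Hypothetical name for the NEON arch index; define it alongside the
   table in arm_celt_map.c so there is a single place to update. */
#define SILK_ARM_ARCH_NEON 3

#undef silk_noise_shape_quantizer_short_prediction
#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) \
    ((arch) == SILK_ARM_ARCH_NEON ? \
     silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : \
     silk_noise_shape_quantizer_short_prediction_c(in, coef, order))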