[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
Jonathan Lennox
jonathan at vidyo.com
Mon Nov 23 10:11:49 PST 2015
On Nov 23, 2015, at 12:04 PM, John Ridges <jridges at masque.com<mailto:jridges at masque.com>> wrote:
Hi Jonathan.
I really, really hate to bring this up this late in the game, but I just noticed that your NEON code doesn't use any of the "high" intrinsics for ARM64, e.g. instead of:
int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
you could use:
int32x4_t coef1 = vmovl_high_s16(coef16);
and instead of:
int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
you could use:
int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);
and instead of:
int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));
int64x1_t cS = vshr_n_s64(c, 16);
int32x2_t d = vreinterpret_s32_s64(cS);
out = vget_lane_s32(d, 0);
you could use:
out = (opus_int32)(vaddvq_s64(b3) >> 16);
I understand that ARM added these intrinsics because "vget_high_xxx" generates an instruction in ARM64, and isn't just free the way it was in ARMv7 ("vget_low_xxx" is of course still free on both platforms).
Other than the one-intrinsic optimizations, I’d rather keep the Neon intrinsics code compilable on ARMv7 as well as ARM64 — the Neon code is a performance boost for both platforms, and I’d rather not litter it with #ifdef’s unless there’s a large difference between the platforms.
It looks like Clang (the version in Xcode 7.1.1, at least) is smart enough to optimize the first two operations you mention, figuring out sshll2 and smlal2 properly, though the third causes a gratuitous extra “ext.16b” to be generated. I’ve filed a missed-optimization bug on Clang for the latter.
Here’s the code it generates:
_silk_NSQ_noise_shape_feedback_loop_neon:
000000000000004c ldr w9, [x0]
0000000000000050 cmp w3, #8
0000000000000054 b.ne 0x9c
0000000000000058 dup.4s v0, w9
000000000000005c ldr q1, [x1]
0000000000000060 ext.16b v0, v0, v1, #12
0000000000000064 ldur q1, [x1, #12]
0000000000000068 ldr q2, [x2]
000000000000006c sshll.4s v3, v2, #0
0000000000000070 sshll2.4s v2, v2, #0
0000000000000074 smull.2d v4, v0, v3
0000000000000078 smlal2.2d v4, v0, v3
000000000000007c smlal.2d v4, v1, v2
0000000000000080 smlal2.2d v4, v1, v2
0000000000000084 ext.16b v2, v4, v4, #8
0000000000000088 add d2, d4, d2
000000000000008c sshr d2, d2, #16
0000000000000090 fmov w0, s2
0000000000000094 stp q0, q1, [x1]
0000000000000098 ret
(Non-vectorized code for non-order-8 omitted.)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.xiph.org/pipermail/opus/attachments/20151123/77f930ee/attachment.htm
More information about the opus
mailing list