[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
John Ridges
jridges at masque.com
Mon Nov 23 11:49:32 PST 2015
It's good to know that the compiler is dealing with this properly. I
withdraw my suggestion.
WebRTC dealt with this issue using conditional compilation on whether
the "__aarch64__" macro was defined, but that was more than a year ago,
with older compilers.
This all would have been much simpler if ARM had just supplied ARMv7
macros for the "high" intrinsics in their arm_neon.h header.
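Just to sketch what I mean (this isn't anything in the patch, and the
wrapper names simply shadow the ACLE spellings), a handful of macros
like these would let the same source use the "high" forms on both
architectures, since "vget_high_xxx" costs nothing on ARMv7:

    #include <arm_neon.h>

    #ifndef __aarch64__
    /* ARMv7: emulate the AArch64 "high" intrinsics with the free
     * vget_high_xxx splits; on AArch64 the real intrinsics are used. */
    #define vmovl_high_s16(a)         vmovl_s16(vget_high_s16(a))
    #define vmlal_high_s32(acc, a, b) \
        vmlal_s32((acc), vget_high_s32(a), vget_high_s32(b))
    #define vaddvq_s64(a) \
        vget_lane_s64(vadd_s64(vget_low_s64(a), vget_high_s64(a)), 0)
    #endif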
On 11/23/2015 11:11 AM, Jonathan Lennox wrote:
>
>> On Nov 23, 2015, at 12:04 PM, John Ridges <jridges at masque.com> wrote:
>>
>> Hi Jonathan.
>>
>> I really, really hate to bring this up this late in the game, but I
>> just noticed that your NEON code doesn't use any of the "high"
>> intrinsics for ARM64, e.g. instead of:
>>
>> int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
>>
>> you could use:
>>
>> int32x4_t coef1 = vmovl_high_s16(coef16);
>>
>> and instead of:
>>
>> int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
>>
>> you could use:
>>
>> int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);
>>
>> and instead of:
>>
>> int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));
>> int64x1_t cS = vshr_n_s64(c, 16);
>> int32x2_t d = vreinterpret_s32_s64(cS);
>> out = vget_lane_s32(d, 0);
>>
>> you could use:
>>
>> out = (opus_int32)(vaddvq_s64(b3) >> 16);
>>
>> I understand that ARM added these intrinsics because "vget_high_xxx"
>> generates an instruction in ARM64, and isn't just free the way it was
>> in ARMv7 ("vget_low_xxx" is of course still free on both platforms).
>
> Other than the one-intrinsic optimizations, I’d rather keep the Neon
> intrinsics code compilable on ARMv7 as well as ARM64 — the Neon code
> is a performance boost for both platforms, and I’d rather not litter
> it with #ifdef’s unless there’s a large difference between the platforms.
>
> It looks like Clang (the version in Xcode 7.1.1, at least) is smart
> enough to optimize the first two operations you mention, figuring out
> sshll2 and smlal2 properly, though the third causes a gratuitous extra
> “ext.16b” to be generated. I’ve filed a missed-optimization bug on
> Clang for the latter.
>
> Here’s the code it generates:
>
> _silk_NSQ_noise_shape_feedback_loop_neon:
> 000000000000004c ldr w9, [x0]
> 0000000000000050 cmp w3, #8
> 0000000000000054 b.ne 0x9c
> 0000000000000058 dup.4s v0, w9
> 000000000000005c ldr q1, [x1]
> 0000000000000060 ext.16b v0, v0, v1, #12
> 0000000000000064 ldur q1, [x1, #12]
> 0000000000000068 ldr q2, [x2]
> 000000000000006c sshll.4s v3, v2, #0
> 0000000000000070 sshll2.4s v2, v2, #0
> 0000000000000074 smull.2d v4, v0, v3
> 0000000000000078 smlal2.2d v4, v0, v3
> 000000000000007c smlal.2d v4, v1, v2
> 0000000000000080 smlal2.2d v4, v1, v2
> 0000000000000084 ext.16b v2, v4, v4, #8
> 0000000000000088 add d2, d4, d2
> 000000000000008c sshr d2, d2, #16
> 0000000000000090 fmov w0, s2
> 0000000000000094 stp q0, q1, [x1]
> 0000000000000098 ret
>
> (Non-vectorized code for non-order-8 omitted.)