[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.

John Ridges jridges at masque.com
Mon Nov 23 11:49:32 PST 2015


It's good to know that the compiler is dealing with this properly. I 
withdraw my suggestion.

WebRTC dealt with this issue using conditional compilation on whether 
the "__aarch64__" macro was defined, but that was more than a year ago, 
with older compilers.

This all would have been much simpler if ARM had just supplied ARMv7 
macros for the "high" intrinsics in their arm_neon.h header.
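Something like the sketch below (untested, covering only the two "high" 
intrinsics quoted further down, and assuming arm_neon.h is already 
included) would have let the same source compile for both targets:

#if !defined(__aarch64__)
/* ARMv7 fallbacks: compose the AArch64 "high" forms from the plain
   widening intrinsics plus vget_high, which is free on ARMv7 anyway. */
#define vmovl_high_s16(a)          vmovl_s16(vget_high_s16(a))
#define vmlal_high_s32(acc, a, b)  vmlal_s32((acc), vget_high_s32(a), vget_high_s32(b))
#endif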


On 11/23/2015 11:11 AM, Jonathan Lennox wrote:
>
>> On Nov 23, 2015, at 12:04 PM, John Ridges <jridges at masque.com> wrote:
>>
>> Hi Jonathan.
>>
>> I really, really hate to bring this up this late in the game, but I 
>> just noticed that your NEON code doesn't use any of the "high" 
>> intrinsics for ARM64, e.g. instead of:
>>
>> int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));
>>
>> you could use:
>>
>> int32x4_t coef1 = vmovl_high_s16(coef16);
>>
>> and instead of:
>>
>> int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
>>
>> you could use:
>>
>> int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);
>>
>> and instead of:
>>
>> int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));
>> int64x1_t cS = vshr_n_s64(c, 16);
>> int32x2_t d = vreinterpret_s32_s64(cS);
>> out = vget_lane_s32(d, 0);
>>
>> you could use:
>>
>> out = (opus_int32)(vaddvq_s64(b3) >> 16);
>>
>> I understand that ARM added these intrinsics because "vget_high_xxx" 
>> generates an instruction in ARM64, and isn't just free the way it was 
>> in ARMv7 ("vget_low_xxx" is of course still free on both platforms).
>
> Other than the one-intrinsic optimizations, I’d rather keep the Neon 
> intrinsics code compilable on ARMv7 as well as ARM64 — the Neon code 
> is a performance boost for both platforms, and I’d rather not litter 
> it with #ifdef’s unless there’s a large difference between the platforms.
>
> It looks like Clang (the version in Xcode 7.1.1, at least) is smart 
> enough to optimize the first two operations you mention, figuring out 
> sshll2 and smlal2 properly, though the third causes a gratuitous extra 
> “ext.16b” to be generated.  I’ve filed a missed-optimization bug on 
> Clang for the latter.
>
> Here’s the code it generates:
>
> _silk_NSQ_noise_shape_feedback_loop_neon:
> 000000000000004c        ldr      w9, [x0]
> 0000000000000050        cmp      w3, #8
> 0000000000000054        b.ne    0x9c
> 0000000000000058        dup.4s  v0, w9
> 000000000000005c        ldr      q1, [x1]
> 0000000000000060        ext.16b v0, v0, v1, #12
> 0000000000000064        ldur    q1, [x1, #12]
> 0000000000000068        ldr      q2, [x2]
> 000000000000006c        sshll.4s        v3, v2, #0
> 0000000000000070        sshll2.4s       v2, v2, #0
> 0000000000000074        smull.2d        v4, v0, v3
> 0000000000000078        smlal2.2d       v4, v0, v3
> 000000000000007c        smlal.2d        v4, v1, v2
> 0000000000000080        smlal2.2d       v4, v1, v2
> 0000000000000084        ext.16b v2, v4, v4, #8
> 0000000000000088        add     d2, d4, d2
> 000000000000008c        sshr    d2, d2, #16
> 0000000000000090        fmov    w0, s2
> 0000000000000094        stp      q0, q1, [x1]
> 0000000000000098        ret
>
> (Non-vectorized code for non-order-8 omitted.)
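For the archives, here are the two variants of that third reduction side 
by side, wrapped so the same call site compiles on both platforms (an 
untested sketch; the helper name is just for illustration, and it assumes 
arm_neon.h and the Opus integer types are in scope):

static inline opus_int32 sum_and_shift_sketch(int64x2_t b3)
{
#if defined(__aarch64__)
    /* AArch64: across-vector add, then shift. */
    return (opus_int32)(vaddvq_s64(b3) >> 16);
#else
    /* ARMv7: fold the two 64-bit lanes by hand, then shift. */
    int64x1_t c  = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));
    int64x1_t cS = vshr_n_s64(c, 16);
    return vget_lane_s32(vreinterpret_s32_s64(cS), 0);
#endif
}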
