[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Thu Apr 13 01:17:27 UTC 2017

Attached a new patch, which fixes a compiling error.

Thanks,
Linfeng

On Tue, Apr 11, 2017 at 4:07 PM, Linfeng Zhang <linfengz at google.com> wrote:

> Hi Jean-Marc,
>
> Thanks for your suggestions!
>
> I attached the new patch, with inlined reply below.
>
> Thanks,
> Linfeng
>
> On Thu, Apr 6, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
> wrote:
>
>> I did some profiling on a Cortex A57 and I've been seeing slightly less
>> improvement than you're reporting, more like 3.5% at complexity 8. It
>> appears that the warped autocorrelation function itself is only faster
>> by a factor of about 1.35. That's a bit surprising considering I see
>> nothing obviously wrong with the code.
>>
>
> Speed test the new patch, and got about 7.8% whole encoder speed gain with
> complexity 8 on my Acre Chromebook.
> Here is my configure:
> ./configure --build x86_64-unknown-linux-gnu --host arm-linux-gnueabihf
> --disable-assertions --enable-fixed-point --enable-intrinsics CFLAGS=-O3
> --disable-shared
>
> The testing speech file may also change the speed results.
>
>
>> 1) In calc_state(), rather than splitting the multiply in two
>> instructions, you may be able to simply shift the warping left 16 bits,
>> then use the Neon instruction that does a*b>>32 (i.e. the one that
>> computes the top bits of a 32x32 multiply)
>>
>
> Done.
>
>
>> 2) If the problem is with the movs at the end of each iteration, then
>> you should be able to get rid of them by unrolling by a factor of two.
>>
>
> We did this previously and get some gains, but the code size is much
> bigger. So we abandoned. Tested again on the new code and got no speed
> gains.
>
>
>> 3) It seems likely that you have significant register spill going on due
>> to processing up to 24 "taps" at the same time. If that's causing a
>> slowdown, then it should be possible to do the processing in "sections".
>> By that, I mean that you can implement (e.g.) an order-8 "kernel" that
>> computes the correlations and also outputs the last element of
>> state_QS_s32x4[0][0] back to input_QS[], so that it can be used to
>> compute a new secion.
>>
>
> Done. The speed is almost identical (slightly slower), however the extra
> bonus is code size saving.
>
> 4) It's a minor detail, but the last element of corr_QC[] that's not
>> currently vectorized could simply be vectorized independently outside
>> the loop (and it's the same for all orders).
>>
>
> Done.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170412/c3ac1cd0/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch
Type: text/x-patch
Size: 29705 bytes
Desc: not available
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170412/c3ac1cd0/attachment-0001.bin>