[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Linfeng Zhang linfengz at google.com
Tue Apr 11 23:07:59 UTC 2017


Hi Jean-Marc,

Thanks for your suggestions!

I attached the new patch, with inlined reply below.

Thanks,
Linfeng

On Thu, Apr 6, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:

> I did some profiling on a Cortex A57 and I've been seeing slightly less
> improvement than you're reporting, more like 3.5% at complexity 8. It
> appears that the warped autocorrelation function itself is only faster
> by a factor of about 1.35. That's a bit surprising considering I see
> nothing obviously wrong with the code.
>

Speed test the new patch, and got about 7.8% whole encoder speed gain with
complexity 8 on my Acre Chromebook.
Here is my configure:
./configure --build x86_64-unknown-linux-gnu --host arm-linux-gnueabihf
--disable-assertions --enable-fixed-point --enable-intrinsics CFLAGS=-O3
--disable-shared

The testing speech file may also change the speed results.


> 1) In calc_state(), rather than splitting the multiply in two
> instructions, you may be able to simply shift the warping left 16 bits,
> then use the Neon instruction that does a*b>>32 (i.e. the one that
> computes the top bits of a 32x32 multiply)
>

Done.


> 2) If the problem is with the movs at the end of each iteration, then
> you should be able to get rid of them by unrolling by a factor of two.
>

We did this previously and get some gains, but the code size is much
bigger. So we abandoned. Tested again on the new code and got no speed
gains.


> 3) It seems likely that you have significant register spill going on due
> to processing up to 24 "taps" at the same time. If that's causing a
> slowdown, then it should be possible to do the processing in "sections".
> By that, I mean that you can implement (e.g.) an order-8 "kernel" that
> computes the correlations and also outputs the last element of
> state_QS_s32x4[0][0] back to input_QS[], so that it can be used to
> compute a new secion.
>

Done. The speed is almost identical (slightly slower), however the extra
bonus is code size saving.

4) It's a minor detail, but the last element of corr_QC[] that's not
> currently vectorized could simply be vectorized independently outside
> the loop (and it's the same for all orders).
>

Done.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170411/300b590e/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch
Type: text/x-patch
Size: 29664 bytes
Desc: not available
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170411/300b590e/attachment-0001.bin>


More information about the opus mailing list