[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Linfeng Zhang linfengz at google.com
Tue Apr 11 23:07:59 UTC 2017

Hi Jean-Marc,

Thanks for your suggestions!

I've attached the new patch, with inline replies below.


On Thu, Apr 6, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:

> I did some profiling on a Cortex A57 and I've been seeing slightly less
> improvement than you're reporting, more like 3.5% at complexity 8. It
> appears that the warped autocorrelation function itself is only faster
> by a factor of about 1.35. That's a bit surprising considering I see
> nothing obviously wrong with the code.

I speed-tested the new patch and got about a 7.8% whole-encoder speed gain
at complexity 8 on my Acer Chromebook.
Here is my configure command:
./configure --build x86_64-unknown-linux-gnu --host arm-linux-gnueabihf
--disable-assertions --enable-fixed-point --enable-intrinsics CFLAGS=-O3

The speech file used for testing may also affect the speed results.

> 1) In calc_state(), rather than splitting the multiply in two
> instructions, you may be able to simply shift the warping left 16 bits,
> then use the Neon instruction that does a*b>>32 (i.e. the one that
> computes the top bits of a 32x32 multiply)
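
For reference, the arithmetic behind this suggestion can be checked in plain
scalar C. This is only a sketch, not the NEON code in the patch: the helper
names are hypothetical, and it assumes the coefficient fits in 16 bits and
that >> on a negative value is an arithmetic shift (true on the usual
compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Split form: Q16 multiply, (b * w) >> 16, with a 16-bit coefficient. */
static int32_t mul_q16_split(int32_t b, int16_t w_Q16) {
    return (int32_t)(((int64_t)b * w_Q16) >> 16);
}

/* Pre-shift the coefficient left by 16 bits, once, outside the loop.
 * Written as a multiply because left-shifting a negative int is undefined. */
static int32_t warping_to_q32(int16_t w_Q16) {
    return (int32_t)w_Q16 * 65536;
}

/* Top 32 bits of the 32x32 product, i.e. (a * b) >> 32. */
static int32_t mul_high32(int32_t b, int32_t w_Q32) {
    return (int32_t)(((int64_t)b * w_Q32) >> 32);
}

/* Usage: both forms agree, including for negative operands. */
static void check_equivalence(void) {
    const int32_t samples[] = { 0, 1, -1, 65536, -65537, 123456789, -123456789 };
    const int16_t coeffs[]  = { 0, 1, -1, 15000, -15000, 32767, -32768 };
    for (int i = 0; i < 7; i++)
        for (int j = 0; j < 7; j++)
            assert(mul_high32(samples[i], warping_to_q32(coeffs[j]))
                   == mul_q16_split(samples[i], coeffs[j]));
}
```

One caveat: the NEON high-half multiplies I am aware of (e.g. vqdmulhq_s32)
double the product before taking the top half, so with such an instruction
the pre-shift would be 15 bits rather than 16. That mapping is not shown here.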


> 2) If the problem is with the movs at the end of each iteration, then
> you should be able to get rid of them by unrolling by a factor of two.

We tried this previously and got some gains, but the code size was much
bigger, so we abandoned it. Tested again on the new code and got no speed

> 3) It seems likely that you have significant register spill going on due
> to processing up to 24 "taps" at the same time. If that's causing a
> slowdown, then it should be possible to do the processing in "sections".
> By that, I mean that you can implement (e.g.) an order-8 "kernel" that
> computes the correlations and also outputs the last element of
> state_QS_s32x4[0][0] back to input_QS[], so that it can be used to
> compute a new section.

Done. The speed is almost identical (slightly slower); however, the bonus
is a code size saving.
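
To illustrate what "sections" means here, below is a simplified scalar sketch
in C. It is not the NEON code in the patch. All names and sizes (SECTION,
MAX_LEN, the helper functions) are hypothetical, and the QS/QC fixed-point
scaling of the real function is left out. Each section runs over the whole
frame, overwrites a per-sample buffer with the signal that feeds the next
section, and the last correlation term is accumulated in a separate pass.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ORDER 24  /* assumed bound, from the "24 taps" in the quoted text */
#define SECTION    8  /* assumed section size (the "order-8 kernel") */
#define MAX_LEN   64  /* hypothetical frame-length bound for this sketch */

/* a + ((b * w_Q16) >> 16), with the product formed in 64 bits. */
static int32_t mla_q16(int32_t a, int32_t b, int32_t w_Q16) {
    return a + (int32_t)(((int64_t)b * w_Q16) >> 16);
}

/* Reference: every tap is processed for each sample. */
static void warped_corr_direct(int64_t *corr, const int16_t *x,
                               int32_t w_Q16, int len, int order) {
    int32_t state[MAX_ORDER + 1] = { 0 };
    memset(corr, 0, (size_t)(order + 1) * sizeof *corr);
    for (int n = 0; n < len; n++) {
        int32_t in = x[n];
        for (int i = 0; i < order; i++) {
            int32_t out = mla_q16(state[i], state[i + 1] - in, w_Q16);
            state[i] = in;
            corr[i] += (int64_t)in * state[0];  /* state[0] == x[n] here */
            in = out;
        }
        state[order] = in;
        corr[order] += (int64_t)in * state[0];
    }
}

/* One section of `taps` allpass stages. On entry stream[n] feeds the first
 * stage; on exit it holds the signal that feeds the next section. */
static void section_kernel(int64_t *corr, int32_t *stream, const int16_t *x,
                           int32_t w_Q16, int len, int taps) {
    int32_t state[SECTION + 1] = { 0 };
    for (int n = 0; n < len; n++) {
        int32_t in = stream[n];
        for (int i = 0; i < taps; i++) {
            int32_t out = mla_q16(state[i], state[i + 1] - in, w_Q16);
            state[i] = in;
            corr[i] += (int64_t)in * x[n];
            in = out;
        }
        state[taps] = in;  /* previous output, needed by the last stage */
        stream[n] = in;    /* handed on to the next section */
    }
}

static void warped_corr_sectioned(int64_t *corr, const int16_t *x,
                                  int32_t w_Q16, int len, int order) {
    int32_t stream[MAX_LEN];
    assert(len <= MAX_LEN && order <= MAX_ORDER);
    memset(corr, 0, (size_t)(order + 1) * sizeof *corr);
    for (int n = 0; n < len; n++) stream[n] = x[n];
    for (int first = 0; first < order; first += SECTION) {
        int taps = order - first < SECTION ? order - first : SECTION;
        section_kernel(corr + first, stream, x, w_Q16, len, taps);
    }
    /* Last correlation term, computed on its own outside the tap loop. */
    for (int n = 0; n < len; n++) corr[order] += (int64_t)stream[n] * x[n];
}
```

The sectioned version needs only SECTION + 1 state values live at a time, at
the cost of an extra pass over the frame buffer per section.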

> 4) It's a minor detail, but the last element of corr_QC[] that's not
> currently vectorized could simply be vectorized independently outside
> the loop (and it's the same for all orders).

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch
Type: text/x-patch
Size: 29664 bytes
Desc: not available
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170411/300b590e/attachment-0001.bin>