[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Thu Apr 13 22:33:17 UTC 2017

Hi Linfeng,

Just pushed your patch to master. Turns out the performance improvement
is pretty good on the A53 (RP3). It just looks like the A57 managed to
do a really good job on the non-Neon code. Thanks for taking to time to
address all the issues.

Cheers,

	Jean-Marc

On 12/04/17 09:17 PM, Linfeng Zhang wrote:
> Attached a new patch, which fixes a compiling error.
> 
> Thanks,
> Linfeng
> 
> On Tue, Apr 11, 2017 at 4:07 PM, Linfeng Zhang <linfengz at google.com
> <mailto:linfengz at google.com>> wrote:
> 
>     Hi Jean-Marc,
> 
>     Thanks for your suggestions!
> 
>     I attached the new patch, with inlined reply below.
> 
>     Thanks,
>     Linfeng
> 
>     On Thu, Apr 6, 2017 at 12:55 PM, Jean-Marc Valin <jmvalin at jmvalin.ca
>     <mailto:jmvalin at jmvalin.ca>> wrote:
> 
>         I did some profiling on a Cortex A57 and I've been seeing
>         slightly less
>         improvement than you're reporting, more like 3.5% at complexity
>         8. It
>         appears that the warped autocorrelation function itself is only
>         faster
>         by a factor of about 1.35. That's a bit surprising considering I see
>         nothing obviously wrong with the code.
> 
> 
>     Speed test the new patch, and got about 7.8% whole encoder speed
>     gain with complexity 8 on my Acre Chromebook.
>     Here is my configure:
>     ./configure --build x86_64-unknown-linux-gnu --host
>     arm-linux-gnueabihf --disable-assertions --enable-fixed-point
>     --enable-intrinsics CFLAGS=-O3 --disable-shared
> 
>     The testing speech file may also change the speed results.
>      
> 
>         1) In calc_state(), rather than splitting the multiply in two
>         instructions, you may be able to simply shift the warping left
>         16 bits,
>         then use the Neon instruction that does a*b>>32 (i.e. the one that
>         computes the top bits of a 32x32 multiply)
> 
> 
>     Done.
>      
> 
>         2) If the problem is with the movs at the end of each iteration,
>         then
>         you should be able to get rid of them by unrolling by a factor
>         of two.
> 
> 
>     We did this previously and get some gains, but the code size is much
>     bigger. So we abandoned. Tested again on the new code and got no
>     speed gains.
>      
> 
>         3) It seems likely that you have significant register spill
>         going on due
>         to processing up to 24 "taps" at the same time. If that's causing a
>         slowdown, then it should be possible to do the processing in
>         "sections".
>         By that, I mean that you can implement (e.g.) an order-8
>         "kernel" that
>         computes the correlations and also outputs the last element of
>         state_QS_s32x4[0][0] back to input_QS[], so that it can be used to
>         compute a new secion.
> 
> 
>     Done. The speed is almost identical (slightly slower), however the
>     extra bonus is code size saving.
> 
>         4) It's a minor detail, but the last element of corr_QC[] that's not
>         currently vectorized could simply be vectorized independently
>         outside
>         the loop (and it's the same for all orders).
> 
> 
>     Done.
> 
>