[opus] 2 patches related to silk_biquad_alt() optimization

Wed Apr 26 05:31:49 UTC 2017

On 25/04/17 01:37 PM, Linfeng Zhang wrote:
>     Is that gain due to Neon or simply due to computing two channels in
>     parallel? For example, if you make a special case in the C code to
>     handle both channels in the same loop, what kind of performance do
>     you get?
> 
> 
> Tested Complexity 8, it's half half, i.e., 0.8% faster if handling both
> channels in the same loop in C, and then additional 0.8% faster using NEON.

Considering that the function isn't huge, I'm OK in principle with adding
some Neon to gain 0.8%. It would just be good to check that the 0.8%
indeed comes from Neon as opposed to just unrolling the channels.
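
To make the comparison concrete, here is a minimal C sketch of the two loop
structures being compared: one strided pass per channel versus both channels
in the same loop. The filter is a hypothetical, simplified biquad and not the
real silk_biquad_alt(); the names biquad_state, biquad_step, biquad_strided,
biquad_stereo and stereo_matches_two_passes are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified biquad (transposed direct form II, Q28
 * coefficients). NOT the real silk_biquad_alt(); it only illustrates the
 * loop structure. Assumes >> on negative values is an arithmetic shift. */
typedef struct { int64_t s0, s1; } biquad_state;

static int32_t biquad_step(biquad_state *st, const int32_t b_Q28[3],
                           const int32_t a_Q28[2], int32_t x)
{
    int64_t y = (st->s0 + (int64_t)b_Q28[0] * x) >> 28;
    st->s0 = st->s1 + (int64_t)b_Q28[1] * x - (int64_t)a_Q28[0] * y;
    st->s1 = (int64_t)b_Q28[2] * x - (int64_t)a_Q28[1] * y;
    return (int32_t)y;
}

/* One channel per call; interleaved data is walked with a stride. */
static void biquad_strided(biquad_state *st, const int32_t b_Q28[3],
                           const int32_t a_Q28[2], const int16_t *in,
                           int32_t *out, int len, int stride)
{
    for (int k = 0; k < len; k++)
        out[k * stride] = biquad_step(st, b_Q28, a_Q28, in[k * stride]);
}

/* Both channels in the same loop: the plain C special case to benchmark
 * before attributing any gain to Neon. */
static void biquad_stereo(biquad_state st[2], const int32_t b_Q28[3],
                          const int32_t a_Q28[2], const int16_t *in,
                          int32_t *out, int len)
{
    for (int k = 0; k < len; k++) {
        out[2 * k]     = biquad_step(&st[0], b_Q28, a_Q28, in[2 * k]);
        out[2 * k + 1] = biquad_step(&st[1], b_Q28, a_Q28, in[2 * k + 1]);
    }
}

/* Returns 1 if the stereo loop reproduces the two strided passes exactly. */
static int stereo_matches_two_passes(void)
{
    enum { LEN = 256 };
    static const int32_t b[3] = { 1 << 26, 1 << 27, 1 << 26 };
    static const int32_t a[2] = { -(1 << 27), 1 << 26 };
    int16_t in[2 * LEN];
    int32_t out_a[2 * LEN], out_b[2 * LEN];
    biquad_state sa[2] = { { 0, 0 }, { 0, 0 } };
    biquad_state sb[2] = { { 0, 0 }, { 0, 0 } };
    uint32_t seed = 1u;

    for (int i = 0; i < 2 * LEN; i++) {
        seed = seed * 1664525u + 1013904223u;      /* small LCG test signal */
        in[i] = (int16_t)((int32_t)(seed >> 16) - 32768);
    }
    biquad_strided(&sa[0], b, a, in, out_a, LEN, 2);
    biquad_strided(&sa[1], b, a, in + 1, out_a + 1, LEN, 2);
    biquad_stereo(sb, b, a, in, out_b, LEN);

    for (int i = 0; i < 2 * LEN; i++)
        if (out_a[i] != out_b[i])
            return 0;
    return 1;
}
```

Since both variants run the same per-sample step, their outputs are
identical, so timing them against each other isolates the effect of merging
the channel loops from whatever the Neon version adds on top.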

> A_Q28 is split to 2 14-bit (or 16-bit, whatever) integers, to make the
> multiplication operation within 32-bits. NEON can do 32-bit x 32-bit =
> 64-bit using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', and it
> could possibly be faster and less rounding/shifting errors than above C
> code. But it may increase difficulties for other CPUs not supporting
> 32-bit multiplication.

OK, so I'm not totally opposed to that, but it increases the
testing/maintenance cost so it needs to be worth it. So the question is
how much speedup can you get and how close you can make the result to
the original function. If you can make the output be always within one
or two LSBs of the C version, then the asm check can simply be a little
bit more lax than usual. Otherwise it becomes more complicated. This
isn't a function that scares me too much about going non-bitexact, but
it's also not one of the big complexity costs either. In any case, let
me know what you find.
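
As a rough illustration of the kind of measurement meant here, the sketch
below compares a split-coefficient multiply (low 14 bits and upper part
handled separately with 32x16-bit products) against a single 64-bit product,
and reports the largest difference seen. It is only an approximation of the
arithmetic described in the quoted text: the helper names smulwb,
mul_q28_split, mul_q28_wide and max_split_vs_wide_diff are hypothetical, it
covers a single multiply rather than the whole filter, and it assumes >> on
negative values is an arithmetic shift.

```c
#include <stdint.h>

/* (a32 * (int16_t)b32) >> 16, a 32x16-bit multiply keeping the top bits.
 * Hypothetical helper written for this sketch. */
static int32_t smulwb(int32_t a32, int32_t b32)
{
    return (int32_t)(((int64_t)a32 * (int16_t)b32) >> 16);
}

/* Split path: a_Q28 = hi * 2^14 + lo with 0 <= lo < 2^14.
 * Assumes |a_Q28| <= 2^29 so that hi fits in 16 bits. Result is Q12. */
static int32_t mul_q28_split(int32_t out32_Q14, int32_t a_Q28)
{
    int32_t lo = a_Q28 & 0x3FFF;                 /* lower 14 bits */
    int32_t hi = a_Q28 >> 14;                    /* upper part    */
    int32_t t = smulwb(out32_Q14, lo);
    int32_t lo_term = ((t >> 13) + 1) >> 1;      /* rounded shift by 14 */
    return lo_term + smulwb(out32_Q14, hi);
}

/* Wide path: one 32x32 -> 64-bit product, the operation that
 * vmull_s32() provides per lane on Neon. Result is Q12. */
static int32_t mul_q28_wide(int32_t out32_Q14, int32_t a_Q28)
{
    return (int32_t)(((int64_t)out32_Q14 * a_Q28) >> 30);
}

/* Sweep pseudo-random inputs and return the largest |split - wide|. */
static int32_t max_split_vs_wide_diff(uint32_t trials)
{
    uint32_t seed = 12345u;
    int64_t worst = 0;

    for (uint32_t i = 0; i < trials; i++) {
        seed = seed * 1664525u + 1013904223u;    /* small LCG */
        int32_t out32_Q14 = (int32_t)(seed >> 1) - (1 << 30);
        seed = seed * 1664525u + 1013904223u;
        int32_t a_Q28 = (int32_t)(seed >> 2) - (1 << 29);

        int64_t d = (int64_t)mul_q28_split(out32_Q14, a_Q28)
                  - (int64_t)mul_q28_wide(out32_Q14, a_Q28);
        if (d < 0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return (int32_t)worst;
}
```

Calling max_split_vs_wide_diff() with a large trial count gives a per-multiply
bound. The real filter feeds its output back into the state, so the
difference over a full signal may be larger and would need to be measured on
the actual function before relaxing the asm check.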

Cheers,

	Jean-Marc