[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Wed Mar 1 18:58:29 UTC 2017

Linfeng Zhang wrote:
> xcorr_kernel() itself is great and provides many gains. The only issue
> is that calling it in a for loop makes it less efficient.

Do you think it would be possible to improve the API of xcorr_kernel() 
so that calling it in a loop is more efficient?

I haven't looked at an instruction-level profile, but I find it hard to 
believe that the function prologue/epilogue is really responsible for 1% 
to 1.5% of the whole decoder cost. Perhaps the expensive part is just
bouncing the values in and out of memory from the NEON pipeline, or
something like that? Otherwise it seems to be doing exactly the same
things as your celt_fir() (unless I've missed something, which is 
certainly possible).
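
For reference, here is a minimal sketch of the loop structure under
discussion, written in plain float C rather than with Opus's fixed-point
macros. The names xcorr_kernel4() and fir_via_kernel() are hypothetical
stand-ins, not the actual Opus functions, and N is assumed to be a
multiple of 4 to keep it short. Note that sum[4] is passed by pointer,
so unless the kernel gets inlined, the four accumulators have to go
through memory on each call. That is one place the per-call cost could
plausibly come from, though only a profile would confirm it.

```c
#include <assert.h>

/* Hypothetical stand-in for xcorr_kernel(): accumulates four adjacent
   correlation lags, sum[k] += x[j] * y[j + k] for k = 0..3.
   Reads y[0] through y[len + 2]. */
static void xcorr_kernel4(const float *x, const float *y,
                          float sum[4], int len)
{
    int j, k;
    assert(len > 0);
    for (j = 0; j < len; j++)
        for (k = 0; k < 4; k++)
            sum[k] += x[j] * y[j + k];
}

/* FIR filter built on the kernel (illustrative only):
     y[i] = x[i] + sum over j in [0, ord) of num[j] * x[i - 1 - j]
   x must have ord valid history samples before x[0].
   rnum holds the taps reversed: rnum[j] = num[ord - 1 - j].
   Assumption: N is a multiple of 4, and x and y do not overlap. */
static void fir_via_kernel(const float *x, const float *rnum,
                           float *y, int N, int ord)
{
    int i, k;
    assert(N % 4 == 0);
    for (i = 0; i < N; i += 4) {
        float sum[4];
        /* Seed the accumulators with the direct-path samples. */
        for (k = 0; k < 4; k++)
            sum[k] = x[i + k];
        /* One kernel call produces four outputs. The accumulators are
           handed over through the sum[] array in memory. */
        xcorr_kernel4(rnum, x + i - ord, sum, ord);
        for (k = 0; k < 4; k++)
            y[i + k] = sum[k];
    }
}
```

With ord = 2, taps num = {0.5, 0.25} (so rnum = {0.25, 0.5}), history
{1, 2} and input {3, 4, 5, 6}, this produces {4.25, 6, 7.75, 9.5}.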

The other advantage to wiring up xcorr_kernel() is that it applies in 
more places than your intrinsics-only celt_fir() implementation.