[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON
Timothy B. Terriberry
tterribe at xiph.org
Wed Mar 1 18:58:29 UTC 2017
Linfeng Zhang wrote:
> xcorr_kernel() itself is great and provides many gains. The only issue
> is that calling it in a for loop makes it less efficient.
Do you think it would be possible to improve the API of xcorr_kernel()
so that calling it in a loop is more efficient?
I haven't looked at an instruction-level profile, but I find it hard to
believe that the function prologue/epilogue is really responsible for 1%
to 1.5% of the whole decoder cost. Perhaps it is just bouncing the
values in and out of memory from the NEON pipeline or something like
that which is expensive? Otherwise it seems to be doing exactly the same
things as your celt_fir() (unless I've missed something, which is
The other advantage to wiring up xcorr_kernel() is that it applies in
more places than your intrinsics-only celt_fir() implementation.
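
To make the comparison concrete, here is a minimal sketch of an FIR filter built by calling a four-output correlation kernel in a loop. It is a simplified floating-point illustration, not the actual Opus code: the names xcorr_kernel_sketch(), fir_sketch() and MAX_ORD_SKETCH are hypothetical, and the fixed-point scaling, rounding and arch dispatch of the real functions are left out. It assumes N is a multiple of 4 and that the input buffer carries ord samples of history in front of the N new samples.

```c
#include <assert.h>

#define MAX_ORD_SKETCH 32   /* hypothetical cap on the filter order */

/* Stand-in for xcorr_kernel(): accumulates four neighbouring
   correlation sums, sum[k] += x[j]*y[j+k] for k = 0..3. */
static void xcorr_kernel_sketch(const float *x, const float *y,
                                float sum[4], int len)
{
   int j, k;
   for (j = 0; j < len; j++)
      for (k = 0; k < 4; k++)
         sum[k] += x[j]*y[j+k];
}

/* FIR filter that calls the kernel once per group of four outputs:
      y[i] = x[i] + sum_{j=0}^{ord-1} num[j]*x[i-1-j]
   hist points at ord history samples followed by the N inputs,
   so x[i] is hist[ord+i]. */
static void fir_sketch(const float *hist, const float *num,
                       float *y, int N, int ord)
{
   float rnum[MAX_ORD_SKETCH];
   const float *x = hist + ord;
   int i, k;
   assert(ord <= MAX_ORD_SKETCH && N%4 == 0);
   /* Reverse the taps so the kernel can walk forward through memory. */
   for (i = 0; i < ord; i++)
      rnum[i] = num[ord-1-i];
   for (i = 0; i < N; i += 4)
   {
      float sum[4];
      /* Seed the accumulators with the current inputs. */
      for (k = 0; k < 4; k++)
         sum[k] = x[i+k];
      /* One kernel call per four outputs: this call boundary is the
         per-iteration overhead under discussion. */
      xcorr_kernel_sketch(rnum, hist+i, sum, ord);
      for (k = 0; k < 4; k++)
         y[i+k] = sum[k];
   }
}
```

In this shape the accumulators are written to the sum[] array and read back around every call, which is one place where values could end up bouncing between registers and memory if the kernel is not inlined.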