[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Linfeng Zhang linfengz at google.com
Wed Mar 1 18:49:48 UTC 2017


>
> I believe the solution would be to always have either:
> 1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or
> 2) USE_CELT_FIR=0 and no ovflw() in the xcorr code
>

I prefer to create a function named silk_fir() with optimization to do the
calculation when USE_CELT_FIR=0.

xcorr_kernel() itself is great and provides many gains. The only issue is
that calling it in a for loop makes it less efficient.

xcorr_kernel() is called in several functions including
celt_fir(), celt_pitch_xcorr() and celt_iir(). All these functions are not
heavy hitters.
silk_LPC_analysis_filter()'s CPU cycles are 6.8% with complexity 8 and 8.9%
with complexity 5 out of the whole encoder. It probably makes sense to have
a specific optimization to not calling xcorr_kernel() too many times to
save 1% to 1.5% CPU cycles here.

How do you think?

Thanks,
Linfeng


> We can then switch between the two cases at compile time. I think 2) is
> appropriate when we have --enable-assertions without --enable-check-asm,
> wheras 1) is appropriate the rest of the time. That way we can test for
> overflows in the CELT code, without preventing optimization of the SILK
> code.
>
> > Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must
> > behave the same for both floating-point and fixed-point, and this is why
> > we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc.
>
> I don't think you will need these anymore, but if you ever need
> fixed-point macros that remain integer for float compilation, then you
> should use the silk_*() fixed-point macros (and the code should be in
> silk/).
>
> > For the NEON optimization part, the previous celt_fir() optimization
> > calls xcorr_kernel(). We tested and found that calling
> > the xcorr_kernel() optimization didn't help too much here. The
> > optimization in the patch is about 1% faster than simply
> > calling xcorr_kernel() for the whole encoder. Considering the really
> > small size of the new optimization, it's better to not call
> > xcorr_kernel() to get 1% faster.
>
> It seems like the new Neon code does essentially the same thing as the
> existing xcorr_kernel(), so the two should remain shared to reduce
> maintenance burden and maximize speed. If you're able to come up with
> something faster than xcorr_kernel() (which apparently you have), then
> it seems like it should be usable by the other functions that currently
> use xcorr_kernel(). Or did I miss anything here?
>
> Cheers,
>
>         Jean-Marc
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170301/89ed0208/attachment.html>


More information about the opus mailing list