[opus] [PATCH] Refactor silk_LPC_analysis_filter() & Optimize celt_fir_permit_overflow() for ARM NEON

Sat Feb 18 00:30:17 UTC 2017

Hi Linfeng,

On 15/02/17 04:05 PM, Linfeng Zhang wrote:
> The original celt_fir() is a little bit messy. It has 2 branches chosen
> by #ifdef SMALL_FOOTPRINT.

Yeah, I agree that the #ifdef SMALL_FOOTPRINT in celt_fir() is a bit of
overkill since it's not saving much code space. I just pushed a commit
that gets rid of it, also refactoring the #else case a bit (see below).

> For floating-point, the 2 branches are identical (except the operation
> sequence of accumulating x[i] to sum, which is not a big deal).
> For fixed-point, the 2 branches are different. I separate them into 2
> functions: the new celt_fir(), and celt_fir_permit_overflow() which is
> the SMALL_FOOTPRINT branch.
> The only difference for fixed-point is:
> celt_fir(): the sum is truncated first and then accumulated to x[i] and
> saturated.
> celt_fir_permit_overflow(): x[i] is accumulated to the sum first and
> then truncated saturated.

Actually, the two branches are bit-exact for fixed-point. There is
indeed a difference in where x[i] gets accumulated, but because in the
SMALL_FOOTPRINT case it first gets shifted up by SIG_SHIFT, the result
of the downshift (also by SIG_SHIFT) is the same no matter when it gets
added. That being said, I thought adding at the beginning was nicer so I
changed the remaining code to do that.

> Maybe this is the reason why silk_LPC_analysis_filter() switched the FIR
> from celt_fir() to celt_fir_permit_overflow() half a year ago.

No, it's another issue. silk_LPC_analysis_filter() was always bit-exact
with celt_fir(), with or without SMALL_FOOTPRINT. The only difference
lies with the signed integer overflow suppression.
silk_LPC_analysis_filter() relies on the knowledge that signed integer
overflows can occur during the accumulation, but they are guaranteed to
cancel each other (i.e. equal wrap-arounds in each direction). For that
reason, we use silk_SMLABB_ovflw() which casts to unsigned to avoid the
undefined behaviour and thus the ubsan warnings. In celt_fir() and the
pitch correlation code it uses, we know there should not be signed
overflows, so we would like to detect any problem when using ubsan.

I believe the solution would be to always have either:
1) USE_CELT_FIR=1 and use ovflw() macros in the xcorr code; or
2) USE_CELT_FIR=0 and no ovflw() in the xcorr code

We can then switch between the two cases at compile time. I think 2) is
appropriate when we have --enable-assertions without --enable-check-asm,
wheras 1) is appropriate the rest of the time. That way we can test for
overflows in the CELT code, without preventing optimization of the SILK
code.

> Because of silk_LPC_analysis_filter(), celt_fir_permit_overflow() must
> behave the same for both floating-point and fixed-point, and this is why
> we defined ADD32_FIXED(), ..., PSHR32_FIXED() etc.

I don't think you will need these anymore, but if you ever need
fixed-point macros that remain integer for float compilation, then you
should use the silk_*() fixed-point macros (and the code should be in
silk/).

> For the NEON optimization part, the previous celt_fir() optimization
> calls xcorr_kernel(). We tested and found that calling
> the xcorr_kernel() optimization didn't help too much here. The
> optimization in the patch is about 1% faster than simply
> calling xcorr_kernel() for the whole encoder. Considering the really
> small size of the new optimization, it's better to not call
> xcorr_kernel() to get 1% faster.

It seems like the new Neon code does essentially the same thing as the
existing xcorr_kernel(), so the two should remain shared to reduce
maintenance burden and maximize speed. If you're able to come up with
something faster than xcorr_kernel() (which apparently you have), then
it seems like it should be usable by the other functions that currently
use xcorr_kernel(). Or did I miss anything here?

Cheers,

	Jean-Marc