[opus] 2 patches related to silk_biquad_alt() optimization

Tue Apr 25 17:37:14 UTC 2017

On Mon, Apr 24, 2017 at 5:52 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:

> On 24/04/17 08:03 PM, Linfeng Zhang wrote:
> > Tested on my chromebook, when stride (channel) == 1, the optimization
> > has no gain compared with C function.
>
> You mean that the Neon code is the same speed as the C code for
> stride==1? This is not terribly surprising for an IIR filter.
>

Yes, for stride == 1 the NEON code runs at the same speed as the C code.

>
> > When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6%
> > at Complexity 8) compared with C function.
>
> Is that gain due to Neon or simply due to computing two channels in
> parallel? For example, if you make a special case in the C code to
> handle both channels in the same loop, what kind of performance do you get?
>

Tested at Complexity 8, the gain splits roughly in half: handling both
channels in the same loop in C is 0.8% faster, and using NEON on top of
that is an additional 0.8% faster.
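
To illustrate what "handling both channels in the same loop" means, here is
a minimal sketch. It is not the silk_biquad_alt() code: it uses float
arithmetic and a textbook transposed direct form II biquad, and the names
biquad_mono and biquad_stereo are hypothetical. The point is only that the
stereo version walks the interleaved buffer once and keeps two independent
filter states.

```c
#include <stddef.h>

/* Hypothetical float biquad (transposed direct form II), one channel.
   b[3] are feed-forward coefficients, a[2] feedback coefficients,
   s[2] the filter state. stride is the distance between samples. */
static void biquad_mono(const float *in, float *out, size_t n, size_t stride,
                        const float b[3], const float a[2], float s[2])
{
    for (size_t k = 0; k < n; k++) {
        float x = in[k * stride];
        float y = b[0] * x + s[0];
        s[0] = s[1] + b[1] * x - a[0] * y;
        s[1] = b[2] * x - a[1] * y;
        out[k * stride] = y;
    }
}

/* Same filter applied to two interleaved channels in a single loop.
   s[0..1] hold the left state, s[2..3] the right state. */
static void biquad_stereo(const float *in, float *out, size_t n,
                          const float b[3], const float a[2], float s[4])
{
    for (size_t k = 0; k < n; k++) {
        float xl = in[2 * k];
        float xr = in[2 * k + 1];
        float yl = b[0] * xl + s[0];
        float yr = b[0] * xr + s[2];
        s[0] = s[1] + b[1] * xl - a[0] * yl;
        s[2] = s[3] + b[1] * xr - a[0] * yr;
        s[1] = b[2] * xl - a[1] * yl;
        s[3] = b[2] * xr - a[1] * yr;
        out[2 * k] = yl;
        out[2 * k + 1] = yr;
    }
}
```

The two channels have no data dependency on each other, so the compiler (or
the CPU) is free to overlap their work; that is presumably where the C-only
part of the gain comes from.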

>
> > Please let me know and I can remove the optimization of stride 1 case.
>
> Yeah, if there's Neon code that provides no improvement over C, let's
> stick with C. And if you manage to write C code that has the same
> performance as the Neon code, then that would also be better (both
> easier to maintain and more portable).
>

Will do.

>
> > If it's allowed to skip the split of A_Q28 and replace by 32-bit
> > multiplication (result is 64-bit), probably it could be faster on NEON.
> > This may change the encoder results because of different order of
> > adding, shifting and rounding.
>
> I'm not sure what you mean by that.
>

    /* Negate A_Q28 values and split in two parts */
    A0_L_Q28 = ( -A_Q28[ 0 ] ) & 0x00003FFF;        /* lower part */
    A0_U_Q28 = silk_RSHIFT( -A_Q28[ 0 ], 14 );      /* upper part */
    A1_L_Q28 = ( -A_Q28[ 1 ] ) & 0x00003FFF;        /* lower part */
    A1_U_Q28 = silk_RSHIFT( -A_Q28[ 1 ], 14 );      /* upper part */

    ...

        S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A0_L_Q28 ), 14 );
        S[ 0 ] = silk_SMLAWB( S[ 0 ], out32_Q14, A0_U_Q28 );
        S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], inval );

        S[ 1 ] = silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A1_L_Q28 ), 14 );
        S[ 1 ] = silk_SMLAWB( S[ 1 ], out32_Q14, A1_U_Q28 );
        S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], inval );

Each A_Q28 coefficient is split into two narrower integers (14-bit, or
16-bit, whichever) so that every multiplication stays within 32 bits. NEON
can compute a 32-bit x 32-bit = 64-bit product using
'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', which could possibly be
faster and introduce less rounding/shifting error than the C code above.
However, it may make things harder for other CPUs that do not support such
32-bit multiplications.
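
As a rough illustration of the difference, here is a portable C sketch of
the feedback term computed both ways. The helpers smulwb and rshift_round
are simplified stand-ins I wrote for silk_SMULWB and silk_RSHIFT_ROUND, not
the library macros, and feedback_split / feedback_wide are hypothetical
names. The sketch assumes the upper part of the coefficient fits in 16 bits
and that >> on a negative value is an arithmetic shift.

```c
#include <stdint.h>

/* Stand-in for silk_SMULWB: (a32 * (int16_t)b32) >> 16.
   Uses a 64-bit intermediate for brevity; assumes >> on negative
   values is an arithmetic shift, as on common compilers. */
static int32_t smulwb(int32_t a32, int32_t b32)
{
    return (int32_t)(((int64_t)a32 * (int16_t)b32) >> 16);
}

/* Stand-in for silk_RSHIFT_ROUND, valid for shift > 1. */
static int32_t rshift_round(int32_t a, int shift)
{
    return ((a >> (shift - 1)) + 1) >> 1;
}

/* Split form: lower 14 bits and upper part are multiplied separately,
   mirroring the first two state-update steps quoted above.
   Assumes (negA_Q28 >> 14) fits in 16 bits. */
static int32_t feedback_split(int32_t out32_Q14, int32_t negA_Q28)
{
    int32_t A_L = negA_Q28 & 0x00003FFF;   /* lower part */
    int32_t A_U = negA_Q28 >> 14;          /* upper part */
    return rshift_round(smulwb(out32_Q14, A_L), 14)
         + smulwb(out32_Q14, A_U);
}

/* Wide form: one 32 x 32 -> 64-bit product, then a single shift.
   This is the scalar analogue of what vmull_s32 would provide. */
static int32_t feedback_wide(int32_t out32_Q14, int32_t negA_Q28)
{
    return (int32_t)(((int64_t)out32_Q14 * negA_Q28) >> 30);
}
```

The two forms round at different points, so their results can differ in the
last bit. That is the "different order of adding, shifting and rounding"
that may change the encoder output.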

Thanks,
Linfeng