[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

Tue Jun 6 18:27:24 UTC 2017

Thanks, Ulrich!

Yes, using
        celt_assert(1.0 + celt_inner_prod_neon_float_c_simulation(x, y, N) == 1.0 + xy);
        celt_assert(1.0 + xy1_c == 1.0 + *xy1);
        celt_assert(1.0 + xy2_c == 1.0 + *xy2);
avoids the need for VERY_SMALL.
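
To illustrate why adding 1.0 to both sides works, here is a minimal sketch. The helper names equal_after_adding_one and equal_exact are hypothetical and not part of Opus. In C the literal 1.0 is a double, so both float operands are promoted, and any magnitude far below the double machine epsilon (DBL_EPSILON, about 2.2e-16) is absorbed by the addition, while values of ordinary magnitude still have to match. For reference, the get_EPS() loop quoted further down converges to DBL_EPSILON / 2 (the 1.11e-16 it prints), which is this absorption threshold rather than the smallest nonzero double.

```c
/* Hypothetical helper: compares two floats after adding 1.0 (a double
 * literal). Differences far below DBL_EPSILON vanish in the addition. */
int equal_after_adding_one(float a, float b)
{
    return 1.0 + a == 1.0 + b;
}

/* Plain equality, for contrast. */
int equal_exact(float a, float b)
{
    return a == b;
}
```

With a = 5.605194e-45f (bit pattern 0x4) and b = 0.0f, equal_exact() reports a mismatch while equal_after_adding_one() reports a match. For 1000.0f versus 1000.25f both report a mismatch.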

Hi Jean-Marc,

I added
    {
        const opus_val32 xy_c = celt_inner_prod_neon_float_c_simulation(x, y, N);
        const int32_t *x_bin = (int32_t*)x;
        const int32_t *y_bin = (int32_t*)y;
        const int32_t *xy_bin = (int32_t*)&xy;
        const int32_t *xy_bin_c = (int32_t*)&xy_c;
        // if((xy_c != xy) && (xy_c != 0.0) && (xy != 0.0)) {
        if(xy_c != xy) {
            printf("\n xy_c = %9f, xy   = %9f", xy_c, xy);
            printf(" | xy_c = %13e, xy   = %13e", xy_c, xy);
            printf(" | xy_c (bin) = 0x%8x, xy   (bin) = 0x%8x\n", *xy_bin_c, *xy_bin);
            printf("\n N = %d", N);
            for (i = 0; i < N; i++) {
              printf("\n x[%d] = %9f, y[%d] = %9f", i, x[i], i, y[i]);
              printf(" | x[%d] = %13e, y[%d] = %13e", i, x[i], i, y[i]);
              printf(" | x[%d] (bin) = 0x%8x, y[%d] (bin) = 0x%8x", i, x_bin[i], i, y_bin[i]);
            }
            printf("\n\n");
        }
    }
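
As a side note, the casts above read a float through an int32_t pointer, which a compiler that assumes strict aliasing is not required to handle as intended. A minimal alternative sketch, using a hypothetical helper float_bits() (not an Opus function) and assuming float is 32 bits wide, copies the bytes with memcpy() instead:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw bit pattern of a float without
 * pointer casts. Assumes float is a 32-bit IEEE 754 value. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

The result could then be printed with, for example, printf("0x%08x", (unsigned)float_bits(xy)).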

And got the following two cases when testing speech_mono_32_48kHz.pcm
(Download: https://drive.google.com/file/d/0B2bjttuYjfVYaHBDZE1XV3B0MHM) on
NEON:

$ ./opus_demo -e voip 48000 1 32000 -complexity 8 speech_mono_32_48kHz.pcm tmp.opus

libopus 1.2-beta-27-g6c51a195-dirty
Encoding 48000 Hz input at 32.000 kb/s in auto bandwidth with 960-sample frames.

 xy_c =  0.000000, xy   =  0.000000 | xy_c =  5.605194e-45, xy   =  0.000000e+00 | xy_c (bin) = 0x       4, xy   (bin) = 0x       0

 N = 8
 x[0] = -0.000000, y[0] = -0.000000 | x[0] = -7.783648e-23, y[0] = -7.783648e-23 | x[0] (bin) = 0x9abc3273, y[0] (bin) = 0x9abc3273
 x[1] = -0.000000, y[1] = -0.000000 | x[1] = -1.862279e-23, y[1] = -1.862279e-23 | x[1] (bin) = 0x99b41bca, y[1] (bin) = 0x99b41bca
 x[2] =  0.000000, y[2] =  0.000000 | x[2] =  1.092297e-23, y[2] =  1.092297e-23 | x[2] (bin) = 0x195347ee, y[2] (bin) = 0x195347ee
 x[3] = -0.000000, y[3] = -0.000000 | x[3] = -5.171255e-25, y[3] = -5.171255e-25 | x[3] (bin) = 0x97200ae8, y[3] (bin) = 0x97200ae8
 x[4] = -0.000000, y[4] = -0.000000 | x[4] = -4.773915e-24, y[4] = -4.773915e-24 | x[4] (bin) = 0x98b8ae90, y[4] (bin) = 0x98b8ae90
 x[5] = -0.000000, y[5] = -0.000000 | x[5] = -3.717311e-25, y[5] = -3.717311e-25 | x[5] (bin) = 0x96e61724, y[5] (bin) = 0x96e61724
 x[6] = -0.000000, y[6] = -0.000000 | x[6] = -1.724025e-24, y[6] = -1.724025e-24 | x[6] (bin) = 0x980563d5, y[6] (bin) = 0x980563d5
 x[7] = -0.000000, y[7] = -0.000000 | x[7] = -2.245937e-24, y[7] = -2.245937e-24 | x[7] (bin) = 0x982dc55f, y[7] (bin) = 0x982dc55f

==============================================================

 xy_c =  0.000000, xy   =  0.000000 | xy_c =  1.121039e-44, xy   =  0.000000e+00 | xy_c (bin) = 0x       8, xy   (bin) = 0x       0

 N = 8
 x[0] = -0.000000, y[0] = -0.000000 | x[0] = -1.000134e-22, y[0] = -1.000134e-22 | x[0] (bin) = 0x9af1d148, y[0] (bin) = 0x9af1d148
 x[1] =  0.000000, y[1] =  0.000000 | x[1] =  3.052170e-23, y[1] =  3.052170e-23 | x[1] (bin) = 0x1a139809, y[1] (bin) = 0x1a139809
 x[2] = -0.000000, y[2] = -0.000000 | x[2] = -2.135591e-23, y[2] = -2.135591e-23 | x[2] (bin) = 0x99ce8aaf, y[2] (bin) = 0x99ce8aaf
 x[3] =  0.000000, y[3] =  0.000000 | x[3] =  1.180839e-23, y[3] =  1.180839e-23 | x[3] (bin) = 0x19646856, y[3] (bin) = 0x19646856
 x[4] = -0.000000, y[4] = -0.000000 | x[4] = -1.230446e-23, y[4] = -1.230446e-23 | x[4] (bin) = 0x996e00bc, y[4] (bin) = 0x996e00bc
 x[5] =  0.000000, y[5] =  0.000000 | x[5] =  6.443248e-24, y[5] =  6.443248e-24 | x[5] (bin) = 0x18f942d6, y[5] (bin) = 0x18f942d6
 x[6] = -0.000000, y[6] = -0.000000 | x[6] = -8.497414e-24, y[6] = -8.497414e-24 | x[6] (bin) = 0x99245d28, y[6] (bin) = 0x99245d28
 x[7] =  0.000000, y[7] =  0.000000 | x[7] =  3.849347e-24, y[7] =  3.849347e-24 | x[7] (bin) = 0x1894ea17, y[7] (bin) = 0x1894ea17

There are 3 possible reasons:
1. celt_inner_prod_neon_float_c_simulation() may have a bug. Please help me
find it, if there is one.
2. Though unlikely, NEON may not be IEEE 754-compliant when dealing with
floating-point values near 0.
3. Even less likely, gcc may not be IEEE 754-compliant here. :)

Since x[i] == y[i] in both cases, they are actually calculating the energy.
(-1.000134e-22 * -1.000134e-22), about 1.0e-44, is larger than the smallest
positive single-precision number (the smallest subnormal, about 1.4e-45) and
should be represented as nonzero (such as 0x8). I don't know why NEON gives a
result of 0.
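
One possible explanation, which is not confirmed anywhere in this thread: on ARMv7, NEON arithmetic flushes subnormal values to zero, and every product in the two dumps above is subnormal or smaller. The sketch below rebuilds both dumps from their logged bit patterns and sums the squares either with ordinary IEEE 754 single-precision arithmetic or with a software imitation of flush-to-zero. The names case1_bits, case2_bits, flush and energy_bits are hypothetical, and the flush-to-zero imitation is an assumption about the hardware, not a trace of it.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Bit patterns of x[0..7] (identical to y[0..7]) from the two dumps. */
const uint32_t case1_bits[8] = {
    0x9abc3273u, 0x99b41bcau, 0x195347eeu, 0x97200ae8u,
    0x98b8ae90u, 0x96e61724u, 0x980563d5u, 0x982dc55fu
};
const uint32_t case2_bits[8] = {
    0x9af1d148u, 0x1a139809u, 0x99ce8aafu, 0x19646856u,
    0x996e00bcu, 0x18f942d6u, 0x99245d28u, 0x1894ea17u
};

/* Imitates flush-to-zero: any value smaller in magnitude than the
 * smallest normal float (FLT_MIN) becomes 0. */
static float flush(float v)
{
    return (v > -FLT_MIN && v < FLT_MIN) ? 0.0f : v;
}

/* Sums x[i] * x[i] in single precision and returns the bit pattern of
 * the result. When ftz is nonzero, each product and each partial sum
 * is passed through flush(). */
uint32_t energy_bits(const uint32_t *bits, int n, int ftz)
{
    float sum = 0.0f;
    uint32_t out;
    int i;
    for (i = 0; i < n; i++) {
        float x;
        /* volatile keeps the compiler from fusing the multiply and add */
        volatile float prod;
        memcpy(&x, &bits[i], sizeof x);
        prod = x * x;
        sum += ftz ? flush(prod) : prod;
        if (ftz)
            sum = flush(sum);
    }
    memcpy(&out, &sum, sizeof out);
    return out;
}
```

Without the imitation, energy_bits() returns 0x4 for the first dump and 0x8 for the second, matching xy_c. With it, both return 0, matching xy. If this is indeed the cause, the mismatch would come from subnormal handling rather than from a bug in the simulation function.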

Thanks,
Linfeng

On Tue, Jun 6, 2017 at 12:03 AM, Ulrich Windl
<Ulrich.Windl at rz.uni-regensburg.de> wrote:

> >>> Linfeng Zhang <linfengz at google.com> wrote on 06.06.2017 at 06:46 in
> message
> <CAKoqLCAfj+fDUMLfN4dLNSZ4NNAZpaSt_BWZRp+7XBqfhiSqiQ at mail.gmail.com>:
> > Hi Jean-Marc,
> >
> > I tried "==" before, and it failed when both results are 0.0. Maybe the
> > exponent or sign differs because of the different 0.0 representation
> > in NEON. If anybody knows how to handle this 0.0 comparison, that would be
> > great.
> > Or just use if(a==b || (a==0.0 && b==0.0)) ... but I haven't tried this.
>
>
> From some faint memory of my math lessons I had produced code like this
> to get the smallest floating-point number different from zero:
>
> double  EPS;            /* smallest number not equal to 0.0 */
>
> /* refined estimate of EPS */
> static  double  get_EPS(double eps)
> {
>
>         while ( 1.0 + eps != 1.0 )
>                 eps /= 2;
>         return(eps);
> }
>
> EPS = get_EPS(1.0);
>
> On the x86_64 platform I get:
> (gdb) p EPS
> $1 = 1.1102230246251565e-16
>
> Maybe it can help...
>
> Regards,
> Ulrich
>
> >
> > Thanks,
> > Linfeng
> >
> > On Mon, Jun 5, 2017 at 8:43 PM Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> >
> >> Hi Linfeng,
> >>
> >> On 05/06/17 03:31 PM, Linfeng Zhang wrote:
> >> > Yes, we'll have one more patch set related to xcorr next week. Please
> >> > don't wait if it's too late for the 1.2 release.
> >>
> >> Assuming there's no issue with the patches, next week isn't too late.
> >>
> >> Also, I've started looking at your patches. So far there's one thing
> >> that puzzles me a bit. In the OPUS_CHECK_ASM section of patch 0004, you
> >> have:
> >>
> >> +        celt_assert(ABS32(xy1_c - *xy1) <= VERY_SMALL);
> >>
> >> Given the normal range of the values (the xy values are often much
> >> larger than one) and the precision involved here (24-bit mantissa), it
> >> seems like this test can only succeed if the two values are actually
> >> equal. Is the float patch actually bit-exact? If so, then maybe you
> >> should be using actual equality. If not, then I guess we need to find
> >> the right condition (which isn't obvious for floating point).
> >>
> >> Cheers,
> >>
> >>         Jean-Marc
> >>
> >>
> >> > Thanks,
> >> > Linfeng
> >> >
> >> > On Mon, Jun 5, 2017 at 12:28 PM, Linfeng Zhang <linfengz at google.com> wrote:
> >> >
> >> >     Hi Jean-Marc,
> >> >
> >> >     I attached the new version in inner_prod_5patches_v2.zip, which
> >> >     is synced to the current master.
> >> >
> >> >     For fixed-point ARM, only
> >> >     0003-Optimize-fixed-point-celt_inner_prod-and-dual_inner_.patch
> >> >     changes the performance.
> >> >     For floating-point ARM, only
> >> >     0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.patch
> >> >     changes the performance.
> >> >     Patches 1 and 2 are code clean-up and can only affect x86 performance.
> >> >     Patch 5 has a negligible effect on floating-point ARM performance.
> >> >
> >> >     Thanks,
> >> >     Linfeng
> >> >
> >> >     On Fri, Jun 2, 2017 at 11:26 AM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> >> >
> >> >         Hi Linfeng,
> >> >
> >> >         I'll look into your patches. Can you let me know what's the
> >> >         expected effect on performance (if any) for each of your
> >> >         patches? Also, are these all the patches you intend to merge
> >> >         for 1.2 or are there more upcoming ones?
> >> >
> >> >         Cheers,
> >> >
> >> >                 Jean-Marc
> >> >
> >> >         On 01/06/17 06:33 PM, Linfeng Zhang wrote:
> >> >         > Hi,
> >> >         >
> >> >         > Attached are 5 patches related to celt_inner_prod()
> >> >         > and dual_inner_prod() NEON intrinsics optimization.
> >> >         >
> >> >         > In
> >> >         > 0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.patch,
> >> >         > the optimization changed the order of the floating-point
> >> >         > operations in the inner products, which will change the
> >> >         > results. I created celt_inner_prod_neon_float_c_simulation()
> >> >         > and dual_inner_prod_neon_float_c_simulation() to simulate the
> >> >         > order of floating-point operations in the NEON optimization
> >> >         > and compare their results. Sorry that I cannot bound the
> >> >         > distance between the original C function and the NEON
> >> >         > function by any given reasonably small number or ratio. It's
> >> >         > easy to create an input for which 0 and 1,000 are both
> >> >         > correct results just by manipulating the inner product order.
> >> >         >
> >> >         > The total speed gain is about 1.0% for the fixed-point
> >> >         > encoder and 1.8% for the floating-point encoder, at
> >> >         > complexity 8, tested on my Chromebook.
> >> >         >
> >> >         > Thanks,
> >> >         > Linfeng
> >> >         >
> >> >         >