[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics
Linfeng Zhang
linfengz at google.com
Tue Jun 6 18:27:24 UTC 2017
Thanks, Ulrich!
Yes, using
celt_assert(1.0 + celt_inner_prod_neon_float_c_simulation(x, y, N)
== 1.0 + xy);
celt_assert(1.0 + xy1_c == 1.0 + *xy1);
celt_assert(1.0 + xy2_c == 1.0 + *xy2);
can avoid the use of VERY_SMALL.
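For readers following along, here is a minimal sketch of why adding 1.0 to both sides hides tiny differences. The helper names equal_after_adding_one() and show_comparison() are hypothetical and not part of Opus. The sketch assumes opus_val32 is float, as in the floating-point build, so that 1.0 + a is evaluated in double precision.

```c
#include <stdio.h>

/* Hypothetical helper, not part of Opus: compares two floats after adding
 * 1.0 to each. The constant 1.0 is a double, so both sums are computed in
 * double precision, where anything much smaller than 1e-16 is absorbed. */
static int equal_after_adding_one(float a, float b)
{
    return 1.0 + a == 1.0 + b;
}

/* Prints how plain == and the 1.0 + x form treat a pair of values. */
static void show_comparison(float a, float b)
{
    printf("a = %e, b = %e: a == b is %d, 1.0 + a == 1.0 + b is %d\n",
           a, b, a == b, equal_after_adding_one(a, b));
}
```

Calling show_comparison(5.605194e-45f, 0.0f) reproduces the situation in the first case below: the two values differ under ==, but compare equal once 1.0 is added. The same absorption hides any mismatch between two results that are both tiny, while for values of magnitude around 1 or larger the test behaves like exact equality.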
Hi Jean-Marc,
I added
{
const opus_val32 xy_c = celt_inner_prod_neon_float_c_simulation(x,
y, N);
const int32_t *x_bin = (int32_t*)x;
const int32_t *y_bin = (int32_t*)y;
const int32_t *xy_bin = (int32_t*)&xy;
const int32_t *xy_bin_c = (int32_t*)&xy_c;
// if((xy_c != xy) && (xy_c != 0.0) && (xy != 0.0)) {
if(xy_c != xy) {
printf("\n xy_c = %9f, xy = %9f", xy_c, xy);
printf(" | xy_c = %13e, xy = %13e", xy_c, xy);
printf(" | xy_c (bin) = 0x%8x, xy (bin) = 0x%8x\n",
*xy_bin_c, *xy_bin);
printf("\n N = %d", N);
for (i = 0; i < N; i++) {
printf("\n x[%d] = %9f, y[%d] = %9f", i, x[i], i, y[i]);
printf(" | x[%d] = %13e, y[%d] = %13e", i, x[i], i, y[i]);
printf(" | x[%d] (bin) = 0x%8x, y[%d] (bin) = 0x%8x", i,
x_bin[i], i, y_bin[i]);
}
printf("\n\n");
}
}
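The pointer casts in the debug block read a float through an int32_t pointer, which C's aliasing rules do not strictly allow. A possible alternative, sketched here with hypothetical helper names (float_bits() and dump_float() are not part of Opus) and assuming float is 32 bits wide, copies the bytes with memcpy() and zero-pads the hexadecimal output:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the bit pattern of a 32-bit float.
 * memcpy() avoids reading a float object through an integer pointer. */
static uint32_t float_bits(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    return bits;
}

/* Prints one value in fixed, scientific, and hexadecimal bit-pattern form. */
static void dump_float(const char *name, float v)
{
    printf("%s = %9f | %13e | 0x%08" PRIx32 "\n",
           name, v, v, float_bits(v));
}
```

With %08 the bit pattern prints as 0x00000004 rather than being padded with spaces, which makes denormal results easier to spot in the log.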
And got the following two cases when testing speech_mono_32_48kHz.pcm
(Download: https://drive.google.com/file/d/0B2bjttuYjfVYaHBDZE1XV3B0MHM) on
NEON:
$ ./opus_demo -e voip 48000 1 32000 -complexity 8
speech_mono_32_48kHz.pcm tmp.opus
libopus 1.2-beta-27-g6c51a195-dirty
Encoding 48000 Hz input at 32.000 kb/s in auto bandwidth with 960-sample
frames.
xy_c = 0.000000, xy = 0.000000 | xy_c = 5.605194e-45, xy =
0.000000e+00 | xy_c (bin) = 0x 4, xy (bin) = 0x 0
N = 8
x[0] = -0.000000, y[0] = -0.000000 | x[0] = -7.783648e-23, y[0] =
-7.783648e-23 | x[0] (bin) = 0x9abc3273, y[0] (bin) = 0x9abc3273
x[1] = -0.000000, y[1] = -0.000000 | x[1] = -1.862279e-23, y[1] =
-1.862279e-23 | x[1] (bin) = 0x99b41bca, y[1] (bin) = 0x99b41bca
x[2] = 0.000000, y[2] = 0.000000 | x[2] = 1.092297e-23, y[2] =
1.092297e-23 | x[2] (bin) = 0x195347ee, y[2] (bin) = 0x195347ee
x[3] = -0.000000, y[3] = -0.000000 | x[3] = -5.171255e-25, y[3] =
-5.171255e-25 | x[3] (bin) = 0x97200ae8, y[3] (bin) = 0x97200ae8
x[4] = -0.000000, y[4] = -0.000000 | x[4] = -4.773915e-24, y[4] =
-4.773915e-24 | x[4] (bin) = 0x98b8ae90, y[4] (bin) = 0x98b8ae90
x[5] = -0.000000, y[5] = -0.000000 | x[5] = -3.717311e-25, y[5] =
-3.717311e-25 | x[5] (bin) = 0x96e61724, y[5] (bin) = 0x96e61724
x[6] = -0.000000, y[6] = -0.000000 | x[6] = -1.724025e-24, y[6] =
-1.724025e-24 | x[6] (bin) = 0x980563d5, y[6] (bin) = 0x980563d5
x[7] = -0.000000, y[7] = -0.000000 | x[7] = -2.245937e-24, y[7] =
-2.245937e-24 | x[7] (bin) = 0x982dc55f, y[7] (bin) = 0x982dc55f
==============================================================
xy_c = 0.000000, xy = 0.000000 | xy_c = 1.121039e-44, xy =
0.000000e+00 | xy_c (bin) = 0x 8, xy (bin) = 0x 0
N = 8
x[0] = -0.000000, y[0] = -0.000000 | x[0] = -1.000134e-22, y[0] =
-1.000134e-22 | x[0] (bin) = 0x9af1d148, y[0] (bin) = 0x9af1d148
x[1] = 0.000000, y[1] = 0.000000 | x[1] = 3.052170e-23, y[1] =
3.052170e-23 | x[1] (bin) = 0x1a139809, y[1] (bin) = 0x1a139809
x[2] = -0.000000, y[2] = -0.000000 | x[2] = -2.135591e-23, y[2] =
-2.135591e-23 | x[2] (bin) = 0x99ce8aaf, y[2] (bin) = 0x99ce8aaf
x[3] = 0.000000, y[3] = 0.000000 | x[3] = 1.180839e-23, y[3] =
1.180839e-23 | x[3] (bin) = 0x19646856, y[3] (bin) = 0x19646856
x[4] = -0.000000, y[4] = -0.000000 | x[4] = -1.230446e-23, y[4] =
-1.230446e-23 | x[4] (bin) = 0x996e00bc, y[4] (bin) = 0x996e00bc
x[5] = 0.000000, y[5] = 0.000000 | x[5] = 6.443248e-24, y[5] =
6.443248e-24 | x[5] (bin) = 0x18f942d6, y[5] (bin) = 0x18f942d6
x[6] = -0.000000, y[6] = -0.000000 | x[6] = -8.497414e-24, y[6] =
-8.497414e-24 | x[6] (bin) = 0x99245d28, y[6] (bin) = 0x99245d28
x[7] = 0.000000, y[7] = 0.000000 | x[7] = 3.849347e-24, y[7] =
3.849347e-24 | x[7] (bin) = 0x1894ea17, y[7] (bin) = 0x1894ea17
There are 3 possible reasons.
1. Of course, celt_inner_prod_neon_float_c_simulation() may have a bug. Please
help me find it if there is one.
2. Though unlikely, it's possible NEON is not IEEE 754-compliant when
dealing with floating-point values near 0.
3. Though even less likely, it's possible gcc is not IEEE 754-compliant
here. :)
Since x[i] == y[i] in both cases, they are actually calculating the energy.
(-1.000134e-22 * -1.000134e-22) is larger than the smallest positive
single-precision number and should be represented as nonzero (such as
0x8). I don't know why NEON gives a result of 0.
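One thing worth checking, offered as a hypothesis rather than a confirmed diagnosis: the products in both cases are far below the smallest normal float (about 1.18e-38), so they are subnormal. On ARMv7, NEON arithmetic always runs in flush-to-zero mode, which would turn such results into 0, while scalar C code keeps them. The thread does not say which ARM core the Chromebook uses, so this may not apply. The sketch below only verifies the subnormal part in plain C, using the x[0] bit patterns from the log; the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: convert between a 32-bit float and its bit pattern. */
static float float_from_bits(uint32_t bits)
{
    float v;
    memcpy(&v, &bits, sizeof v);
    return v;
}

static uint32_t bits_from_float(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    return bits;
}

/* A float is subnormal when its exponent field is all zeros and its
 * mantissa field is nonzero. */
static int is_subnormal(float v)
{
    uint32_t bits = bits_from_float(v);
    return (bits & 0x7f800000u) == 0 && (bits & 0x007fffffu) != 0;
}

/* Squares the float with the given bit pattern in scalar C arithmetic.
 * The volatile objects force the product to be rounded to float. */
static float square_of_bits(uint32_t bits)
{
    volatile float v = float_from_bits(bits);
    volatile float sq = v * v;
    return sq;
}
```

On a platform that keeps subnormals, square_of_bits(0x9abc3273) has the bit pattern 0x4, which matches xy_c in the first case because the remaining products round to zero.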
Thanks,
Linfeng
On Tue, Jun 6, 2017 at 12:03 AM, Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >>> Linfeng Zhang <linfengz at google.com> wrote on 06.06.2017 at 06:46 in
> message
> <CAKoqLCAfj+fDUMLfN4dLNSZ4NNAZpaSt_BWZRp+7XBqfhiSqiQ at mail.gmail.com>:
> > Hi Jean-Marc,
> >
> > I tried "==" before, and it failed when both results are 0.0. Maybe the
> > exponent or sign differs because of the different 0.0 representation
> > in NEON. If anybody knows how to handle this 0.0 comparison, that would be
> > great.
> > Or just use if(a==b || (a==0.0 && b==0.0)) ... but I haven't tried this.
>
>
> From some faint memory of my math lessons, I once wrote code like this
> to get the smallest floating-point number different from zero:
>
> double EPS; /* smallest number not equal to 0.0 */
>
> /* refined estimate of EPS */
> static double get_EPS(double eps)
> {
>
> while ( 1.0 + eps != 1.0 )
> eps /= 2;
> return(eps);
> }
>
> EPS = get_EPS(1.0);
>
> On the x86_64 platform I get:
> (gdb) p EPS
> $1 = 1.1102230246251565e-16
>
> Maybe it can help...
>
> Regards,
> Ulrich
>
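A side note on the snippet above, as a hedged sketch: the loop stops at the first eps for which 1.0 + eps rounds back to 1.0, so it measures the rounding granularity near 1.0 rather than the smallest nonzero double. Assuming IEEE 754 doubles and round-to-nearest, the result is DBL_EPSILON / 2 (2^-53, about 1.11e-16), which agrees with the gdb output, while the smallest positive normal double is DBL_MIN (about 2.2e-308). The function name halve_until_absorbed() is hypothetical.

```c
#include <float.h>

/* Same loop as get_EPS(1.0) above, restated as a self-contained function.
 * The volatile store forces each sum to be rounded to double before the
 * comparison, even on targets that keep extra precision in registers. */
static double halve_until_absorbed(void)
{
    double eps = 1.0;
    volatile double sum = 1.0 + eps;

    while (sum != 1.0) {
        eps /= 2;
        sum = 1.0 + eps;
    }
    return eps;
}
```

Because the value found this way is tied to the magnitude of 1.0, it is not a suitable tolerance for inner products that can be either far larger or far smaller than 1.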
> >
> > Thanks,
> > Linfeng
> >
> > On Mon, Jun 5, 2017 at 8:43 PM Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> >
> >> Hi Linfeng,
> >>
> >> On 05/06/17 03:31 PM, Linfeng Zhang wrote:
> >> > Yes, we'll have one more patch set related to xcorr next week. Please
> >> > don't wait if that's too late for the 1.2 release.
> >>
> >> Assuming there's no issue with the patches, next week isn't too late.
> >>
> >> Also, I've started looking at your patches. So far there's one thing
> >> that puzzles me a bit. In the OPUS_CHECK_ASM section of patch 0004, you
> >> have:
> >>
> >> + celt_assert(ABS32(xy1_c - *xy1) <= VERY_SMALL);
> >>
> >> Given the normal range of the values (the xy values are often much
> >> larger than one) and the precision involved here (24-bit mantissa), it
> >> seems like this test can only succeed if the two values are actually
> >> equal. Is the float patch actually bit-exact? If so, then maybe you
> >> should be using actual equality. If not, then I guess we need to find
> >> the right condition (which isn't obvious for floating point).
> >>
> >> Cheers,
> >>
> >> Jean-Marc
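To illustrate the point about the absolute tolerance, here is a small sketch. It assumes VERY_SMALL is on the order of 1e-30 in the floating-point build; the macro ASSUMED_VERY_SMALL and both helper functions are hypothetical stand-ins, not Opus code.

```c
#include <stdint.h>
#include <string.h>

/* Assumption for illustration only: a tolerance of the same order as
 * VERY_SMALL in a floating-point build. */
#define ASSUMED_VERY_SMALL 1e-30f

/* Returns the next representable float above a positive, finite v by
 * incrementing its bit pattern. */
static float next_float_up(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits += 1;
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* Absolute-difference test of the same form as ABS32(a - b) <= tol. */
static int within_abs_tolerance(float a, float b, float tol)
{
    float diff = a > b ? a - b : b - a;
    return diff <= tol;
}
```

For a = 1000.0f, the closest larger float is about 6.1e-5 away, so within_abs_tolerance(a, next_float_up(a), ASSUMED_VERY_SMALL) is false. With values of that size, the test passes only when the two results are identical.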
> >>
> >>
> >> > Thanks,
> >> > Linfeng
> >> >
> >> > On Mon, Jun 5, 2017 at 12:28 PM, Linfeng Zhang <linfengz at google.com> wrote:
> >> >
> >> > Hi Jean-Marc,
> >> >
> >> > I attached the new version in inner_prod_5patches_v2.zip which
> >> > synced to the current master.
> >> >
> >> > For fixed-point ARM, only
> >> > 0003-Optimize-fixed-point-celt_inner_prod-and-dual_inner_.patch
> >> > changes the performance.
> >> > For floating-point ARM, only
> >> > 0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.patch
> >> > changes the performance.
> >> > Patches 1 and 2 are code clean-up and can only affect x86 performance.
> >> > Patch 5 has a negligible effect on floating-point ARM performance.
> >> >
> >> > Thanks,
> >> > Linfeng
> >> >
> >> > On Fri, Jun 2, 2017 at 11:26 AM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> >> >
> >> > Hi Linfeng,
> >> >
> >> > I'll look into your patches. Can you let me know what's the expected
> >> > effect on performance (if any) for each of your patches? Also, are these
> >> > all the patches you intend to merge for 1.2 or are there more
> >> > upcoming ones?
> >> >
> >> > Cheers,
> >> >
> >> > Jean-Marc
> >> >
> >> > On 01/06/17 06:33 PM, Linfeng Zhang wrote:
> >> > > Hi,
> >> > >
> >> > > Attached are 5 patches related to celt_inner_prod()
> >> > > and dual_inner_prod() NEON intrinsics optimization.
> >> > >
> >> > > In 0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.patch, the
> >> > > optimization changed the order of floating-point operations in the
> >> > > inner products, which will change the results. I created
> >> > > celt_inner_prod_neon_float_c_simulation() and
> >> > > dual_inner_prod_neon_float_c_simulation() to simulate the order of
> >> > > floating-point operations in the NEON optimization and compare their
> >> > > results. Sorry that I cannot bound the distance between the original C
> >> > > function and the NEON function by any given reasonably small number or
> >> > > ratio. It's easy to create an input for which 0 and 1,000 are both
> >> > > correct results just by manipulating the inner product order.
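The claim that 0 and 1,000 can both be correct is easy to reproduce. The sketch below is not the Opus NEON code; inner_prod_four_lanes() is a hypothetical stand-in that accumulates into four partial sums the way a 4-lane SIMD loop typically would, and it assumes float arithmetic is evaluated in single precision.

```c
#include <stddef.h>

/* Plain left-to-right inner product, accumulated in float. */
static float inner_prod_sequential(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Hypothetical 4-lane ordering: element i is added to partial sum i % 4,
 * and the four partial sums are combined at the end. Leftover elements
 * (when n is not a multiple of 4) are added to the first partial sum. */
static float inner_prod_four_lanes(const float *x, const float *y, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        lane[0] += x[i]     * y[i];
        lane[1] += x[i + 1] * y[i + 1];
        lane[2] += x[i + 2] * y[i + 2];
        lane[3] += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)
        lane[0] += x[i] * y[i];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

With x = {1e20f, 1000.0f, 0, 0, -1e20f, 0, 0, 0} and y all ones, the sequential order absorbs the 1000 into 1e20f before the cancellation and returns 0, while the 4-lane order cancels 1e20f within one lane first and returns 1000.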
> >> > >
> >> > > The total speed gain is about 1.0% for the fixed-point encoder and
> >> > > 1.8% for the floating-point encoder at complexity 8, tested on my
> >> > > Chromebook.
> >> > >
> >> > > Thanks,
> >> > > Linfeng
> >> > >
> >> > >