[opus] opus Digest, Vol 53, Issue 2

Cliff Parris cliff at espico.com
Mon Jun 10 02:36:34 PDT 2013


Hi All,

Regarding cycle measurements for ARM/NEON,

ARM no longer provide cycle-accurate simulators. The method we use is to
make measurements on hardware via the PMU on the core itself. Note that
if you're running under Linux you may not be 'allowed' to access the PMU
directly, but you can access it via system calls. Typically you will need to
make a series of measurements and average them.
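
As a rough illustration of the system-call route, here is a minimal sketch that reads the CPU cycle counter through the Linux perf_event_open interface and averages several readings. The helper names (cycles_open, cycles_measure, cycles_mean, busy_work) are made up for this example, and whether the kernel grants access depends on the system's perf settings, so treat it as a starting point rather than a tested measurement setup.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Hypothetical helper: open a user-space CPU cycle counter for the
   calling thread. Returns a file descriptor, or -1 if the kernel
   refuses access or the counter is unavailable. */
static int cycles_open(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        /* start stopped, enable around the code */
    attr.exclude_kernel = 1;  /* count user-space cycles only */
    attr.exclude_hv = 1;
    /* pid = 0 (this thread), cpu = -1 (any), no group, no flags */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Hypothetical helper: count the cycles spent in one call of fn(arg).
   Returns 0 on success and -1 on failure. */
static int cycles_measure(int fd, void (*fn)(void *), void *arg,
                          uint64_t *out)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) < 0) return -1;
    if (ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) < 0) return -1;
    fn(arg);
    if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) < 0) return -1;
    return read(fd, out, sizeof(*out)) == (ssize_t)sizeof(*out) ? 0 : -1;
}

/* Average a series of cycle counts. */
static double cycles_mean(const uint64_t *samples, int n)
{
    double total = 0.0;
    int i;
    assert(n > 0);
    for (i = 0; i < n; i++)
        total += (double)samples[i];
    return total / n;
}

/* Placeholder workload; replace with the routine under test. */
static void busy_work(void *arg)
{
    volatile unsigned acc = 0;
    unsigned i, n = *(unsigned *)arg;
    for (i = 0; i < n; i++)
        acc += i;
}
```

The intended use is to call cycles_open once, run cycles_measure in a loop over the routine of interest, store each count, and pass the array to cycles_mean. Discarding the first few runs before averaging is a common choice, since they tend to include cache warm-up.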

Re intrinsics, I believe that GCC and ARM's own compiler are not compatible.
We write directly in ASM since typically neither compiler does what you want.

Cliff

-----Original Message----- 
From: opus-request at xiph.org
Sent: Saturday, June 08, 2013 3:54 AM
To: opus at xiph.org
Subject: opus Digest, Vol 53, Issue 2

Send opus mailing list submissions to
opus at xiph.org

To subscribe or unsubscribe via the World Wide Web, visit
http://lists.xiph.org/mailman/listinfo/opus
or, via email, send a message with subject or body 'help' to
opus-request at xiph.org

You can reach the person managing the list at
opus-owner at xiph.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of opus digest..."


Today's Topics:

   1. Re: Bug fix in celt_lpc.c and some xcorr_kernel optimizations
      (John Ridges)
   2. Invitation to connect on LinkedIn (casey guan)
   3. Invitation to connect on LinkedIn (casey guan)


----------------------------------------------------------------------

Message: 1
Date: Fri, 07 Jun 2013 16:50:48 -0600
From: John Ridges <jridges at masque.com>
Subject: Re: [opus] Bug fix in celt_lpc.c and some xcorr_kernel
optimizations
To: Jean-Marc Valin <jmvalin at jmvalin.ca>
Cc: opus at xiph.org
Message-ID: <51B263C8.5060203 at masque.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Unfortunately I don't have a setup that lets me easily profile ARM code,
so I really can't tell which method is faster (though I suspect Mr.
Zanelli's code is). Let me offer up another intrinsic version of the
NEON xcorr_kernel that is almost identical to the SSE version, and more
in line with Mr. Zanelli's code:

#include <arm_neon.h>

/* Accumulates sum[k] += x[j]*y[j+k] for k = 0..3 and j = 0..len-1.
   Uses 16-bit samples with 32-bit sums (the fixed-point types), and the
   vector loads read y up to index len+2. */
static inline void xcorr_kernel_neon(const opus_val16 *x,
     const opus_val16 *y, opus_val32 sum[4], int len)
{
     int j;
     /* Two partial sums, combined at the end. */
     int32x4_t xsum1 = vld1q_s32(sum);
     int32x4_t xsum2 = vdupq_n_s32(0);

     /* Main loop: four x samples per iteration. */
     for (j = 0; j < len-3; j += 4) {
         int16x4_t x0 = vld1_s16(x+j);
         int16x4_t y0 = vld1_s16(y+j);       /* y[j..j+3] */
         int16x4_t y3 = vld1_s16(y+j+3);     /* y[j+3..j+6] */
         int16x4_t y4 = vext_s16(y3,y3,1);   /* y[j+4],y[j+5],y[j+6],y[j+3] */

         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,0),y0);
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,1),vext_s16(y0,y4,1));
         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,2),vext_s16(y0,y4,2));
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,3),y3);
     }
     /* Tail: up to three remaining x samples. */
     if (j < len) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
         if (++j < len) {
             xsum2 = vmlal_s16(xsum2,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             if (++j < len) {
                 xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             }
         }
     }
     vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}
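
For checking a vectorized kernel like the one above against known-good output, a plain scalar version of the same accumulation may be handy. The sketch below assumes the fixed-point types map to int16_t and int32_t; the name xcorr_kernel_ref is hypothetical and is not a function in the Opus tree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference: sum[k] += x[j]*y[j+k] for k = 0..3.
   y must provide at least len+3 readable samples. */
static void xcorr_kernel_ref(const int16_t *x, const int16_t *y,
                             int32_t sum[4], int len)
{
    int j, k;
    assert(len > 0);
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            /* widen before multiplying so the product is 32-bit */
            sum[k] += (int32_t)x[j] * y[j + k];
        }
    }
}
```

Running both versions on the same random x, y and initial sum and comparing the four outputs is one way to catch indexing mistakes in the vext/vdup shuffles.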

Whether or not this version is faster than the first version I submitted
probably depends a lot on how fast unaligned memory vector accesses are
on an ARM processor. Of course hand-coded assembly would be even faster
than using intrinsics (for instance the "vdup_lane_s16"s wouldn't be
needed), but in this case it could be that the multiply-add stalls swamp
most of the inefficiencies in the intrinsic code. It would be cool if
someone out there has a setup that would let us definitively find out
which is fastest and by how much.

If the hit from using intrinsics isn't too bad, I would prefer them
since they are compatible with I think nearly all ARM compilers (and in
truth I also prefer using intrinsics for NEON code because I'm just not
that familiar with ARM assembly).

Cheers,
--John


On 6/7/2013 12:51 PM, Jean-Marc Valin wrote:
> On 06/07/2013 02:33 PM, John Ridges wrote:
>> I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned
>> assembly is bound to be faster than using intrinsics.
> I was mostly curious about comparing vectorization approaches (assuming
> the two are different) than exact code.
>
>> However I notice
>> that his code can also read past the y buffer.
> Yeah we'd need to either fix this or make sure that we add some padding
> to the buffers. In practice it's unlikely to even trigger valgrind (it's
> on the stack and the uninitialized data ends up being discarded), but
> it's definitely not clean and could come back and bite us later.
>
> Cheers,
>
> Jean-Marc
>
>> Cheers,
>> --John
>>
>>
>> On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
>>> Hi John,
>>>
>>> Thanks for the two fixes. They're in git now. Your SSE version seems to
>>> also be slightly faster than mine -- probably due to the partial sums.
>>> As for the NEON code, it would be good to compare the performance with
>>> the code Aurélien Zanelli posted at
>>> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>>>
>>>
>>> Cheers,
>>>
>>>      Jean-Marc
>>>
>>>
>>
>>
>




------------------------------




End of opus Digest, Vol 53, Issue 2
***********************************