[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

John Ridges jridges at masque.com
Fri Jun 7 15:50:48 PDT 2013

Unfortunately I don't have a setup that lets me easily profile ARM code, 
so I really can't tell which method is faster (though I suspect Mr. 
Zanelli's code is). Let me offer up another intrinsic version of the 
NEON xcorr_kernel that is almost identical to the SSE version, and more 
in line with Mr. Zanelli's code:

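static inline void xcorr_kernel_neon(const opus_val16 *x, const opus_val16 *y,
                                     opus_val32 sum[4], int len)
{
     int j;
     /* Two accumulators shorten the multiply-accumulate dependency chain. */
     int32x4_t xsum1 = vld1q_s32(sum);
     int32x4_t xsum2 = vdupq_n_s32(0);

     /* Main loop: four x samples per iteration. */
     for (j = 0; j < len-3; j += 4) {
         int16x4_t x0 = vld1_s16(x+j);
         int16x4_t y0 = vld1_s16(y+j);        /* y[j..j+3] */
         int16x4_t y3 = vld1_s16(y+j+3);      /* y[j+3..j+6] */
         int16x4_t y4 = vext_s16(y3,y3,1);    /* y[j+4],y[j+5],y[j+6],y[j+3] */

         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,0),y0);
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,1),vext_s16(y0,y4,1));
         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,2),vext_s16(y0,y4,2));
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,3),y3);
     }
     /* Tail: up to three remaining x samples. */
     if (j < len) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
         if (++j < len) {
             xsum2 = vmlal_s16(xsum2,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             if (++j < len) {
                 xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             }
         }
     }
     /* Assumed final step (not present in the archived text): combine the
        two partial sums and store them back to sum. */
     vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}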
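
For checking any vectorized version, a plain C reference of what the kernel is meant to compute may help. This is only a sketch: the name xcorr_kernel_ref is hypothetical, and the typedefs assume the fixed-point build, where opus_val16 and opus_val32 are 16-bit and 32-bit integers (Opus defines these itself).

```c
#include <stdint.h>

/* Assumed fixed-point typedefs; Opus provides its own definitions. */
typedef int16_t opus_val16;
typedef int32_t opus_val32;

/* Hypothetical scalar reference, not part of Opus:
   sum[k] += x[j] * y[j+k] for k = 0..3 and j = 0..len-1.
   It reads y[0] .. y[len+2], so y must hold at least len+3 samples. */
static void xcorr_kernel_ref(const opus_val16 *x, const opus_val16 *y,
                             opus_val32 sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            sum[k] += (opus_val32)x[j] * y[j + k];
        }
    }
}
```

Running a NEON version and this reference on the same inputs and comparing the four sums is a quick way to catch indexing mistakes in the vext_s16 shuffles, and the len+3 requirement on y makes the buffer-size assumption explicit.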

Whether this version is faster than the first version I submitted
probably depends a lot on how fast unaligned vector memory accesses are
on an ARM processor (the load from y+j+3 cannot be 8-byte aligned when
y+j is). Of course hand-coded assembly would be even faster than using
intrinsics (for instance, the "vdup_lane_s16"s wouldn't be needed), but
in this case it could be that the multiply-add stalls swamp most of the
inefficiencies in the intrinsic code. It would be cool if someone out
there has a setup that would let us definitively find out which is
fastest and by how much.
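
A rough timing harness along the following lines could be used for that comparison on real hardware. It is a sketch under stated assumptions: time_kernel, xcorr_fn, and xcorr_kernel_scalar are hypothetical names that are not part of Opus, the typedefs assume the fixed-point build, and clock() only measures CPU time coarsely, so meaningful numbers need many iterations on an otherwise idle machine.

```c
#include <stdint.h>
#include <time.h>

/* Assumed fixed-point typedefs; Opus provides its own definitions. */
typedef int16_t opus_val16;
typedef int32_t opus_val32;

/* Hypothetical common signature for the kernels under test. */
typedef void (*xcorr_fn)(const opus_val16 *x, const opus_val16 *y,
                         opus_val32 sum[4], int len);

/* Scalar baseline; y must hold at least len+3 samples. */
static void xcorr_kernel_scalar(const opus_val16 *x, const opus_val16 *y,
                                opus_val32 sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++)
        for (k = 0; k < 4; k++)
            sum[k] += (opus_val32)x[j] * y[j + k];
}

/* Calls fn iters times and returns the elapsed CPU time in seconds.
   The checksum accumulates sum[0] from every call so the compiler
   cannot discard the kernel calls as dead code. */
static double time_kernel(xcorr_fn fn, const opus_val16 *x,
                          const opus_val16 *y, int len, int iters,
                          uint32_t *checksum)
{
    uint32_t acc = 0;
    clock_t t0, t1;
    int i;

    t0 = clock();
    for (i = 0; i < iters; i++) {
        opus_val32 sum[4] = {0, 0, 0, 0};
        fn(x, y, sum, len);
        acc += (uint32_t)sum[0];
    }
    t1 = clock();

    *checksum = acc;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Passing each candidate kernel through time_kernel with the same buffers, and checking that the checksums agree, gives a first-order answer to both "is it correct" and "is it faster".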

If the hit from using intrinsics isn't too bad, I would prefer them,
since they are compatible with (I think) nearly all ARM compilers. In
truth, I also prefer intrinsics for NEON code because I'm just not
that familiar with ARM assembly.


On 6/7/2013 12:51 PM, Jean-Marc Valin wrote:
> On 06/07/2013 02:33 PM, John Ridges wrote:
>> I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned
>> assembly is bound to be faster than using intrinsics.
> I was more interested in comparing the vectorization approaches
> (assuming the two are different) than the exact code.
>> However I notice
>> that his code can also read past the y buffer.
> Yeah we'd need to either fix this or make sure that we add some padding
> to the buffers. In practice it's unlikely to even trigger valgrind (it's
> on the stack and the uninitialized data ends up being discarded), but
> it's definitely not clean and could come back and bite us later.
> Cheers,
> 	Jean-Marc
>> Cheers,
>> --John
>> On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
>>> Hi John,
>>> Thanks for the two fixes. They're in git now. Your SSE version seems to
>>> also be slightly faster than mine -- probably due to the partial sums.
>>> As for the NEON code, it would be good to compare the performance with
>>> the code Aurélien Zanelli posted at
>>> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>>> Cheers,
>>>      Jean-Marc
