[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations

Fri Jun 7 15:50:48 PDT 2013

Unfortunately I don't have a setup that lets me easily profile ARM code, 
so I really can't tell which method is faster (though I suspect Mr. 
Zanelli's code is). Let me offer up another intrinsic version of the 
NEON xcorr_kernel that is almost identical to the SSE version, and more 
in line with Mr. Zanelli's code:

#include <arm_neon.h>

static inline void xcorr_kernel_neon(const opus_val16 *x,
     const opus_val16 *y, opus_val32 sum[4], int len)
{
     int j;
     int32x4_t xsum1 = vld1q_s32(sum);
     int32x4_t xsum2 = vdupq_n_s32(0);

     for (j = 0; j < len-3; j += 4) {
         int16x4_t x0 = vld1_s16(x+j);      /* x[j..j+3] */
         int16x4_t y0 = vld1_s16(y+j);      /* y[j..j+3] */
         int16x4_t y3 = vld1_s16(y+j+3);    /* y[j+3..j+6] */
         int16x4_t y4 = vext_s16(y3,y3,1);  /* y[j+4],y[j+5],y[j+6],y[j+3] */

         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,0),y0);
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,1),vext_s16(y0,y4,1));
         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,2),vext_s16(y0,y4,2));
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,3),y3);
     }
     if (j < len) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
         if (++j < len) {
             xsum2 = vmlal_s16(xsum2,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             if (++j < len) {
                 xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             }
         }
     }
     vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}
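
For anyone who wants to check the intrinsic version against plain C, here is a minimal scalar sketch of what I understand the kernel to compute: sum[k] += x[j]*y[j+k] for k = 0..3. The name xcorr_kernel_ref is made up for illustration, and the typedefs assume a fixed-point build where opus_val16 and opus_val32 are 16-bit and 32-bit integers. Note that it reads y[0..len+2], so y has to hold at least len+3 samples.

```c
#include <stdint.h>

/* Assumption: fixed-point build, so the Opus sample types are plain ints. */
typedef int16_t opus_val16;
typedef int32_t opus_val32;

/* Hypothetical scalar reference (not from the Opus tree):
   accumulates four adjacent correlation lags into sum[0..3].
   Reads x[0..len-1] and y[0..len+2]. */
static void xcorr_kernel_ref(const opus_val16 *x, const opus_val16 *y,
                             opus_val32 sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            /* widen before multiplying, as vmlal_s16 does */
            sum[k] += (opus_val32)x[j] * (opus_val32)y[j + k];
        }
    }
}
```

Running both versions on the same x, y, and initial sum should give identical results, as long as the accumulators do not overflow.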

Whether or not this version is faster than the first version I submitted 
probably depends a lot on how fast unaligned memory vector accesses are 
on an ARM processor. Of course hand-coded assembly would be even faster 
than using intrinsics (for instance the "vdup_lane_s16"s wouldn't be 
needed), but in this case it could be that the multiply-add stalls swamp 
most of the inefficiencies in the intrinsic code. It would be cool if 
someone out there has a setup that would let us definitively find out 
which is fastest and by how much.
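
A rough way to settle it would be a small timing harness along these lines. This is only a sketch: xcorr_fn, time_kernel, and the scalar stand-in are hypothetical names, and the sample types are assumed to be int16_t/int32_t (fixed-point build). On an ARM target one would pass xcorr_kernel_neon and the other candidates in place of the stand-in. clock() has coarse resolution, so reps needs to be large, and an aggressive optimizer may still hoist work out of the loop, so the generated code is worth a look.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical common signature for the kernels being compared. */
typedef void (*xcorr_fn)(const int16_t *x, const int16_t *y,
                         int32_t sum[4], int len);

/* Scalar stand-in so the sketch runs on any machine. */
static void xcorr_kernel_scalar(const int16_t *x, const int16_t *y,
                                int32_t sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++)
        for (k = 0; k < 4; k++)
            sum[k] += (int32_t)x[j] * (int32_t)y[j + k];
}

/* Calls fn reps times and returns the elapsed CPU time in seconds.
   The checksum is returned so the calls cannot be discarded as dead code. */
static double time_kernel(xcorr_fn fn, const int16_t *x, const int16_t *y,
                          int len, int reps, uint32_t *checksum)
{
    uint32_t acc = 0;
    clock_t t0, t1;
    int r;

    t0 = clock();
    for (r = 0; r < reps; r++) {
        int32_t sum[4] = {0, 0, 0, 0};
        fn(x, y, sum, len);
        /* unsigned arithmetic so wraparound is well defined */
        acc += (uint32_t)sum[0] + (uint32_t)sum[1]
             + (uint32_t)sum[2] + (uint32_t)sum[3];
    }
    t1 = clock();

    *checksum = acc;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Comparing the checksums from each kernel also gives a cheap sanity check that the versions being timed agree.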

If the hit from using intrinsics isn't too bad, I would prefer them,
since I think they are compatible with nearly all ARM compilers (and, in
truth, I also prefer intrinsics for NEON code because I'm just not that
familiar with ARM assembly).

Cheers,
--John

On 6/7/2013 12:51 PM, Jean-Marc Valin wrote:
> On 06/07/2013 02:33 PM, John Ridges wrote:
>> I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned
>> assembly is bound to be faster than using intrinsics.
> I was more curious about comparing vectorization approaches (assuming
> the two are different) than about the exact code.
>
>> However I notice
>> that his code can also read past the y buffer.
> Yeah we'd need to either fix this or make sure that we add some padding
> to the buffers. In practice it's unlikely to even trigger valgrind (it's
> on the stack and the uninitialized data ends up being discarded), but
> it's definitely not clean and could come back and bite us later.
>
> Cheers,
>
> 	Jean-Marc
>
>> Cheers,
>> --John
>>
>>
>> On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
>>> Hi John,
>>>
>>> Thanks for the two fixes. They're in git now. Your SSE version seems to
>>> also be slightly faster than mine -- probably due to the partial sums.
>>> As for the NEON code, it would be good to compare the performance with
>>> the code Aurélien Zanelli posted at
>>> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>>>
>>>
>>> Cheers,
>>>
>>>      Jean-Marc
>>>
>>>
>>
>>
>