[opus] Bug fix in celt_lpc.c and some xcorr_kernel optimizations
John Ridges
jridges at masque.com
Fri Jun 7 15:50:48 PDT 2013
Unfortunately I don't have a setup that lets me easily profile ARM code,
so I really can't tell which method is faster (though I suspect Mr.
Zanelli's code is). Let me offer up another intrinsic version of the
NEON xcorr_kernel that is almost identical to the SSE version, and more
in line with Mr. Zanelli's code:
static inline void xcorr_kernel_neon(const opus_val16 *x, const opus_val16 *y,
                                     opus_val32 sum[4], int len)
{
   int j;
   /* Two accumulators keep partial sums, which helps hide the latency of
      the multiply-accumulates. */
   int32x4_t xsum1 = vld1q_s32(sum);
   int32x4_t xsum2 = vdupq_n_s32(0);

   for (j = 0; j < len-3; j += 4) {
      int16x4_t x0 = vld1_s16(x+j);
      int16x4_t y0 = vld1_s16(y+j);
      int16x4_t y3 = vld1_s16(y+j+3);
      /* y4 = {y[j+4], y[j+5], y[j+6], y[j+3]}; rotating y3 lets vext_s16
         build the y+j+1 and y+j+2 windows without more unaligned loads. */
      int16x4_t y4 = vext_s16(y3, y3, 1);
      xsum1 = vmlal_s16(xsum1, vdup_lane_s16(x0, 0), y0);
      xsum2 = vmlal_s16(xsum2, vdup_lane_s16(x0, 1), vext_s16(y0, y4, 1));
      xsum1 = vmlal_s16(xsum1, vdup_lane_s16(x0, 2), vext_s16(y0, y4, 2));
      xsum2 = vmlal_s16(xsum2, vdup_lane_s16(x0, 3), y3);
   }
   /* Handle up to 3 leftover samples, one x value at a time. */
   if (j < len) {
      xsum1 = vmlal_s16(xsum1, vdup_n_s16(*(x+j)), vld1_s16(y+j));
      if (++j < len) {
         xsum2 = vmlal_s16(xsum2, vdup_n_s16(*(x+j)), vld1_s16(y+j));
         if (++j < len) {
            xsum1 = vmlal_s16(xsum1, vdup_n_s16(*(x+j)), vld1_s16(y+j));
         }
      }
   }
   vst1q_s32(sum, vaddq_s32(xsum1, xsum2));
}
Whether or not this version is faster than the first version I submitted
probably depends a lot on how fast unaligned vector memory accesses are
on an ARM processor. Of course, hand-coded assembly would be even faster
than using intrinsics (for instance, the "vdup_lane_s16"s wouldn't be
needed), but in this case it could be that the multiply-add stalls swamp
most of the inefficiencies in the intrinsic code. It would be cool if
someone out there had a setup that would let us definitively find out
which version is fastest, and by how much.
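As an aside, even with intrinsics the explicit dups can be avoided by
using the by-lane form, vmlal_lane_s16, which should map to the scalar
variant of VMLAL; whether a particular compiler really emits it that way
would have to be checked in the disassembly. A minimal sketch of the idea
(the helper name is just for illustration, not a drop-in for the loop
above):

#include <arm_neon.h>

/* Sketch only: one multiply-accumulate step written with the by-lane
   intrinsic, so no explicit vdup_lane_s16 is needed. */
static inline int32x4_t xcorr_step_lane(int32x4_t acc, int16x4_t yv, int16x4_t x0)
{
   return vmlal_lane_s16(acc, yv, x0, 0);  /* acc[i] += yv[i] * x0[0] */
}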
If the hit from using intrinsics isn't too bad, I would prefer them,
since they are compatible with (I think) nearly all ARM compilers. In
truth I also prefer intrinsics for NEON code because I'm just not that
familiar with ARM assembly.
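For what it's worth, the compatibility story usually comes down to a
simple preprocessor guard: GCC and Clang define __ARM_NEON__ when NEON is
enabled (newer ACLE compilers also define __ARM_NEON). How the guarded
kernel would get hooked into celt/pitch.h is left out of this sketch:

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
/* ... xcorr_kernel_neon() from above would go here ... */
#endif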
Cheers,
--John
On 6/7/2013 12:51 PM, Jean-Marc Valin wrote:
> On 06/07/2013 02:33 PM, John Ridges wrote:
>> I have no doubt that Mr. Zanelli's NEON code is faster, since hand tuned
>> assembly is bound to be faster than using intrinsics.
> I was mostly curious about comparing the vectorization approaches (assuming
> the two are different) rather than the exact code.
>
>> However I notice
>> that his code can also read past the y buffer.
> Yeah we'd need to either fix this or make sure that we add some padding
> to the buffers. In practice it's unlikely to even trigger valgrind (it's
> on the stack and the uninitialized data ends up being discarded), but
> it's definitely not clean and could come back and bite us later.
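For the padding option, one way to do it would be to stage y in a
slightly larger zeroed scratch buffer before calling the kernel. Just a
sketch; the helper name and the 3-sample pad are only illustrative, and
the right amount of slack depends on the widest load in whichever kernel
we end up using:

#include <string.h>

/* Illustrative only: copy y into a scratch buffer that the caller has
   sized with 3 extra entries, and zero the pad so a 4-wide vector load
   starting at any valid offset stays inside the allocation. */
static void xcorr_y_scratch(opus_val16 *scratch, const opus_val16 *y, int n)
{
   memcpy(scratch, y, n*sizeof(*scratch));
   scratch[n] = scratch[n+1] = scratch[n+2] = 0;
}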
>
> Cheers,
>
> Jean-Marc
>
>> Cheers,
>> --John
>>
>>
>> On 6/6/2013 9:22 PM, Jean-Marc Valin wrote:
>>> Hi John,
>>>
>>> Thanks for the two fixes. They're in git now. Your SSE version seems to
>>> also be slightly faster than mine -- probably due to the partial sums.
>>> As for the NEON code, it would be good to compare the performance with
>>> the code Aurélien Zanelli posted at
>>> http://darkosphere.fr/public/0002-Add-optimized-NEON-version-of-celt_fir-celt_iir-and-.patch
>>>
>>>
>>> Cheers,
>>>
>>> Jean-Marc
>>>
>>>
>>
>>
>