Unfortunately I don't have a setup that lets me easily profile ARM code, 
so I really can't tell which method is faster (though I suspect Mr. 
Zanelli's code is). Let me offer up another intrinsic version of the 
NEON xcorr_kernel that is almost identical to the SSE version, and more 
in line with Mr. Zanelli's code:

static inline void xcorr_kernel_neon(const opus_val16 *x,
     const opus_val16 *y, opus_val32 sum[4], int len)
{
     int j;
     int32x4_t xsum1 = vld1q_s32(sum);
     int32x4_t xsum2 = vdupq_n_s32(0);

     for (j = 0; j < len-3; j += 4) {
         int16x4_t x0 = vld1_s16(x+j);
         int16x4_t y0 = vld1_s16(y+j);      /* y[j..j+3] */
         int16x4_t y3 = vld1_s16(y+j+3);    /* y[j+3..j+6] */
         int16x4_t y4 = vext_s16(y3,y3,1);  /* y[j+4..j+6], y[j+3] */

         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,0),y0);
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,1),vext_s16(y0,y4,1));
         xsum1 = vmlal_s16(xsum1,vdup_lane_s16(x0,2),vext_s16(y0,y4,2));
         xsum2 = vmlal_s16(xsum2,vdup_lane_s16(x0,3),y3);
     }
     /* Handle the up to three samples left over after the main loop */
     if (j < len) {
         xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
         if (++j < len) {
             xsum2 = vmlal_s16(xsum2,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             if (++j < len) {
                 xsum1 = vmlal_s16(xsum1,vdup_n_s16(*(x+j)),vld1_s16(y+j));
             }
         }
     }
     /* Combine the two accumulators and write the result back */
     vst1q_s32(sum,vaddq_s32(xsum1,xsum2));
}
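
For anyone who wants to check a NEON build against known-good output, here is a minimal plain-C sketch of what the kernel is meant to compute: sum[k] += x[j]*y[j+k] for k = 0..3. The name xcorr_kernel_ref and the two typedefs are assumptions made for illustration (the real fixed-point types come from the Opus headers), and this is not the reference code from the Opus tree.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point typedefs, standing in for the Opus definitions. */
typedef int16_t opus_val16;
typedef int32_t opus_val32;

/* Hypothetical scalar reference: accumulates four adjacent
 * cross-correlation lags into sum[0..3].
 * Like the NEON version, it reads y[0] through y[len+2], so y must
 * hold at least len+3 samples. */
static void xcorr_kernel_ref(const opus_val16 *x, const opus_val16 *y,
                             opus_val32 sum[4], int len)
{
    int j, k;
    assert(len >= 0);
    for (j = 0; j < len; j++) {
        for (k = 0; k < 4; k++) {
            sum[k] += (opus_val32)x[j] * y[j + k];
        }
    }
}
```

With x = {1,2,3,4,5}, y = {1,2,3,4,5,6,7,8}, len = 5 and sum starting at zero, this leaves sum = {55, 70, 85, 100}. A length that is not a multiple of 4 is a useful test case because it exercises the leftover-sample path as well as the main loop.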

Whether or not this version is faster than the first version I submitted 
probably depends a lot on how fast unaligned memory vector accesses are 
on an ARM processor. Of course hand-coded assembly would be even faster 
than using intrinsics (for instance the "vdup_lane_s16"s wouldn't be 
needed), but in this case it could be that the multiply-add stalls swamp 
most of the inefficiencies in the intrinsic code. It would be cool if 
someone out there has a setup that would let us definitively find out 
which is fastest and by how much.
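
In case it helps whoever does have such a setup, here is a rough sketch of a timing harness that could be pointed at each candidate kernel in turn. Everything in it is an assumption for illustration: the names time_xcorr_kernel, xcorr_kernel_fn and xcorr_kernel_plain are made up, the typedefs stand in for the Opus ones, and clock() is a coarse timer, so a serious comparison would want many iterations, realistic buffer sizes, and probably a cycle counter.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumed fixed-point typedefs, standing in for the Opus definitions. */
typedef int16_t opus_val16;
typedef int32_t opus_val32;

/* Any kernel with the xcorr_kernel signature can be timed. */
typedef void (*xcorr_kernel_fn)(const opus_val16 *x, const opus_val16 *y,
                                opus_val32 sum[4], int len);

/* Stand-in scalar kernel so the harness compiles on its own;
 * swap in the NEON versions on an ARM target. */
static void xcorr_kernel_plain(const opus_val16 *x, const opus_val16 *y,
                               opus_val32 sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++)
        for (k = 0; k < 4; k++)
            sum[k] += (opus_val32)x[j] * y[j + k];
}

/* Calls fn `iters` times and returns the elapsed CPU time in seconds.
 * *check receives the running total of sum[0] (mod 2^32) so the work
 * cannot be discarded and the kernels can be compared for agreement. */
static double time_xcorr_kernel(xcorr_kernel_fn fn, const opus_val16 *x,
                                const opus_val16 *y, int len, long iters,
                                uint32_t *check)
{
    /* Volatile pointer: forces a real call on every iteration. */
    xcorr_kernel_fn volatile call = fn;
    uint32_t acc = 0;
    clock_t t0, t1;
    long i;

    assert(iters > 0);
    t0 = clock();
    for (i = 0; i < iters; i++) {
        opus_val32 sum[4] = {0, 0, 0, 0};
        call(x, y, sum, len);
        acc += (uint32_t)sum[0];
    }
    t1 = clock();

    *check = acc;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Running each kernel over the same buffers and comparing both the returned times and the check values gives a first-order answer. If the check values differ, the kernels do not agree and the timing is meaningless.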

If the hit from using intrinsics isn't too bad, I would prefer them, 
since I think they are compatible with nearly all ARM compilers (and in 
truth I also prefer using intrinsics for NEON code because I'm just not 
that familiar with ARM assembly).


