[opus] [RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Viswanath Puttagunta
viswanath.puttagunta at linaro.org
Mon Dec 1 09:13:22 PST 2014
Hello Timothy,
I appreciate the thorough review. I have a few questions in-line below
before I re-spin the patch.
On 28 November 2014 at 15:52, Timothy B. Terriberry <tterribe at xiph.org> wrote:
> Review comments inline.
>
>> +if OPUS_ARM_NEON_INTR
>> +noinst_LTLIBRARIES = libarmneon.la
>> +libarmneon_la_SOURCES = $(CELT_SOURCES_ARM_NEON_INTR)
>> +libarmneon_la_CPPFLAGS = $(OPUS_ARM_NEON_INTR_CPPFLAGS) -I$(top_srcdir)/include
>> +endif
>
> I don't think these should be in a separate library. It brings with it
> lots of complications (to name one: wouldn't the .pc files need to be
> updated?). Please use the same mechanism that the SSE intrinsics use to
> add CFLAGS to the compilation of specific object files, e.g.,
Sorry, I don't know what a .pc file is. I have no strong opinion here,
but I was following the guidance from
http://www.gnu.org/software/automake/manual/html_node/Per_002dObject-Flags.html
which specifically advises against per-object flags.
I can follow your feedback if you still think per-object flags are the
right thing to do for libopus.
Please let me know either way.
>
> $(SSE_OBJ): CFLAGS += -msse4.1
>
>> +void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
>> + const opus_val16 *, opus_val32 *, int , int, int) = {
>> + celt_pitch_xcorr_c, /* ARMv4 */
>> + celt_pitch_xcorr_c, /* EDSP */
>> + celt_pitch_xcorr_c, /* Media */
>> +#if defined(OPUS_ARM_NEON_INTR)
>> + celt_pitch_xcorr_float_neon /* Neon */
>
> Please do not use tabs in source code (this applies here and everywhere
> below). Even with the tabs expanded in context, the comments here do not
> line up properly.
Will do. Thanks.
>
>> +static void xcorr_kernel_neon_float(float *x, float *y, float sum[4], int len) {
>
> x and y should be const.
>
>> + float32x4_t YY[5];
>> + float32x4_t XX[4];
>> + float32x2_t XX_2;
>> + float32x4_t SUMM[4];
>> + float *xi = x;
>> + float *yi = y;
>> + int cd = len/4;
>> + int cr = len%4;
>
> len is signed, so / and % are NOT equivalent to the corresponding >> and
> & (they are much slower).
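Understood. Since len is asserted to be non-negative here, something
like this should be equivalent and faster:
   int cd = len>>2;
   int cr = len&3;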
>
>> + int j;
>> +
>> + celt_assert(len>=3);
>> +
>> + /* Initialize sums to 0 */
>> + SUMM[0] = vdupq_n_f32(0);
>> + SUMM[1] = vdupq_n_f32(0);
>> + SUMM[2] = vdupq_n_f32(0);
>> + SUMM[3] = vdupq_n_f32(0);
>> +
>> + YY[0] = vld1q_f32(yi);
>> +
>> + /* Each loop consumes 8 floats in y vector
>> + * and 4 floats in x vector
>> + */
>> + for (j = 0; j < cd; j++) {
>> + yi += 4;
>> + YY[4] = vld1q_f32(yi);
>
> If len == 4, then in the first iteration you will have loaded 8 y
> values, but only 7 are guaranteed to be available (e.g., the C code only
> references y[0] up to y[len-1+3]). You need to end this loop early and
> fall back to another approach. See comments in celt_pitch_xcorr_arm.s
> for details and an example (there are other useful comments there that
> could shave another cycle or two from this inner loop).
I analyzed the implementation in celt_pitch_xcorr_arm.s and will redo
my implementation to follow the same algorithm; it seems more elegant.
This also covers the rest of the feedback on celt_neon_intr.c.
>
>> + YY[1] = vextq_f32(YY[0], YY[4], 1);
>> + YY[2] = vextq_f32(YY[0], YY[4], 2);
>> + YY[3] = vextq_f32(YY[0], YY[4], 3);
>> +
>> + XX[0] = vld1q_dup_f32(xi++);
>> + XX[1] = vld1q_dup_f32(xi++);
>> + XX[2] = vld1q_dup_f32(xi++);
>> + XX[3] = vld1q_dup_f32(xi++);
>
> Don't do this. Do a single load and use vmlaq_lane_f32() to multiply by
> each value. That should cut at least 5 cycles out of this loop.
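If I follow the vmlaq_lane_f32() suggestion correctly, the inner loop
would look roughly like this (untested sketch with my own names; it
still has the same y over-read you flagged above, and is only meant to
show the single-load idea):

   #include <arm_neon.h>

   /* Sketch only: assumes len is a multiple of 4 and that y has len+4
    * readable values, i.e. it keeps the over-read discussed above. */
   static void xcorr_kernel_lane_sketch(const float *x, const float *y,
                                        float sum[4], int len)
   {
      float32x4_t YY[5];
      float32x4_t SUMM[4];
      int j;
      SUMM[0] = vdupq_n_f32(0);
      SUMM[1] = vdupq_n_f32(0);
      SUMM[2] = vdupq_n_f32(0);
      SUMM[3] = vdupq_n_f32(0);
      YY[0] = vld1q_f32(y);
      for (j = 0; j < len; j += 4) {
         float32x4_t XX_4;
         y += 4;
         YY[4] = vld1q_f32(y);
         YY[1] = vextq_f32(YY[0], YY[4], 1);
         YY[2] = vextq_f32(YY[0], YY[4], 2);
         YY[3] = vextq_f32(YY[0], YY[4], 3);
         /* One load of four x values instead of four duplicating loads. */
         XX_4 = vld1q_f32(x);
         x += 4;
         SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], vget_low_f32(XX_4), 0);
         SUMM[1] = vmlaq_lane_f32(SUMM[1], YY[1], vget_low_f32(XX_4), 1);
         SUMM[2] = vmlaq_lane_f32(SUMM[2], YY[2], vget_high_f32(XX_4), 0);
         SUMM[3] = vmlaq_lane_f32(SUMM[3], YY[3], vget_high_f32(XX_4), 1);
         YY[0] = YY[4];
      }
      SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
      SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
      SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
      vst1q_f32(sum, SUMM[0]);
   }

Is that the kind of thing you had in mind?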
>
>> +
>> + SUMM[0] = vmlaq_f32(SUMM[0], XX[0], YY[0]);
>> + SUMM[1] = vmlaq_f32(SUMM[1], XX[1], YY[1]);
>> + SUMM[2] = vmlaq_f32(SUMM[2], XX[2], YY[2]);
>> + SUMM[3] = vmlaq_f32(SUMM[3], XX[3], YY[3]);
>> + YY[0] = YY[4];
>> + }
>> +
>> + /* Handle remaining values max iterations = 3 */
>> + for (j = 0; j < cr; j++) {
>> + YY[0] = vld1q_f32(yi++);
>
> This load is always redundant in the first iteration, which is a bit
> unfortunate.
>
>> + XX_2 = vld1_lane_f32(xi++, XX_2, 0);
>
> Don't load a single lane when you don't need the value(s) in the other
> lane(s). Use vld1_dup_f32() instead. It's faster and breaks dependencies.
Will keep this in mind. Thanks.
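So, if I follow, that load would become something like:
   XX_2 = vld1_dup_f32(xi++);
which fills both lanes with the same value and avoids the dependency on
the previous contents of XX_2.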
>
>> + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
>> + }
>> +
>> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
>> + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
>> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
>> +
>> + vst1q_f32(sum, SUMM[0]);
>> +}
>> +
>> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
>> + opus_val32 *xcorr, int len, int max_pitch, int arch) {
>
> arch is unused. There's no reason to pass it here. If we're here, we
> know what the arch is.
>
>> + int i, j;
>> +
>> + celt_assert(max_pitch > 0);
>> + celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0);
>> +
>> + for (i = 0; i < (max_pitch-3); i += 4) {
>> + xcorr_kernel_neon_float((float *)_x, (float *)_y+i,
>> + (float *)xcorr+i, len);
>> + }
>> +
>> + /* In case max_pitch isn't a multiple of 4, do non-unrolled version */
>> + for (; i < max_pitch; i++) {
>> + float sum = 0;
>> + float *yi = _y+i;
>> + for (j = 0; j < len; j++)
>> + sum += _x[j]*yi[j];
>> + xcorr[i] = sum;
>> + }
>
> This loop can still be largely vectorized. Any reason not to do so?
I should have put a TBD in the comments; I will try to get this done in the re-spin.
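What I have in mind for the re-spin is roughly the following (untested
sketch, using the same variables as in the patch above; the scalar
cleanup handles len not being a multiple of 4):

   for (; i < max_pitch; i++) {
      const float *xi = (const float *)_x;
      const float *yi = (const float *)_y + i;
      float32x4_t acc = vdupq_n_f32(0);
      float32x2_t acc2;
      float sum;
      /* accumulate four products per iteration */
      for (j = 0; j <= len-4; j += 4)
         acc = vmlaq_f32(acc, vld1q_f32(xi+j), vld1q_f32(yi+j));
      /* horizontal add of the four partial sums */
      acc2 = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
      acc2 = vpadd_f32(acc2, acc2);
      sum = vget_lane_f32(acc2, 0);
      /* scalar cleanup for the last len%4 products */
      for (; j < len; j++)
         sum += xi[j]*yi[j];
      xcorr[i] = sum;
   }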
>
>> +}
>> diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h
>> index a07f8ac..f5adc48 100644
>> --- a/celt/arm/pitch_arm.h
>> +++ b/celt/arm/pitch_arm.h
>> @@ -52,6 +52,19 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y,
>> ((void)(arch),PRESUME_NEON(celt_pitch_xcorr)(_x, _y, xcorr, len, max_pitch))
>> # endif
>>
>> -# endif
>> +#else /* Start !FIXED_POINT */
>> +/* Float case */
>> +#if defined(OPUS_ARM_NEON_INTR)
>> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
>> + opus_val32 *xcorr, int len, int max_pitch, int arch);
>> +#endif
>> +
>> +#if !defined(OPUS_HAVE_RTCD) && defined(OPUS_PRESUME_ARM_NEON_INTR)
>> +#define OVERRIDE_PITCH_XCORR (1)
>> +#define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>> + ((void)(arch),celt_pitch_xcorr_float_neon(_x, _y, xcorr, \
>> + len, max_pitch, arch))
>
> Again, don't pass arch.
I really did not want to pass "arch", but I couldn't figure out a way
to avoid it, since the signature of
celt_pitch_xcorr_c(const opus_val16 *_x, const opus_val16 *_y,
                   opus_val32 *xcorr, int len, int max_pitch, int arch)
takes arch as a parameter. Also, xcorr_kernel_c(), which is called by
celt_pitch_xcorr_c(), could itself be optimized in isolation for some
other CPU architecture.
I also don't understand how the fixed-point code works here: doesn't it
face the same problem? Don't you get a compile error because the
function signatures don't match?
I tried to be as non-disruptive as possible, but my alternative thought
is that I don't see why arch has to be passed on every call to
celt_pitch_xcorr() so that we can do a table lookup for the appropriate
function each time. Instead, couldn't we detect the arch once, during
opus_custom_decoder_init() (or some other function that is always
called anyway), and assign the function pointers to the optimized or C
implementations at that point?
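Something along these lines is what I have in mind (purely hypothetical
sketch; all of the names below are made up and not from the actual opus
code):

   /* Hypothetical: resolve the implementation once at init time instead
    * of indexing CELT_PITCH_XCORR_IMPL[] on every call. */
   typedef void (*pitch_xcorr_fn)(const float *x, const float *y,
                                  float *xcorr, int len, int max_pitch);

   extern void pitch_xcorr_c_impl(const float *, const float *,
                                  float *, int, int);
   extern void pitch_xcorr_neon_impl(const float *, const float *,
                                     float *, int, int);

   static pitch_xcorr_fn pitch_xcorr_ptr = pitch_xcorr_c_impl;

   static void pick_pitch_xcorr_impl(int arch)
   {
      (void)arch;
   #if defined(OPUS_ARM_NEON_INTR)
      if (arch != 0)   /* hypothetical "NEON is available" check */
         pitch_xcorr_ptr = pitch_xcorr_neon_impl;
   #endif
   }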
>
>> (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
>> - const opus_val16 *, opus_val32 *, int, int);
>> + const opus_val16 *, opus_val32 *, int, int
>> +#if !defined(FIXED_POINT)
>> + ,int
>> +#endif
>> +);
>
> Which gets rid of this ugliness.
>
>> +#if defined(FIXED_POINT)
>> +# define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>> + ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \
>> + xcorr, len, max_pitch)
>> +#else
>> # define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>> ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \
>> - xcorr, len, max_pitch))
>> + xcorr, len, max_pitch, arch))
>> +#endif
>> +
>
> And this.
>
>> diff --git a/configure.ac b/configure.ac
>> index 9b2f51f..09657b6 100644
>> --- a/configure.ac
>> +++ b/configure.ac
>> @@ -198,12 +198,11 @@ cpu_arm=no
>>
>> AS_IF([test x"${enable_asm}" = x"yes"],[
>> inline_optimization="No ASM for your platform, please send patches"
>> + OPUS_ARM_NEON_INTR_CPPFLAGS=
>> case $host_cpu in
>> arm*)
>> - dnl Currently we only have asm for fixed-point
>> - AS_IF([test "$enable_float" != "yes"],[
>> cpu_arm=yes
>> - AC_DEFINE([OPUS_ARM_ASM], [], [Make use of ARM asm optimization])
>> + AC_DEFINE([OPUS_ARM_ASM], [], [Make use of ARM asm/intrinsic optimization])
>
> Not sure I'm in love with conflating intrinsics with inline assembly.