[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

Thu Jan 29 09:39:21 PST 2015

Viswanath Puttagunta wrote:
>   if OPUS_ARM_NEON_INTR
>   CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \
> -			%test_unit_rotation.o %test_unit_mathops.o
> -$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS)
> +                        $(CELT_SOURCES_ARM_NE10:.c=.lo) \
> +			               %test_unit_rotation.o %test_unit_mathops.o \
> +                        %test_unit_mdct.o %test_unit_dft.o

Crazy indentation.

> +$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) $(NE10_CFLAGS)
>   endif
> diff --git a/celt/arm/arm_celt_ne10_fft_map.c b/celt/arm/arm_celt_ne10_fft_map.c
> new file mode 100644
> index 0000000..5bb7b5f

Please put these tables (and the ones in arm_celt_ne10_mdct_map) in 
arm_celt_map.c... the goal was to have them all in the same place, so 
that if we changed the architectures supported or something, we did not 
have to hunt all over the codebase to update them all.

> +int (*const OPUS_FFT_ALLOC_ARCH[OPUS_ARCHMASK+1])(kiss_fft_state *st) = {

Please follow the naming convention used in arm_celt_map, e.g., 
<all_caps_function_name>_IMPL instead of _ARCH (on this and all the 
other tables).

> +void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg,
> +                                       const kiss_fft_cpx *fin,
> +                                       kiss_fft_cpx *fout) = {

These are mis-aligned.

> +   st->priv = (void *)ne10_fft_alloc_c2c_float32_neon(st->nfft);
> +   if (st->priv == NULL) {
> +      printf("Unable to ne10 alloc\n");

Absolutely no printfs in library code.

> +      return -1;
> +   }
> +   return 0;
> +}
> +
> +void opus_fft_free_arm_float_neon(kiss_fft_state *st)
> +{
> +   ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
> +
> +   if (cfg)
> +      free((void *)cfg);

This concerns me for several reasons:

1) We never call free() directly in libopus. It is always wrapped in the 
opus_free() macro to allow ports to override it (and as a debugging tool).

2) We didn't call malloc here, NE10 did, which is in a separate module. 
That will work in some environments (Linux), but in others (say, Windows 
Phone), memory allocated in one module MUST be freed by the same module, 
because different modules do not share libc state (indeed, they can even 
be linked against different implementations of libc). If this is how the 
API of NE10 is designed to work, then that API needs to be fixed.

> +}
> +#endif
> +
> +void opus_fft_float_neon(const kiss_fft_state *st,
> +                        const kiss_fft_cpx *fin,
> +                        kiss_fft_cpx *fout)

Again, these are mis-aligned.

> +{
> +   ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
> +   VARDECL(ne10_fft_cpx_float32_t, temp);
> +   VARDECL(ne10_fft_cpx_float32_t, tempin);

Just another note on API design... the _t suffix is reserved by POSIX, 
and should never be used by user code. It's unlikely to cause issues 
with these long type names, but I have certainly seen it cause issues 
elsewhere (and using the style encourages others to do it).

> +   SAVE_STACK;
> +   int N2 = st->nfft >> 1;
> +   float32x4_t inq, outq;
> +   float32x2_t scale;
> +   float *in = (float *)fin;

You're dropping the const qualifier for no reason. Also, vld1q_f32() 
takes a const float32_t *, NOT a float *, and they are not compatible on 
all compiler versions.

> +   float *out;
> +   int i;
> +   ALLOC(temp, st->nfft, ne10_fft_cpx_float32_t);
> +   ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);

This seems like a fairly large increase in peak stack usage (7.5 kB). Is 
there any chance of reducing this?

For example, presumably the first thing ne10_fft_c2c_1d_float32_neon() 
does is bit-reverse the input... if it could be modified to do the 
scaling at the same time, that would get rid of one buffer. The other 
option is to make opus_fft's callers do it (clt_mdct_forward already 
does this in the C version). The other other option is to modify the API 
to take a mutable input buffer, but combining the scaling with another 
loop will reduce the number of passes (and thus load/stores), which will 
likely be faster. That requires arch-specific code for the call in 
analysis.c (and the tests), though.

clt_mdct_forward() has a buffer f that is the right size and is not used 
during the opus_fft() call. Somehow being able to re-use that buffer 
would get rid of the other temporary buffer here. You'd still need to 
allocate one for the call in analysis.c, but that's at least near the 
top of the stack (of course, better yet would be to eliminate the need 
for this buffer in NE10 entirely).

The exact approach here really depends on our ability to modify the NE10 
API, but I'm getting the impression that clt_mdct_forward_float_neon() 
should probably not call opus_fft() at all (but directly access the NE10 
API), just as the C version directly accesses opus_fft_impl().

> +
> +   out = (float *)tempin;

These are pretty confusing names (if you have to keep this scaling 
here). Ideally they'd be related since they refer to the same memory 
(e.g., scaled and scaledp or something).

Also, float is _not_ compatible with float32_t (which is what vst1q_f32 
takes) in all compiler versions. Please do not mix and match them.

> +   scale = vld1_dup_f32(&st->scale);

Needs a (const float32_t *) cast.

> +   for (i = 0; i < N2; i++) {
> +      inq = vld1q_f32(in);
> +      in += 4;
> +      outq = vmulq_lane_f32(inq, scale, 0);
> +      vst1q_f32(out, outq);
> +      out += 4;
> +   }
> +
> +   cfg->buffer = (ne10_fft_cpx_float32_t *)&temp[0];

If the struct name is "buffer", probably better to have the temporary 
named "buffer" too, since it's not used anywhere else.

> +   ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout,
> +                                 (ne10_fft_cpx_float32_t *)tempin,
> +                                 cfg, 0);

More mis-alignment.

> +   /* N/4 complex FFT, does not downscale anymore */
> +   opus_fft(st, f2, (kiss_fft_cpx *)f, opus_select_arch());

Because you removed the scaling from the above loop, this comment is 
inaccurate.

> diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h
> new file mode 100644
> index 0000000..16f008b
> --- /dev/null
> +++ b/celt/arm/fft_arm.h
> @@ -0,0 +1,65 @@
> +/* Copyright (c) 2015-2016 Xiph.Org Foundation

Are you from the future?

> +ifdef HAVE_ARM_NE10
> +CC = gcc
> +CFLAGS += -mfpu=neon
> +INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR
> +LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so
> +SOURCES += ../arm/celt_neon_intr.c dump_mode_arm_ne10.c
> +endif

It's a bit unfortunate that this depends on having NE10 available, since 
it's used to generate static files which someone may ultimately build on 
a completely different system, but I'm not sure how much effort it's 
worth to try to fix that. Probably not much.

> diff --git a/celt/dump_modes/dump_mode_arm_ne10.c b/celt/dump_modes/dump_mode_arm_ne10.c
> new file mode 100644
> index 0000000..30c7423
> --- /dev/null
> +++ b/celt/dump_modes/dump_mode_arm_ne10.c

"dump_modes_arm_ne10.c"

> +         fprintf(file, "{%f,%f},%c", cfg->twiddles[j].r, cfg->twiddles[j].i,(j+4)%3==0?'\n':' ');

Please use the same conversion specification as dump_modes.c for FLOAT, 
e.g., "%#0.8gf".

> +   fprintf(file, "\n#ifdef HAVE_ARM_NE10\n");
> +   fprintf(file, "#define OVERRIDE_FFT 1\n");
> +   fprintf(file, "#include \"%s\"\n", ARM_NE10_ARCH_FILE_NAME);
> +   fprintf(file, "#endif\n");

At least if you do generate the files on a system without NE10, and then 
build on a system with NE10, this will fail, which is probably the right 
thing.

> @@ -205,7 +220,6 @@ void dump_modes(FILE *file, CELTMode **modes, int nb_modes)
>         fprintf(file, "#endif\n");
>         fprintf(file, "\n");
>
> -
>         /* Print the actual mode data */
>         fprintf(file, "static const CELTMode mode%d_%d_%d = {\n", mode->Fs, mdctSize, mode->overlap);
>         fprintf(file, INT32 ",    /* Fs */\n", mode->Fs);

Irrelevant whitespace change.

>      {
> +      opus_fft_free_arch((kiss_fft_state *)cfg, opus_select_arch());
>         opus_free((opus_int16*)cfg->bitrev);

Wrong indentation.

> @@ -59,6 +60,7 @@ extern "C" {
>   #   define kiss_twiddle_scalar float
>   #   define KF_SUFFIX _celt_single
>   # endif
> +
>   #endif
>
>   typedef struct {

Irrelevant whitespace change.

> @@ -87,8 +89,13 @@ typedef struct kiss_fft_state{
>       opus_int16 factors[2*MAXFACTORS];
>       const opus_int16 *bitrev;
>       const kiss_twiddle_cpx *twiddles;
> +    void *priv; /* Used by arch specfic optimizations */

Wrong indentation.

>   } kiss_fft_state;

Can I get a copy of the
/* ARM NE10 library does not support below values */
comment you added below here as well?

> +#ifndef HAVE_ARM_NE10
>           test1d(36,0);
>           test1d(36,1);
>           test1d(50,0);
>           test1d(50,1);
> +#endif

>   celt/static_modes_fixed.h \
> +celt/static_modes_float_arm_ne10.h \
>   celt/arm/armcpu.h \

Wrong indentation.

>   celt/arm/fixed_armv4.h \
>   celt/arm/fixed_armv5e.h \
>   celt/arm/kiss_fft_armv4.h \
>   celt/arm/kiss_fft_armv5e.h \
>   celt/arm/pitch_arm.h \
> +celt/arm/fft_arm.h \
> +celt/arm/mdct_arm.h \
>   celt/x86/pitch_sse.h \

Etc.