[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

Viswanath Puttagunta viswanath.puttagunta at linaro.org
Thu Jan 29 15:13:34 PST 2015


Hi Timothy,

Appreciate the comprehensive code review.

The biggest issue I see is the peak stack usage.... rest looks like
fairly straight forward cleanup.

Is the peak stack usage a complete blocker in current form?
If it is indeed a blocker, would it be acceptable if we can reduce
additional buffer requirement from 2 buffers (current) to 1, possibly
by moving scaling inside ne10_fft_c2c_1d_float32_neon? (I will let
Phil comment on if there is any reason not to do this.. and suggest
any alternatives if necessary)

I will work on fixing up code from opus side.

Phil is one of developers/maintainers of opus. Please correct me if
any of information below is inaccurate from a libNE10 change request
perspective.

Phil,

Below are 2 main things we need to change from NE10 Library side. Can
you please make these changes on NE10 side?

1. Remove usage of _t in ne10_fft_cfg_float32_t and ne10_fft_cpx_float32_t
-------from comment----
ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
+   VARDECL(ne10_fft_cpx_float32_t, temp);
+   VARDECL(ne10_fft_cpx_float32_t, tempin);

Just another note on API design... the _t suffix is reserved by POSIX,
and should never be used by user code. It's unlikely to cause issues
with these long type names, but I have certainly seen it cause issues
elsewhere (and using the style encourages others to do it).
-------
It should be fairly trivial to fix by adding something like this into
NE10 source with something like
typedef ne10_fft_cpx_float32 ne10_fft_cpx_float32_t
typedef ne10_fft_cpx_float32 ne10_fft_cpx_float32_t

Then from opus side, I can instead use
ne10_fft_cfg_float32 cfg = (ne10_fft_cfg_float32)st->priv;
VARDECL(ne10_fft_cpx_float32, temp);
VARDECL(ne10_fft_cpx_float32, tempin);

2. opus_fft_float_neon: Remove at least one additional buffer that is
currently needed (on stack)
- Currently, we have to use 2 additional buffers on stack.
- One for scaling (ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);)
- Another as temp buffer (cfg->buffer)
- Is there any reason the scaling cannot be inside
ne10_fft_c2c_1d_float32_neon and just re-use cfg->buffer for scaling
as well?
- This is probably the easiest way to get rid of one additional buffer.
Note: Timothy, I remember Phil telling me that NE10 fft algorithm does
not use bit reversal. (Phil, correct if I'm mistaken here)

Regards,
Vish

On 29 January 2015 at 11:39, Timothy B. Terriberry <tterribe at xiph.org> wrote:
> Viswanath Puttagunta wrote:
>>
>>   if OPUS_ARM_NEON_INTR
>>   CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \
>> -                       %test_unit_rotation.o %test_unit_mathops.o
>> -$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS)
>> +                        $(CELT_SOURCES_ARM_NE10:.c=.lo) \
>> +                                      %test_unit_rotation.o
>> %test_unit_mathops.o \
>> +                        %test_unit_mdct.o %test_unit_dft.o
>
>
> Crazy indentation.
>
>> +$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS)
>> $(NE10_CFLAGS)
>>   endif
>> diff --git a/celt/arm/arm_celt_ne10_fft_map.c
>> b/celt/arm/arm_celt_ne10_fft_map.c
>> new file mode 100644
>> index 0000000..5bb7b5f
>
>
> Please put these tables (and the ones in arm_celt_ne10_mdct_map) in
> arm_celt_map.c... the goal was to have them all in the same place, so that
> if we changed the architectures supported or something, we did not have to
> hunt all over the codebase to update them all.
>
>> +int (*const OPUS_FFT_ALLOC_ARCH[OPUS_ARCHMASK+1])(kiss_fft_state *st) = {
>
>
> Please follow the naming convention used in arm_celt_map, e.g.,
> <all_caps_function_name>_IMPL instead of _ARCH (on this and all the other
> tables).
>
>> +void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg,
>> +                                       const kiss_fft_cpx *fin,
>> +                                       kiss_fft_cpx *fout) = {
>
>
> These are mis-aligned.
>
>> +   st->priv = (void *)ne10_fft_alloc_c2c_float32_neon(st->nfft);
>> +   if (st->priv == NULL) {
>> +      printf("Unable to ne10 alloc\n");
>
>
> Absolutely no printfs in library code.
>
>> +      return -1;
>> +   }
>> +   return 0;
>> +}
>> +
>> +void opus_fft_free_arm_float_neon(kiss_fft_state *st)
>> +{
>> +   ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
>> +
>> +   if (cfg)
>> +      free((void *)cfg);
>
>
> This concerns me for several reasons:
>
> 1) We never call free() directly in libopus. It is always wrapped in the
> opus_free() macro to allow ports to override it (and as a debugging tool).
>
> 2) We didn't call malloc here, NE10 did, which is in a separate module. That
> will work in some environments (Linux), but in others (say, Windows Phone),
> memory allocated in one module MUST be freed by the same module, because
> different modules do not share libc state (indeed, they can even be linked
> against different implementations of libc). If this is how the API of NE10
> is designed to work, then that API needs to be fixed.
>
>> +}
>> +#endif
>> +
>> +void opus_fft_float_neon(const kiss_fft_state *st,
>> +                        const kiss_fft_cpx *fin,
>> +                        kiss_fft_cpx *fout)
>
>
> Again, these are mis-aligned.
>
>> +{
>> +   ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
>> +   VARDECL(ne10_fft_cpx_float32_t, temp);
>> +   VARDECL(ne10_fft_cpx_float32_t, tempin);
>
>
> Just another note on API design... the _t suffix is reserved by POSIX, and
> should never be used by user code. It's unlikely to cause issues with these
> long type names, but I have certainly seen it cause issues elsewhere (and
> using the style encourages others to do it).
>
>> +   SAVE_STACK;
>> +   int N2 = st->nfft >> 1;
>> +   float32x4_t inq, outq;
>> +   float32x2_t scale;
>> +   float *in = (float *)fin;
>
>
> You're dropping the const qualifier for no reason. Also, vld1q_f32() takes a
> const float32_t *, NOT a float *, and they are not compatible on all
> compiler versions.
>
>> +   float *out;
>> +   int i;
>> +   ALLOC(temp, st->nfft, ne10_fft_cpx_float32_t);
>> +   ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);
>
>
> This seems like a fairly large increase in peak stack usage (7.5 kB). Is
> there any chance of reducing this?
>
> For example, presumably the first thing ne10_fft_c2c_1d_float32_neon() does
> is bit-reverse the input... if it could be modified to do the scaling at the
> same time, that would get rid of one buffer. The other option is to make
> opus_fft's callers do it (clt_mdct_forward already does this in the C
> version). The other other option is to modify the API to take a mutable
> input buffer, but combining the scaling with another loop will reduce the
> number of passes (and thus load/stores), which will likely be faster. That
> requires arch-specific code for the call in analysis.c (and the tests),
> though.
>
> clt_mdct_forward() has a buffer f that is the right size and is not used
> during the opus_fft() call. Somehow being able to re-use that buffer would
> get rid of the other temporary buffer here. You'd still need to allocate one
> for the call in analysis.c, but that's at least near the top of the stack
> (of course, better yet would be to eliminate the need for this buffer in
> NE10 entirely).
>
> The exact approach here really depends on our ability to modify the NE10
> API, but I'm getting the impression that clt_mdct_forward_float_neon()
> should probably not call opus_fft() at all (but directly access the NE10
> API), just as the C version directly accesses opus_fft_impl().
>
>> +
>> +   out = (float *)tempin;
>
>
> These are pretty confusing names (if you have to keep this scaling here).
> Ideally they'd be related since they refer to the same memory (e.g., scaled
> and scaledp or something).
>
> Also, float is _not_ compatible with float32_t (which is what vst1q_f32
> takes) in all compiler versions. Please do not mix and match them.
>
>> +   scale = vld1_dup_f32(&st->scale);
>
>
> Needs a (const float32_t *) cast.
>
>> +   for (i = 0; i < N2; i++) {
>> +      inq = vld1q_f32(in);
>> +      in += 4;
>> +      outq = vmulq_lane_f32(inq, scale, 0);
>> +      vst1q_f32(out, outq);
>> +      out += 4;
>> +   }
>> +
>> +   cfg->buffer = (ne10_fft_cpx_float32_t *)&temp[0];
>
>
> If the struct name is "buffer", probably better to have the temporary named
> "buffer" too, since it's not used anywhere else.
>
>> +   ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout,
>> +                                 (ne10_fft_cpx_float32_t *)tempin,
>> +                                 cfg, 0);
>
>
> More mis-alignment.
>
>> +   /* N/4 complex FFT, does not downscale anymore */
>> +   opus_fft(st, f2, (kiss_fft_cpx *)f, opus_select_arch());
>
>
> Because you removed the scaling from the above loop, this comment is
> inaccurate.
>
>> diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h
>> new file mode 100644
>> index 0000000..16f008b
>> --- /dev/null
>> +++ b/celt/arm/fft_arm.h
>> @@ -0,0 +1,65 @@
>> +/* Copyright (c) 2015-2016 Xiph.Org Foundation
>
>
> Are you from the future?
>
>> +ifdef HAVE_ARM_NE10
>> +CC = gcc
>> +CFLAGS += -mfpu=neon
>> +INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR
>> +LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so
>> +SOURCES += ../arm/celt_neon_intr.c dump_mode_arm_ne10.c
>> +endif
>
>
> It's a bit unfortunate that this depends on having NE10 available, since
> it's used to generate static files which someone may ultimately build on a
> completely different system, but I'm not sure how much effort it's worth to
> try to fix that. Probably not much.
>
>> diff --git a/celt/dump_modes/dump_mode_arm_ne10.c
>> b/celt/dump_modes/dump_mode_arm_ne10.c
>> new file mode 100644
>> index 0000000..30c7423
>> --- /dev/null
>> +++ b/celt/dump_modes/dump_mode_arm_ne10.c
>
>
> "dump_modes_arm_ne10.c"
>
>> +         fprintf(file, "{%f,%f},%c", cfg->twiddles[j].r,
>> cfg->twiddles[j].i,(j+4)%3==0?'\n':' ');
>
>
> Please use the same conversion specification as dump_modes.c for FLOAT,
> e.g., "%#0.8gf".
>
>> +   fprintf(file, "\n#ifdef HAVE_ARM_NE10\n");
>> +   fprintf(file, "#define OVERRIDE_FFT 1\n");
>> +   fprintf(file, "#include \"%s\"\n", ARM_NE10_ARCH_FILE_NAME);
>> +   fprintf(file, "#endif\n");
>
>
> At least if you do generate the files on a system without NE10, and then
> build on a system with NE10, this will fail, which is probably the right
> thing.
>
>> @@ -205,7 +220,6 @@ void dump_modes(FILE *file, CELTMode **modes, int
>> nb_modes)
>>         fprintf(file, "#endif\n");
>>         fprintf(file, "\n");
>>
>> -
>>         /* Print the actual mode data */
>>         fprintf(file, "static const CELTMode mode%d_%d_%d = {\n",
>> mode->Fs, mdctSize, mode->overlap);
>>         fprintf(file, INT32 ",    /* Fs */\n", mode->Fs);
>
>
> Irrelevant whitespace change.
>
>>      {
>> +      opus_fft_free_arch((kiss_fft_state *)cfg, opus_select_arch());
>>         opus_free((opus_int16*)cfg->bitrev);
>
>
> Wrong indentation.
>
>> @@ -59,6 +60,7 @@ extern "C" {
>>   #   define kiss_twiddle_scalar float
>>   #   define KF_SUFFIX _celt_single
>>   # endif
>> +
>>   #endif
>>
>>   typedef struct {
>
>
> Irrelevant whitespace change.
>
>> @@ -87,8 +89,13 @@ typedef struct kiss_fft_state{
>>       opus_int16 factors[2*MAXFACTORS];
>>       const opus_int16 *bitrev;
>>       const kiss_twiddle_cpx *twiddles;
>> +    void *priv; /* Used by arch specfic optimizations */
>
>
> Wrong indentation.
>
>>   } kiss_fft_state;
>
>
> Can I get a copy of the
> /* ARM NE10 library does not support below values */
> comment you added below here as well?
>
>> +#ifndef HAVE_ARM_NE10
>>           test1d(36,0);
>>           test1d(36,1);
>>           test1d(50,0);
>>           test1d(50,1);
>> +#endif
>
>
>>   celt/static_modes_fixed.h \
>> +celt/static_modes_float_arm_ne10.h \
>>   celt/arm/armcpu.h \
>
>
> Wrong indentation.
>
>>   celt/arm/fixed_armv4.h \
>>   celt/arm/fixed_armv5e.h \
>>   celt/arm/kiss_fft_armv4.h \
>>   celt/arm/kiss_fft_armv5e.h \
>>   celt/arm/pitch_arm.h \
>> +celt/arm/fft_arm.h \
>> +celt/arm/mdct_arm.h \
>>   celt/x86/pitch_sse.h \
>
>
> Etc.


More information about the opus mailing list