<div dir="ltr">Hi Ulrich,<div><br></div><div>Making the typedef aligned can benefit more. As long as an instance sits in memory, be it static, stack, or heap, additional alignment can benefit. This patch came from the discovery that the performance randomly changes by just turning on and off LTO or modifying totally unrelated code linked to the final executable. If we just make that specific array aligned, it would be determined totally by the linker's "mood" whether the other occurrences are 4-bytes aligned or not.</div><div><br></div><div>I also suspect (though I have not verified) that for some architectures that are unable to access unaligned memory, making the typedef aligned would facilitate the compiler to generate faster code.</div><div><br></div><div>Best regards,</div><div>Zheng</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 24 Oct 2022 at 15:27, Ulrich Windl <<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de">Ulrich.Windl@rz.uni-regensburg.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">>>> Zheng Lv <<a href="mailto:lvzheng@google.com" target="_blank">lvzheng@google.com</a>> schrieb am 24.10.2022 um 04:31 in Nachricht<br>
<<a href="mailto:CACNzCb422LVKKUzRSYJnDH09uaVgOABB1w6z48zYUXMPQvQAmg@mail.gmail.com" target="_blank">CACNzCb422LVKKUzRSYJnDH09uaVgOABB1w6z48zYUXMPQvQAmg@mail.gmail.com</a>>:<br>
> Hi Team,<br>
> <br>
> Can anyone take a look? We are already using this optimization in<br>
> production.<br>
> <br>
> Best regards,<br>
> Zheng<br>
> <br>
> On Thu, 15 Sept 2022 at 19:04, Zheng Lv <<a href="mailto:lvzheng@google.com" target="_blank">lvzheng@google.com</a>> wrote:<br>
> <br>
>> This makes kiss_twiddle_cpx 4-byte aligned (instead of 2-byte) for<br>
>> fixed-point builds. Tested with an armv6j+nofp development board, CELT<br>
>> encoding becomes 1.4x as fast, and decoding over 2x.<br>
>><br>
>> Performance gain is mostly attributed to the proper alignment of the<br>
>> static const array mdct_twiddles960.<br>
>><br>
>> Co-authored-by: David Gao <<a href="mailto:davidgao@google.com" target="_blank">davidgao@google.com</a>><br>
>> ---<br>
>> celt/kiss_fft.h | 12 +++++++++++-<br>
>> 1 file changed, 11 insertions(+), 1 deletion(-)<br>
>><br>
>> diff --git a/celt/kiss_fft.h b/celt/kiss_fft.h<br>
>> index bffa2bfa..267f72f9 100644<br>
>> --- a/celt/kiss_fft.h<br>
>> +++ b/celt/kiss_fft.h<br>
>> @@ -49,31 +49,41 @@ extern "C" {<br>
>> #ifdef FIXED_POINT<br>
>> #include "arch.h"<br>
>><br>
>> # define kiss_fft_scalar opus_int32<br>
>> # define kiss_twiddle_scalar opus_int16<br>
>><br>
>> +/* Some 32-bit CPUs would load/store a kiss_twiddle_cpx with a single<br>
>> memory<br>
>> + * access, and could benefit from additional alignment.<br>
>> + */<br>
>> +# define KISS_TWIDDLE_CPX_ALIGNMENT (sizeof(opus_int32))<br>
>><br>
>> #else<br>
>> # ifndef kiss_fft_scalar<br>
>> /* default is float */<br>
>> # define kiss_fft_scalar float<br>
>> # define kiss_twiddle_scalar float<br>
>> # define KF_SUFFIX _celt_single<br>
>> # endif<br>
>> #endif<br>
>><br>
>> +#if defined(__GNUC__) && defined(KISS_TWIDDLE_CPX_ALIGNMENT)<br>
>> +#define KISS_TWIDDLE_CPX_ALIGNED<br>
>> __attribute__((aligned(KISS_TWIDDLE_CPX_ALIGNMENT)))<br>
>> +#else<br>
>> +#define KISS_TWIDDLE_CPX_ALIGNED<br>
>> +#endif<br>
>> +<br>
>> typedef struct {<br>
>> kiss_fft_scalar r;<br>
>> kiss_fft_scalar i;<br>
>> }kiss_fft_cpx;<br>
>><br>
>> typedef struct {<br>
>> kiss_twiddle_scalar r;<br>
>> kiss_twiddle_scalar i;<br>
>> -}kiss_twiddle_cpx;<br>
>> +} KISS_TWIDDLE_CPX_ALIGNED kiss_twiddle_cpx;<br>
<br>
I'm only surprised that you do align the typedef and not the actual data!<br>
<br>
<br>
>><br>
>> #define MAXFACTORS 8<br>
>> /* e.g. an fft of length 128 has 4 factors<br>
>> as far as kissfft is concerned<br>
>> 4*4*4*2<br>
>> */<br>
>> --<br>
>> 2.37.2.789.g6183377224-goog<br>
>><br>
>><br>
<br>
<br>
<br>
<br>
</blockquote></div>