[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Fri Dec 19 05:18:39 PST 2014

On 19 December 2014 at 05:26, Timothy B. Terriberry <tterribe at xiph.org> wrote:
> Viswanath Puttagunta wrote:
>> I responded to your feedback before I started on RFCv3.. and took your
>> silence as approval :).. I guess that email got lost in your inbox sea
>> some where.. so re-posting the responses.
>
> Sorry, I did see it but I guess I read it rather more quickly than I
> thought. Apologies for that.
>
>> guidance. I wouldn't know where else to put this. Without
>
> Sorry for not being clear. I meant you shouldn't put the -O3 anywhere at
> all.
>
>> "-mfpu=neon", the intrinsic code wouldn't even compile. And without
>> -O3.. the instructions that are generated are just painful. Please
>> advise.
>
> With what compiler? Using gcc 4.2 on my rather outdated Cortex A8 setup
> (Greg is working on getting a more modern one configured for us), the
> assembly output with -O3 and (the default) -O2 is identical. However, if
> someone actually wants to make an unoptimized build for some reason
> (e.g., testing or debugging), you've broken that.
Thanks. After re-referring to the NEON Programmer's guide carefully
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0018a/index.html
the appropriate flag should have been "-ftree-vectorize".

I can also confirm that -O2 and -O3 produces almost same output. What
I was comparing when I meant "painful" is when I don't use any optimization
level.. which I tried during unit testing. But it looks like for opus,
the default
optimization flag is -O2... so just adding -ftree-vectorize should be enough.

FYI, -O3 implied -ftree-vectorize.. and that is why it worked in the
past.

>
>> Now, that I unrolled it and tested it to make sure it works, is there
>> a specific reason you think this is will be any slower than a simple
>> loop and want to go back? I know it saves one load as compared to a
>> simple loop. It would be hard to prove that it is measurably better
>> than the simple loop. But I would sincerely prefer it this way.. It is
>> fairly straight forward unroll in my opinion.
>
> It makes sense to optimize for executed cycles or readability/code size,
> but this doesn't really do either.
>
> You'll see the code in celt_pitch_xcorr_arm.s processes the last samples
> in groups of 2, 1, 1. It costs one extra compare (only one, because keep
> in mind the switch has an extra compare since gcc is not smart enough to
> prove that len==0 is impossible, despite our assert), but avoids an
> indirect jump. That's all probably irrelevant, as the bottlenecks here
> are going to be the stalls waiting for the loads and multiplies on the
> NEON side, and this has one fewer of both when len is 3 or 4. If you
> were going to fully unroll, that's kind of what I expected.
>
> I agree that doing the extra load is suboptimal, but you can still loop
> the first three iterations, which means half the code. That's good for
> code cache size and readability (i.e., Don't Repeat Yourself).
>
> If there were a measurable performance advantage to the switch
> statement, that would change my opinion, but if there isn't, then either
> of the other two approaches seem better.

OK, thanks for your explanation. I will take a closer look at this part
of celt_pitch_xcorr_arm.s

>
>
> One more nit I hadn't noticed before:
>
>> +   float *xi = x;
>> +   float *yi = y;
>
> These need to be const float32_t (in both xcorr_kernel_neon_float and
> xcorr_kernel_neon_float_process1). They're currently causing a ton of
> warning spew. float32_t appears to not be considered equivalent to
> float, which means you'll also need casts here:
>
>> +   vst1q_f32(sum, SUMM);
>
> and here:
>
>> +   vst1_lane_f32(sum, SUMM_2[0], 0);
Thanks, will do.
> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus