[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Fri Dec 19 05:22:03 PST 2014

On 19 December 2014 at 07:18, Viswanath Puttagunta
<viswanath.puttagunta at linaro.org> wrote:
> On 19 December 2014 at 05:26, Timothy B. Terriberry <tterribe at xiph.org> wrote:
>> Viswanath Puttagunta wrote:
>>> I responded to your feedback before I started on RFCv3.. and took your
>>> silence as approval :).. I guess that email got lost in your inbox sea
>>> some where.. so re-posting the responses.
>>
>> Sorry, I did see it but I guess I read it rather more quickly than I
>> thought. Apologies for that.
>>
>>> guidance. I wouldn't know where else to put this. Without
>>
>> Sorry for not being clear. I meant you shouldn't put the -O3 anywhere at
>> all.
>>
>>> "-mfpu=neon", the intrinsic code wouldn't even compile. And without
>>> -O3.. the instructions that are generated are just painful. Please
>>> advise.
>>
>> With what compiler? Using gcc 4.2 on my rather outdated Cortex A8 setup
>> (Greg is working on getting a more modern one configured for us), the
>> assembly output with -O3 and (the default) -O2 is identical. However, if
>> someone actually wants to make an unoptimized build for some reason
>> (e.g., testing or debugging), you've broken that.
> Thanks. After re-referring to the NEON Programmer's guide carefully
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0018a/index.html
> the appropriate flag should have been "-ftree-vectorize".
After further testing, just having -mfpu=neon is enough.. So, I'll just use this
to keep it consistent with the checks in configure.ac

>
> I can also confirm that -O2 and -O3 produces almost same output. What
> I was comparing when I meant "painful" is when I don't use any optimization
> level.. which I tried during unit testing. But it looks like for opus,
> the default
> optimization flag is -O2... so just adding -ftree-vectorize should be enough.
>
> FYI, -O3 implied -ftree-vectorize.. and that is why it worked in the
> past.
>
>>
>>> Now, that I unrolled it and tested it to make sure it works, is there
>>> a specific reason you think this is will be any slower than a simple
>>> loop and want to go back? I know it saves one load as compared to a
>>> simple loop. It would be hard to prove that it is measurably better
>>> than the simple loop. But I would sincerely prefer it this way.. It is
>>> fairly straight forward unroll in my opinion.
>>
>> It makes sense to optimize for executed cycles or readability/code size,
>> but this doesn't really do either.
>>
>> You'll see the code in celt_pitch_xcorr_arm.s processes the last samples
>> in groups of 2, 1, 1. It costs one extra compare (only one, because keep
>> in mind the switch has an extra compare since gcc is not smart enough to
>> prove that len==0 is impossible, despite our assert), but avoids an
>> indirect jump. That's all probably irrelevant, as the bottlenecks here
>> are going to be the stalls waiting for the loads and multiplies on the
>> NEON side, and this has one fewer of both when len is 3 or 4. If you
>> were going to fully unroll, that's kind of what I expected.
>>
>> I agree that doing the extra load is suboptimal, but you can still loop
>> the first three iterations, which means half the code. That's good for
>> code cache size and readability (i.e., Don't Repeat Yourself).
>>
>> If there were a measurable performance advantage to the switch
>> statement, that would change my opinion, but if there isn't, then either
>> of the other two approaches seem better.
>
> OK, thanks for your explanation. I will take a closer look at this part
> of celt_pitch_xcorr_arm.s
>
>>
>>
>> One more nit I hadn't noticed before:
>>
>>> +   float *xi = x;
>>> +   float *yi = y;
>>
>> These need to be const float32_t (in both xcorr_kernel_neon_float and
>> xcorr_kernel_neon_float_process1). They're currently causing a ton of
>> warning spew. float32_t appears to not be considered equivalent to
>> float, which means you'll also need casts here:
>>
>>> +   vst1q_f32(sum, SUMM);
>>
>> and here:
>>
>>> +   vst1_lane_f32(sum, SUMM_2[0], 0);
> Thanks, will do.
>> _______________________________________________
>> opus mailing list
>> opus at xiph.org
>> http://lists.xiph.org/mailman/listinfo/opus