[opus] [Aarch64 00/11] Patches to enable Aarch64

Fri Nov 13 10:51:50 PST 2015

Hi Jonathan,

I'm sorry to bring this up again, and I don't want to beat a dead horse, 
but I was very surprised by your benchmarks so I took a little closer look.

I think what's happening is that it's a little unfair to compare the 
ARM64 inline assembly to the C code, because looking at the C macros in 
"fixed_generic.h" for MULT16_32_Q16 and MULT16_32_Q15 you find they are 
implemented with two multiplies, two shifts, an AND and an ADD. It's not 
hard for me to believe that your inline assembly is faster than that 
mess. But on a 64-bit machine, there's no reason to go through all that 
when a simple 64-bit multiply will do. The SILK macros in "macros.h" 
already make this distinction but you don't see it in your benchmarks 
because the OPUS_FAST_INT64 macro only looks for 64-bit x86, and doesn't 
check for ARM64.
I don't think it's a coincidence that the macros you didn't replace only 
performed one multiply while the ones you did replace performed two.

I think it would be very interesting to try the benchmarks again after 
adding detection for __aarch64__ to OPUS_FAST_INT64, and replacing 
MULT16_32_Q15 with something like:
#define MULT16_32_Q15(a,b) 
((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),15))
and MULT16_32_Q16 with:
#define MULT16_32_Q16(a,b) 
((opus_val32)SHR((opus_int64)((opus_val16)(a))*(b),16))
There isn't an "opus_val64" or I would have used it (it would be defined 
as "float" for floating point).

I think the problem here is that "fixed_generic.h" needs to be updated 
for 64-bit machines the way "macros.h" was. x86 64-bit machines will 
perform just as poorly using the current macros as ARM64 does.

Also, I think you need to use the "fixed_generic.h" code as the 
reference for any replacements. I noticed that MULT16_32_Q15_arm64 went 
out of its way to clear the LSB of the result, which matches what the 
32-bit ARM inline assembly returns, but *doesn't* match what 
"fixed_generic.h" returns. As I understand it, losing the LSB in 32-bit 
ARM was a compromise between speed and accuracy, but there shouldn't be 
any reason to do that in ARM64.

Anyway I'll stop talking now. I'm not saying that the inline assembly 
isn't faster, but I don't think it's giving you as much of a gain over C 
as you think.

--John Ridges

On 11/13/2015 9:30 AM, Jonathan Lennox wrote:
>> On Nov 12, 2015, at 12:23 PM, John Ridges <jridges at masque.com> wrote:
>>
>> One other minor thing: I notice that in the inline assembly the result (rd) is constrained as an earlyclobber operand. What was the reason for that?
> Possibly an error?  Probably from modeling it on macros_armv4.h, which I guess does require earlyclobber.  I’ll ask the developer.