[tremor] ARM ASM performance gains, EVC vs. GCC

Sat Oct 5 12:07:42 PDT 2002

This very helpful post...

http://forums.pocketmatrix.com/viewtopic.php?t=4063&highlight=asm+compiler

talks about how to integrate ASM files into EVC.   So maybe just using GCC
as an ASM generater and then adding it to an EVC project is the way to get
it optimized for the Pocket PC.

Werner Sharp
werner at sharp-software.com

----- Original Message -----
From: "Nicolas Pitre" <nico at cam.org>
To: "Werner Sharp" <werner at sharp-software.com>; "marc dukette"
<dukette at adelphia.net>
Cc: <tremor at xiph.org>
Sent: Saturday, October 05, 2002 12:52 PM
Subject: Re: [tremor] ARM ASM performance gains, EVC vs. GCC

<p>> On Fri, 4 Oct 2002, Werner Sharp wrote:
>
> > Hi Nicolas,
> >
> > mdct386.asm is with #ifdef __i386__
> > mdct.asm is with #ifdef 1
> >
> > the #ifdef 1 version gets a 6% performance boost in the one file I
tried.
>
> Okay.
>
> First the explanation for the performance loss with my latest changes can
be
> easily explained.  From mdct386.asm:
>
> ; 366  :     XPROD31( iX[4], iX[6], T[0], T[1], &oX[2], &oX[3] ); T+=step;
>
> [...]
>         bl        XPROD31  ; 000000B0
>
> In misc.h the non __i386__ case defines functions like XPROD31 that are
> clearly meant to be inlined.  Until someone knows how to convince EVC to
> actually inline those functions you won't be able to benefit from the
> improvements those functions provide over the macros version.
>
> Let's have a look at the macro version then.  From mdct.asm:
>
> |$L1680|
>
> ; 363  :
> ; 364  :   do{
> ; 365  :     oX-=4;
> ; 366  :     XPROD31( iX[4], iX[6], T[0], T[1], &oX[2], &oX[3] ); T+=step;
>
>         ldr       r5, [r2]
>         sub       r3, r3, #0x10  ; 0x10 = 16
>         ldr       r4, [r0, #0x10]  ; 0x10 = 16
>         mov       r10, r5
>         mov       r10, r10, asr #31
>         mov       r9, r4
>         mul       r10, r9, r10
>         mov       r11, r4, asr #31
>         mul       r9, r11, r5
>         add       r11, r10, r9
>         umull     r9, r10, r4, r5
>         mul       r9, r4, r5
>         add       r6, r11, r10
>         ldr       r4, [r2, #4]
>         ldr       r5, [r0, #0x18]  ; 0x18 = 24
>         mov       r11, r4, asr #31
>         str       r9, [sp, #0x68]  ; 0x68 = 104
>         mov       r9, r4
>         mov       r10, r5, asr #31
>         mul       r10, r9, r10
>         mul       r9, r11, r5
>         add       r11, r10, r9
>         umull     r9, r10, r4, r5
>         mul       r9, r4, r5
>         add       r4, r11, r10
>         add       r11, r4, r6
>         mov       r11, r11, lsl #1
>         str       r9, [sp, #0x68]  ; 0x68 = 104
>         str       r11, [r3, #8]
>         ldr       r5, [r2]
>         ldr       r4, [r0, #0x18]  ; 0x18 = 24
>         mov       r10, r5
>         mov       r10, r10, asr #31
>         mov       r9, r4
>         mul       r10, r9, r10
>         mov       r11, r4, asr #31
>         mul       r9, r11, r5
>         add       r11, r10, r9
>         umull     r9, r10, r4, r5
>         mul       r9, r4, r5
>         ldr       r4, [r2, #4]
>         add       r6, r11, r10
>         ldr       r5, [r0, #0x10]  ; 0x10 = 16
>         mov       r11, r4, asr #31
>         str       r9, [sp, #0x68]  ; 0x68 = 104
>         mov       r9, r4
>         mov       r10, r5, asr #31
>         mul       r10, r9, r10
>         mul       r9, r11, r5
>         add       r2, lr, r2
>         add       r11, r10, r9
>         umull     r9, r10, r4, r5
>         mul       r9, r4, r5
>         add       r4, r11, r10
>         sub       r11, r6, r4
>         mov       r11, r11, lsl #1
>         str       r9, [sp, #0x68]  ; 0x68 = 104
>         str       r11, [r3, #0xC]  ; 0xC = 12
>
> Whiew!  58 instructions for the above code!
>
> Now let's see what GCC produces for the same code with the _same_
parameters
> i.e. "#ifdef __i386__" changed to "#if 1" and _ARM_ASSEM_ undefined not to
> fetch GCC's inline assembly code.  We therefore obtain:
>
> .L166:
>         ldr     lr, [r7, #16]
>         ldr     r0, [sl, #0]
>         ldr     ip, [r7, #24]
>         ldr     r3, [sl, #4]
>         smull   r4, r5, lr, r0
>         smull   r1, r2, ip, r3
>         sub     r8, r8, #16
>         add     r3, r2, r5
>         mov     r3, r3, asl #1
>         str     r3, [r8, #8]
>         ldr     lr, [r7, #24]
>         ldr     r0, [sl, #0]
>         ldr     ip, [r7, #16]
>         ldr     r3, [sl, #4]
>         smull   r4, r5, lr, r0
>         smull   r1, r2, ip, r3
>         rsb     r3, r2, r5
>         mov     r3, r3, asl #1
>         str     r3, [r8, #12]
>
> GCC emits 18 instructions for the same code in the same conditions which
is
> an obvious performance improvement.
>
> But let's have a look at the code generated by EVC:
>
> First obvious optimisation miss:
>
>         mov       r10, r5
>         mov       r10, r10, asr #31
>
> Why EVC did not use a simple sincle instruction expression like this:
>
>         mov       r10, r5, asr #31
>
> This is a sign of a suboptimal implementation of the ARM architecture.
>
> Next, why is this whole sign fixup with all operands?  Why EVC isn't using
> the signed long multiply (smull) instruction instead of umull with
separate
> manual signeness fixups?  Go figure.
>
> In my opinion this only shows that EVC is implementing the ARM
architecture
> quite poorly and no performance blasting assembly code might be expected
> from it.  At least not before someone manages to 1) convince EVC to honour
> the "inline" function specifier and 2) make it work with some sort of
inline
> assembly like GCC does.  And even then, GCC is producing better code even
> without any inline assembly as shown above.
>
> Maybe you guys should try to find a way to have GCC produce binaries
> compatible with PocketPC?
>
> Just to give you a hint, here's GCC's output for this whole do {} while
loop
> but this time with all the optimisations I recently provided turned on:
>
> First the C code:
>
>   do{
>     oX-=4;
>     XPROD31( iX[4], iX[6], T[0], T[1], &oX[2], &oX[3] ); T+=step;
>     XPROD31( iX[0], iX[2], T[0], T[1], &oX[0], &oX[1] ); T+=step;
>     iX-=8;
>   }while(iX>=in+n4);
>
> GCC's output:
>
> .L94:
>         ldr     r0, [r5, #16]
>         ldr     r1, [r5, #24]
>         ldmia   r8, {r2, r3}
>         smull   r4, ip, r0, r2
>         smlal   r4, ip, r1, r3
>         rsb     r0, r0, #0
>         smull   r4, lr, r1, r2
>         smlal   r4, lr, r0, r3
>         mov     ip, ip, asl #1
>         sub     r6, r6, #16
>         str     ip, [r6, #8]
>         mov     lr, lr, asl #1
>         str     lr, [r6, #12]
>         ldr     r0, [r8, r9]!
>         ldr     r2, [r5, #8]
>         ldr     r1, [r5, #0]
>         ldr     r3, [r8, #4]
>         smull   r4, ip, r1, r0
>         smlal   r4, ip, r2, r3
>         rsb     r1, r1, #0
>         smull   r4, lr, r2, r0
>         smlal   r4, lr, r1, r3
>         mov     ip, ip, asl #1
>         str     ip, [r6, #0]
>         sub     r5, r5, #32
>         mov     lr, lr, asl #1
>         cmp     r5, r7
>         str     lr, [r6, #4]
>         add     r8, r8, r9
>         bcs     .L94
>
> Only 30 instructions!  With the output from EVC of only half that C code
> quoted earlier we can estimate that EVC will generate over 100
instructions
> for that same piece of C code.
>
> What do you think?
>
>
> Nicolas

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'tremor-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.