[tremor] [PATCH] 12% global performance gain on a StrongARM

Sun Sep 22 09:37:53 PDT 2002

I know most of you are primarily concerned with GCC; however, I just
pulled the latest source from CVS and compiled it with Microsoft's
Embedded Visual C, and the latest code is noticeably slower than the original
release of Tremor.  I'm guessing this comes from an interaction between the
eVC compiler and some of the new optimizations.  I can't give exact numbers,
since all my tests measure audio embedded in a video stream by counting the
number of frame drops.  However inaccurate this measurement is, it is very
consistent for a given stream, and with the latest set of patches the number
of frame drops has more than doubled compared with the original Tremor release.
This is with 128 kbps Vorbis audio.  I've run this on both an SA1100 and a
PXA250 with similar results.

If I get a chance I will try to narrow down where the performance hit is
coming from.

----- Original Message -----
From: "Nicolas Pitre" <nico at cam.org>
To: <tremor at xiph.org>
Sent: Thursday, September 19, 2002 12:52 AM
Subject: [tremor] [PATCH] 12% global performance gain on a StrongARM

> The attached patch provides a 12% performance gain on a StrongARM SA1110
> over the current code in CVS.  This is mostly C code shuffling to help
> GCC produce nearly perfect assembly on ARM.  A hand-optimized assembly
> version of mdct.c could probably do even better, but I'll leave that task to
> others (Dilb?).  At least this provides the best compiler-generated
> reference to start from, and it improves performance on all
> architectures in general.
>
> So, in detail, this patch:
>
>  - Includes my previous patch with interpolation code for correct accuracy
>    with all block sizes.
>  - Interleaves sin and cos values in the lookup table to reduce register
>    pressure, since only one pointer is required to walk the table instead of
>    two.  This also improves cache locality.
>  - Splits the lookup table into two tables, since half of it (every other
>    value) is only used in a separate section of the code and only with large
>    block sizes.  The table used in the common case is therefore half the
>    size, giving even better cache usage.
>  - Abstracts all cross products throughout the code so they can be easily
>    optimized.  First, this prevents redundant register reloads on ARM due to
>    the implicit memory access ordering; second, it provided the
>    opportunity to hook in some inline assembly to perform the actual operation.
>  - Fixes the layout of the existing assembly in asm_arm.h to match GCC's
>    output (more enjoyable to read when inspecting the final assembly), plus
>    some constraint correctness issues.
>  - Adds a memory barrier macro to force the compiler not to cache values
>    in registers or on the stack in some cases.
>  - Reorders some code for better ARM assembly generation by
>    the compiler.
>
> Enjoy!
>
>
> Nicolas
>
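The interleaved sin/cos table and the abstracted cross product described in the patch notes can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the code from the patch: the helper names `mult31`, `xprod31` and `rotate_pairs` are hypothetical, values are assumed to be Q31 fixed point, and the table is assumed to store {cos, sin} pairs back to back so that a single pointer walks it.

```c
#include <stdint.h>

/* Hypothetical Q31 multiply: 32x32 -> 64-bit product, keep bits 62..31.
   Assumes the compiler implements >> on negative values as an arithmetic
   shift (true for GCC). Not Tremor's actual helper. */
static inline int32_t mult31(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 31);
}

/* Hypothetical abstracted cross product (all values Q31):
 *   *x = a*t + b*v
 *   *y = b*t - a*v
 * With (t, v) = (cos w, sin w) this rotates (a, b) by -w. Because every
 * caller goes through this one function, its body could later be replaced
 * by inline assembly without touching the callers. Assumes the inputs leave
 * enough headroom that the sums do not overflow. */
static inline void xprod31(int32_t a, int32_t b, int32_t t, int32_t v,
                           int32_t *x, int32_t *y)
{
    *x = mult31(a, t) + mult31(b, v);
    *y = mult31(b, t) - mult31(a, v);
}

/* Applies xprod31 to n (re, im) pairs stored in buf, in place.
 * trig is an interleaved table: trig[0] = cos, trig[1] = sin, trig[2] = cos,
 * and so on, so a single pointer walks it instead of separate cos[] and
 * sin[] pointers, and each cos/sin pair shares a cache line. */
static void rotate_pairs(int32_t *buf, const int32_t *trig, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t re = buf[2 * i];
        int32_t im = buf[2 * i + 1];
        xprod31(re, im, trig[0], trig[1], &buf[2 * i], &buf[2 * i + 1]);
        trig += 2;  /* advance to the next {cos, sin} pair */
    }
}
```

For example, with `buf = {1000, 200, 4000, -2000}` and `trig = {1 << 30, 1 << 30, 1 << 30, 0}` (0.5 and 0.0 in Q31), `rotate_pairs(buf, trig, 2)` leaves `buf` as `{600, -400, 2000, -1000}`. Note that `re` and `im` are read into locals before the call, so writing through the output pointers cannot clobber an input that is still needed.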

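A compiler memory barrier of the kind the patch notes mention is commonly written for GCC as an empty `asm` statement with a `"memory"` clobber. The sketch below assumes a GCC-compatible compiler; the macro name `COMPILER_BARRIER` and the function `sum_pair_diffs` are hypothetical and are not taken from the patch or from Tremor's headers.

```c
#include <stdint.h>

/* Hypothetical compiler-only memory barrier. With GCC-compatible compilers
 * the empty asm with a "memory" clobber tells the compiler that memory may
 * have been modified, so it may not move memory accesses across this point
 * or reuse values it loaded from memory before it. It emits no instruction
 * and gives no ordering guarantee between CPUs. Compilers that do not
 * define __GNUC__ get a no-op in this sketch. */
#if defined(__GNUC__)
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#else
#define COMPILER_BARRIER() do { } while (0)
#endif

/* Sums p[2i] - p[2i+1] over n interleaved pairs. The barrier at the end of
 * each iteration stops the compiler from hoisting the next iteration's loads
 * above it, which limits how many table values it keeps live in registers
 * (or spills to the stack) at once. */
static int32_t sum_pair_diffs(const int32_t *p, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += p[2 * i] - p[2 * i + 1];
        COMPILER_BARRIER();
    }
    return acc;
}
```

For instance, `sum_pair_diffs` over `{5, 1, 7, 2, 10, 4}` with `n = 3` returns 15. Whether such a barrier actually helps is compiler- and architecture-specific: it trades possibly redundant loads for lower register pressure, so the effect is best confirmed by inspecting the generated assembly.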