[Tremor] Tremor ARM performance issues

Nicholas Vinen hb at x256.org
Fri Dec 5 13:05:53 PST 2008


Nicolas Pitre wrote:
> On Fri, 5 Dec 2008, Nicholas Vinen wrote:
>
>   
>> However, the trick is that the ARM assembly code which is part of Tremor
>> isn't going to work in Thumb mode. So, I'd have to compile the files
>> mdct.c, floor0.c and floor1.c without Thumb, and the rest with it.
>>
>> So my basic question is this: is the performance benefit of the assembly
>> code worth the penalty of the extra cycle per instruction for any
>> function which uses it? I have a feeling it isn't.
>>     
>
> It certainly is -- please see the performance increase results I posted 
> on this list... hmmm... a couple years ago.  The list archive must 
> certainly have them.  I don't have the ARM manual handy at the moment, 
> but I suspect you'd need several more Thumb instructions to do the same 
> as those optimized ARM assembly sequences which would spend even more 
> cycles.
>   
OK. In that case I will make an effort to use the assembly code by
compiling the routines which use it in ARM mode and the rest in Thumb mode.
>> If I avoid using the
>> assembly, and thus can compile everything in thumb mode, this also
>> avoids some annoying library issues (the C library doesn't seem to
>> support being called from both Thumb and regular ARM mode).
>>     
>
> The section of code where the assembly optimization is doesn't need to 
> call any other library functions, does it?  So you may have only the 
> mdct code in ARM mode for example.
>   
That's what I would have thought, but the linker shows floor0.o pulling
in a few library routines:

/usr/libexec/gcc/arm-elf/ld: /usr/lib/gcc/arm-elf/4.1.2/thumb/libgcc.a(_divsi3.o)(__divsi3): warning: interworking not enabled.
  first occurrence: Tremor/floor0.o: arm call to thumb
/usr/libexec/gcc/arm-elf/ld: /usr/lib/gcc/arm-elf/4.1.2/thumb/libgcc.a(_udivsi3.o)(__udivsi3): warning: interworking not enabled.
  first occurrence: Tremor/floor0.o: arm call to thumb
/usr/libexec/gcc/arm-elf/ld: /usr/lib/gcc/arm-elf/4.1.2/../../../../arm-elf/lib/thumb/libc.a(lib_a-memset.o)(memset): warning: interworking not enabled.
  first occurrence: Tremor/floor0.o: arm call to thumb

Note that floor0.c is one of the three files that use assembly code
(mdct.c, floor0.c and floor1.c). However, it's possible that these calls
do not come from the functions that use assembly.

So, my plan is this: at first I will compile the whole thing in Thumb
mode, with no assembly. I haven't written the SD card driver yet, so I
will compile a short Ogg fragment (1-5 seconds long) into the code. The
code will decode this data, measure how long decoding took, and display
the time along with a simple checksum. This gives me a baseline
performance metric and confirms that the decoder works on the ARM CPU.
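
A minimal sketch of what such a baseline harness could look like. The
names here (decode_fn, run_benchmark, fake_decode) are hypothetical and
not part of Tremor; on the real target the callback would wrap the
decoder's read call and clock() would be replaced by a hardware timer.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical decoder hook: writes up to max samples into pcm and
   returns the number written, or 0 at end of stream. */
typedef size_t (*decode_fn)(int16_t *pcm, size_t max, void *ctx);

/* Fletcher-style checksum over 16-bit samples. It is order-sensitive,
   so reordered or altered output changes the result. */
static uint32_t checksum_update(uint32_t sum, const int16_t *pcm, size_t n)
{
    uint32_t a = sum & 0xffffu;
    uint32_t b = sum >> 16;
    size_t i;

    for (i = 0; i < n; i++) {
        a = (a + (uint16_t)pcm[i]) % 65535u;
        b = (b + a) % 65535u;
    }
    return (b << 16) | a;
}

struct bench_result {
    uint32_t checksum; /* checksum of all decoded samples */
    size_t   samples;  /* total samples decoded */
    clock_t  ticks;    /* elapsed clock() ticks for the whole decode */
};

/* Decode the whole stream once, timing it and checksumming the output. */
static struct bench_result run_benchmark(decode_fn decode, void *ctx)
{
    int16_t buf[256];
    struct bench_result r = { 0, 0, 0 };
    clock_t start = clock();
    size_t n;

    while ((n = decode(buf, 256, ctx)) > 0) {
        r.checksum = checksum_update(r.checksum, buf, n);
        r.samples += n;
    }
    r.ticks = clock() - start;
    return r;
}

/* Stand-in decoder for trying the harness on a host machine:
   emits a ramp until *ctx samples have been produced. */
static size_t fake_decode(int16_t *pcm, size_t max, void *ctx)
{
    size_t *remaining = ctx;
    size_t n = *remaining < max ? *remaining : max;
    size_t i;

    for (i = 0; i < n; i++)
        pcm[i] = (int16_t)i;
    *remaining -= n;
    return n;
}
```

Running the same harness after each build change and comparing both
the tick count and the checksum against the baseline shows whether the
change was faster and whether the output stayed identical.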

Then, I will take one function at a time, compile it in ARM mode and
manually insert the assembly somehow. After each change I will check
how long decoding takes and that the output doesn't change. I will
repeat this for each function until I have all of the assembly-enabled
functions working. Then I can report back and confirm whether this
technique works and provides a performance benefit. I suspect it will,
but I want to make sure.

I may also experiment with (a) overclocking the CPU a bit (say to 60MHz
from 55MHz) and (b) overclocking the flash a bit (say, running it at
40MHz with zero wait states) to see what happens. Ultimately I'm hoping
I can get better than realtime performance at 55MHz with Thumb + ARM +
assembly.
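
To put "better than realtime" into numbers, here is a small sketch. The
44100 Hz sample rate in the usage note is an assumption (it is not
stated above), and both function names are made up for illustration.

```c
#include <stdint.h>

/* Average CPU cycles available per output sample frame. */
static uint32_t cycles_per_frame(uint32_t cpu_hz, uint32_t sample_rate_hz)
{
    return cpu_hz / sample_rate_hz;
}

/* Realtime factor in percent: 100 means decoding exactly keeps up with
   playback, above 100 is faster than realtime. Returns 0 if decode_ms
   is 0 (no measurement). */
static uint32_t realtime_percent(uint32_t audio_ms, uint32_t decode_ms)
{
    if (decode_ms == 0)
        return 0;
    return (uint32_t)(((uint64_t)audio_ms * 100u) / decode_ms);
}
```

With these assumptions, cycles_per_frame(55000000, 44100) gives 1247
cycles per frame, and 60MHz raises that to 1360. A 5000 ms fragment
decoded in 4000 ms would give realtime_percent(5000, 4000) == 125.
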
> The other solution is to enable the LOW_PRECISION mode which doesn't 
> need any 64-bit computation and therefore could probably generate decent 
> Thumb assembly and also be faster and create a smaller binary due to the 
> reduced table sizes, but with a tiny artifact in the produced output.
>
>
> Nicolas
>   

OK, the whole point of me making this Ogg player in the first place was
that I wanted it to be high quality, so this is a last resort :)
However, if nothing else I can do makes it run fast enough then I will
try it. I'll have to do some quality comparisons I guess, to see how
much worse it is.

By the way - I noticed there is a sine table and some other tables
compiled into the code. I'm not sure if these are left in flash or
loaded into RAM, but I suspect they stay in flash, which would slow
down access. Do you think there would be any benefit from me loading
them into RAM? I guess it depends on how heavily they are used.
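
One way to try this is to copy a table into RAM once at startup and
read it through a pointer afterwards. This is only a sketch: the table
name, size and contents below are placeholders rather than Tremor's
actual tables, and where const data ends up depends on the linker
script.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_LEN 8 /* placeholder size; the real tables are larger */

/* const data normally goes to .rodata, which on a flash-based MCU
   typically stays in flash (assumption: depends on the linker script) */
static const int32_t table_flash[TABLE_LEN] = { 0, 1, 2, 3, 4, 5, 6, 7 };

/* non-const, zero-initialised storage goes to .bss, i.e. RAM */
static int32_t table_ram[TABLE_LEN];

/* Lookup code reads through this pointer, so switching between the
   flash and RAM copies is a single assignment. */
static const int32_t *table = table_flash;

/* Copy the table into RAM and redirect lookups to the RAM copy. */
static void tables_to_ram(void)
{
    memcpy(table_ram, table_flash, sizeof table_ram);
    table = table_ram;
}
```

Timing the decode with and without the tables_to_ram() call would show
whether flash wait states on table lookups matter enough to justify the
RAM cost.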



Thanks for the info! I have to do some work today but I'll see if I can
find enough time to do some of this experimentation and report back.



Nicholas
