<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Nicolas Pitre wrote:

<blockquote cite="mid:alpine.LFD.2.00.0812050955400.14328@xanadu.home"

 type="cite">

  <pre wrap="">On Fri, 5 Dec 2008, Nicholas Vinen wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">However, the trick is that the ARM assembly code which is part of Tremor

isn't going to work in Thumb mode. So, I'd have to compile the files

mdct.c, floor0.c and floor1.c without Thumb, and the rest with it.

So my basic question is this: is the performance benefit of the assembly

code worth the penalty of the extra cycle per instruction for any

function which uses it? I have a feeling it isn't.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

It certainly is -- please see the performance increase results I posted 

on this list... hmmm... a couple years ago.  The list archive must 

certainly have them.  I don't have the ARM manual handy at the moment, 

but I suspect you'd need several more Thumb instructions to do the same 

as those optimized ARM assembly sequences which would spend even more 

cycles.

  </pre>

</blockquote>

OK. In that case I will make an effort to use the assembly code by

compiling the routines which use it in ARM mode and the rest in Thumb

mode.<br>

<blockquote cite="mid:alpine.LFD.2.00.0812050955400.14328@xanadu.home"

 type="cite">

  <blockquote type="cite">

    <pre wrap="">If I avoid using the

assembly, and thus can compile everything in thumb mode, this also

avoids some annoying library issues (the C library doesn't seem to

support being called from both Thumb and regular ARM mode).

    </pre>

  </blockquote>

  <pre wrap=""><!---->

The section of code where the assembly optimization is doesn't need to 

call any other library functions, does it?  So you may have only the 

mdct code in ARM mode for example.

  </pre>

</blockquote>

That's what I would have thought, but it uses a few library routines

such as:<br>

<br>

<pre>/usr/libexec/gcc/arm-elf/ld: /usr/lib/gcc/arm-elf/4.1.2/thumb/libgcc.a(_divsi3.o)(__divsi3): warning: interworking not enabled.

  first occurrence: Tremor/floor0.o: arm call to thumb

/usr/libexec/gcc/arm-elf/ld: /usr/lib/gcc/arm-elf/4.1.2/thumb/libgcc.a(_udivsi3.o)(__udivsi3): warning: interworking not enabled.

  first occurrence: Tremor/floor0.o: arm call to thumb

/usr/libexec/gcc/arm-elf/ld: /usr/lib/gcc/arm-elf/4.1.2/../../../../arm-elf/lib/thumb/libc.a(lib_a-memset.o)(memset): warning: interworking not enabled.

  first occurrence: Tremor/floor0.o: arm call to thumb

</pre>

Note that floor0.c is one of the three files that use assembly code
(mdct.c, floor0.c and floor1.c). However, it's possible that these
calls do not come from the functions that use assembly.<br>

<br>

So, my plan is this: I will compile the whole thing in Thumb mode, with
no assembly, at first. I haven't written the SD card driver yet, so I
will compile a short Ogg fragment (1-5 seconds long) into the code. The
code will decode this data, measure how long the decoding takes, and
display the time along with a simple checksum. This gives me a baseline
performance metric and confirms that the decoder works on the ARM CPU.<br>

<br>
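A minimal sketch of the baseline harness described above, assuming the
decoder can be wrapped in a callback that hands back 16-bit PCM in chunks.
The names (pcm_checksum, run_baseline, fake_decode) are hypothetical and
not part of Tremor; in the real player the callback would wrap Tremor's
ov_read(), and clock() would be replaced by a hardware timer on the
target.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Order-sensitive checksum over 16-bit PCM: rotate left by one, then add.
 * Carrying `sum` between calls lets the output be checked chunk by chunk. */
uint32_t pcm_checksum(uint32_t sum, const int16_t *pcm, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        sum = (sum << 1) | (sum >> 31);
        sum += (uint16_t)pcm[i];
    }
    return sum;
}

/* Hypothetical decoder hook: fills `out` with up to `max` samples and
 * returns the number written, or 0 at end of stream. */
typedef size_t (*decode_fn)(int16_t *out, size_t max, void *ctx);

struct bench_result {
    uint32_t checksum;  /* checksum of everything decoded */
    size_t   samples;   /* total samples decoded */
    clock_t  ticks;     /* elapsed time, including the checksum work */
};

/* Decode the whole stream once, timing it and checksumming the output. */
struct bench_result run_baseline(decode_fn decode, void *ctx)
{
    int16_t buf[256];
    struct bench_result r = {0, 0, 0};
    size_t n;
    clock_t start = clock();

    while ((n = decode(buf, 256, ctx)) > 0) {
        r.checksum = pcm_checksum(r.checksum, buf, n);
        r.samples += n;
    }
    r.ticks = clock() - start;
    return r;
}

/* Stand-in decoder for trying the harness on a PC: emits a ramp.
 * `ctx` points to a size_t holding the number of samples still to produce. */
size_t fake_decode(int16_t *out, size_t max, void *ctx)
{
    size_t *left = ctx;
    size_t n = *left < max ? *left : max;
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = (int16_t)(*left - i);
    *left -= n;
    return n;
}
```

Running the same harness before and after each change gives two numbers
to compare: the tick count should drop, and the checksum should stay the
same.<br>
<br>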

Then I will take one assembly-enabled function at a time, compile it in
ARM mode and manually insert the assembly routines somehow. After each
change I will check how long decoding takes and that the output doesn't
change. I will repeat this until all of the assembly-enabled functions
are working. Then I can report back and confirm whether this technique
works and provides a performance benefit. I suspect it will, but I want
to make sure.<br>

<br>

I may also experiment with (a) overclocking the CPU a bit (say, to 60MHz
from 55MHz) and (b) overclocking the flash a bit (say, running it at
40MHz with 0 wait states) to see what happens. Ultimately I'm hoping I
can get better than realtime performance at 55MHz with Thumb + ARM +
assembly.<br>

<blockquote cite="mid:alpine.LFD.2.00.0812050955400.14328@xanadu.home"

 type="cite">

  <pre wrap="">

The other solution is to enable the LOW_PRECISION mode which doesn't 

need any 64-bit computation and therefore could probably generate decent 

Thumb assembly and also be faster and create a smaller binary due to the 

reduced table sizes, but with a tiny artifact in the produced output.

Nicolas

  </pre>

</blockquote>

<br>

OK, the whole point of making this Ogg player in the first place was
that I wanted it to be high quality, so this is a last resort :)&nbsp;
However, if nothing else I can do makes it run fast enough, then I will
try it. I'll have to do some quality comparisons, I guess, to see how
much worse it is.<br>

<br>

By the way - I noticed there is a sine table and some other tables
compiled into the code. I'm not sure if these are left in flash or
loaded into RAM, but I suspect they stay in flash, which would slow
down access. Do you think there would be any benefit from loading them
into RAM? I guess it depends on how heavily they are used.<br>

<br>
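One way to try this, sketched under assumptions: copy the table into a
RAM buffer at startup and read it through a pointer. The table name,
length and values below are placeholders rather than Tremor's actual
tables, and whether const data really ends up in flash depends on the
linker script.<br>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_LEN 8

/* `const` data normally goes into a read-only section, which a linker
 * script for a flash-based part typically maps to flash.
 * Placeholder values, not a real sine table. */
static const int32_t sine_table_flash[TABLE_LEN] = {0, 3, 6, 9, 12, 15, 18, 21};

/* Non-const and zero-initialised, so it lands in .bss, i.e. RAM. */
static int32_t sine_table_ram[TABLE_LEN];

/* The lookup reads through this pointer, so the source can be switched. */
static const int32_t *sine_table = sine_table_flash;

/* Call once at startup, before decoding, to switch lookups over to RAM. */
void tables_to_ram(void)
{
    memcpy(sine_table_ram, sine_table_flash, sizeof sine_table_ram);
    sine_table = sine_table_ram;
}

int32_t sine_lookup(unsigned i)
{
    return sine_table[i % TABLE_LEN];
}
```

Timing the decoder with and without the tables_to_ram() call would show
whether the tables are used heavily enough for the copy to be worth the
RAM.<br>
<br>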

<br>

<br>

Thanks for the info! I have to do some work today but I'll see if I can

find enough time to do some of this experimentation and report back.<br>

<br>

<br>

<br>

Nicholas<br>

<br>

</body>

</html>