[Tremor] Notes on Implementing Tremor on an ARM7TDMI CPU

Nicholas Vinen hb at x256.org
Sat Dec 6 02:40:11 PST 2008


Andrew Lentvorski wrote:
> Nicholas Vinen wrote:
>
>> So I am now in the position where I need to work out how I can get this
>> decoding faster than real time. I chose this CPU because of this post I
>> read from the tremor archives:
>>
>> http://lists.xiph.org/pipermail/tremor/2003-January/000303.html
>>
>> I am at a loss to understand why Segher thinks a 40MHz ARM should be
>> fast enough to play back an Ogg Vorbis file. Is an ARM4 faster than an
>> ARM7 clock-per-clock? (I wouldn't have thought so).
>
> Note he said ARMv4.  That refers to the version of the Instruction Set
> Architecture--not chip.
>
> See here for the distinction:
> http://en.wikipedia.org/wiki/ARM_architecture
>
> Don't ARM9s have separate instruction and data buses?  That means
> that you can do instruction fetch and data fetch on the same cycle.  I
> don't believe that is true on the ARM7.
>
> That's going to kill, if true--especially if fetching from flash ties
> things up for 2 cycles and can't interleave with a RAM access. 
> Vorbis, IIRC, assumes pretty good lookup speed in order to access the
> basis vector tables.  You may also find that caching and calculating
> is better than looking stuff up.
>
> Also, I'm pretty sure that the ARM9 has bypassing that prevents cycle
> stalls while data is getting readied.
>
> Finally, I think the ARM9 handles multiplies quite a bit faster.
>
Ah, thanks. This is an ARM7TDMI which is listed as an ARMv4T. I'm not
sure what the T stands for, but it seems to have fewer MIPS per MHz.
However, this chip is also running quite a bit above 40MHz.

55MHz * 0.73 = 40.15MIPS. So it should be *roughly* equivalent to an
ARMv4 @ 40MHz. However I'm not managing to get real time performance
yet. The best I can get is around 75% of real time so far.
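
As a rough sanity check, the arithmetic above can be sketched in C. This
is only a back-of-the-envelope model: the function names are made up for
illustration, 0.73 MIPS/MHz is the figure quoted above rather than a
measurement, and the second function assumes decode speed scales
linearly with clock, which the memory effects discussed in this thread
may well break.

```c
/* Rough throughput estimate: clock (MHz) times MIPS-per-MHz.
 * The 0.73 figure used in the message is quoted, not measured. */
double estimated_mips(double clock_mhz, double mips_per_mhz)
{
    return clock_mhz * mips_per_mhz;
}

/* Clock that would be needed to reach real time, ASSUMING decode
 * speed scales linearly with clock (flash wait states and bus
 * contention can easily break this assumption). */
double clock_needed_mhz(double clock_mhz, double realtime_fraction)
{
    return clock_mhz / realtime_fraction;
}
```

With the numbers from this message, estimated_mips(55.0, 0.73) gives
about 40.15, and clock_needed_mhz(55.0, 0.75) gives roughly 73 MHz
under that linear assumption.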

The thing is, I tried putting a bunch of tables in RAM, which should
solve the flash issue - it can fetch simultaneously from RAM AFAIK, and
it's single cycle. But it got slower! I don't know what's going on...
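
One thing worth ruling out is whether the tables really ended up in RAM.
The sketch below shows one possible way to place a table in RAM with GCC
and to check its address afterwards. It is not taken from Tremor: the
section name ".fast_tables", the macro IN_RAM, the table contents and
the helper in_region() are all hypothetical, and the section only does
anything if the linker script maps it to SRAM and the startup code
copies it there from flash.

```c
#include <stdint.h>
#include <stddef.h>

/* HYPOTHETICAL section name: the linker script must map ".fast_tables"
 * to SRAM. Only applied on ARM GCC builds so the sketch also compiles
 * on a host machine. */
#if defined(__GNUC__) && defined(__arm__)
#define IN_RAM __attribute__((section(".fast_tables")))
#else
#define IN_RAM
#endif

/* Placeholder values, not a real Tremor table. Deliberately not const:
 * many embedded toolchains leave const data in flash. */
IN_RAM int32_t example_table[4] = { 0, 1, 2, 3 };

/* Returns 1 if the bytes [p, p + len) lie inside [base, base + size),
 * 0 otherwise. Written to avoid overflow in the address arithmetic. */
int in_region(const void *p, size_t len, uintptr_t base, size_t size)
{
    uintptr_t a = (uintptr_t)p;
    return a >= base && len <= size && (a - base) <= (size - len);
}
```

On the target, in_region() could be called with the SRAM base address
and size from the chip's datasheet; the linker map file is another place
to check where each table landed.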

>> 1) Compile more files without thumb. I will try this to see what
>> happens.
>
> Worth a shot.  If thumb doesn't help your memory bandwidth, better to
> take the extra registers.
>
I tried compiling all files without thumb, and it got slightly slower. I
will try files individually, but it takes about 20 minutes to upload the
flash each time so experimentation is slow.
>> 7) Anything else?
>
> It would be useful to actually *know* where it's spending all its
> time.  Can you run this under a simulator?
Yes, it would. I'd like to use a simulator; the main problem is that I'd
likely have to modify the code, since I doubt the simulator will simulate
the peripherals etc. I also don't know where to get a simulator. I assume
they exist...

I'd like to do some profiling on x86 as that would give me a clue but I
had trouble getting gcc -pg to output valid information.
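
If neither a simulator nor gcc -pg is available, one fallback is manual
instrumentation: read a free-running counter before and after each
suspect routine and accumulate the difference. The sketch below is a
minimal version of that idea, not gprof and not anything from Tremor.
The names prof_slot, prof_begin, prof_end and prof_percent are
hypothetical, and the tick values are assumed to come from whatever
32-bit timer the target (or clock() on a host) provides.

```c
#include <stdint.h>

/* One accumulator per code region being measured. */
typedef struct {
    const char *name;        /* label for reporting */
    uint32_t    total_ticks; /* ticks accumulated so far */
    uint32_t    calls;       /* number of completed begin/end pairs */
    uint32_t    start;       /* tick value at the last prof_begin */
} prof_slot;

/* Record the counter value on entry to the region. */
void prof_begin(prof_slot *s, uint32_t now)
{
    s->start = now;
}

/* Accumulate elapsed ticks on exit. Unsigned subtraction stays
 * correct if the counter wrapped once between begin and end. */
void prof_end(prof_slot *s, uint32_t now)
{
    s->total_ticks += now - s->start;
    s->calls++;
}

/* Integer share of all_ticks spent in this region, in percent;
 * returns 0 if all_ticks is 0. */
uint32_t prof_percent(const prof_slot *s, uint32_t all_ticks)
{
    if (all_ticks == 0u)
        return 0u;
    return (uint32_t)(((uint64_t)s->total_ticks * 100u) / all_ticks);
}
```

Wrapping a handful of the larger decode routines this way and printing
the percentages after decoding a few seconds of audio would give a
coarse picture of where the time goes, at the cost of some measurement
overhead.
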
>
> I thought that Skyeye supported the AT91 series.  If you could run it
> under that, you might get some better information.
>
>
>
> As a reference, I ported the Tremor library to the 66MHz ARM9 on the
> Nintendo DS.  Once I enabled the assembly language implementation, I
> didn't have any performance issues.

Interesting, it seems to also be an ARMv4T. I wonder why I'm having
issues then.
>
> Now, I don't know how much extra performance room I had left, but it
> was capable of decoding, streaming from Wifi, writing a raster to the
> display, and reading off of a flash device without me having to do
> anything strange to get performance.
>
> -a
>
Well, that sounds like plenty of performance. I will keep tweaking.
There are just so many variables and it takes ages to try them all. For
all I know, -O2 or -Os may give me a big boost over -O3.


Thanks. I will continue investigating.



Nicholas



