[Tremor] Notes on Implementing Tremor on an ARM7TDMI CPU

Nicholas Vinen hb at x256.org
Sat Dec 6 01:03:51 PST 2008


Hi,

I thought I'd write a little post mentioning the issues I've gone
through today to get Tremor working on this ARM7 CPU (AT91SAM7S256) and
the performance data I have gathered, in case it helps someone trying to
do something similar. Let's call it "Nicholas' Adventures in
Tremor-ARM-land".

My first "adventure" was just sheer stupidity. The programming interface
for these chips (via JTAG) sucks. You write commands and data to the
same register. Somehow, probably due to a communication glitch caused by
its poor "handshaking" feature, I wrote data that was interpreted as a
command which set the flash lock bits on the top half of the flash. I
didn't realize it at first because the flash programming tool I wrote
didn't verify what it wrote, so any data above 128KB simply wasn't being
written. Once I did an "erase all", that problem was solved. I also
added a write/verify option to my flashing program.
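
For what it's worth, the verify pass boils down to something like the
sketch below. flash_read_word() is a made-up helper standing in for
whatever readback mechanism your flasher has (JTAG in my case); this is
not the actual code from my tool.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* hypothetical readback helper; stands in for the JTAG read in my flasher */
extern uint32_t flash_read_word(uint32_t addr);

/* re-read a just-programmed region and compare it with what we meant to write */
static int verify_region(uint32_t base, const uint32_t *expected, size_t words)
{
    size_t i;

    for (i = 0; i < words; i++) {
        uint32_t got = flash_read_word(base + 4 * i);
        if (got != expected[i]) {
            fprintf(stderr, "verify failed at 0x%08lx: wrote 0x%08lx, read 0x%08lx\n",
                    (unsigned long)(base + 4 * i),
                    (unsigned long)expected[i],
                    (unsigned long)got);
            return -1;   /* caller can re-erase and re-program this region */
        }
    }
    return 0;
}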

So, I was able to get Tremor onto the CPU, compiled in "thumb" mode.
However, it would freeze up during the call to ov_open_callbacks and
never return. Again, chalk this one up to stupidity. Having developed
for x86 most of my life, I had never had to think much about word
alignment. Unlike x86, though, ARM CPUs don't just take a penalty for a
misaligned access; it generates an exception, which was not being
handled. I had to modify my static memory allocation patch so that it
always returns 4-byte aligned blocks. That allowed the Tremor code to
run.
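
For anyone who hits the same thing, the idea behind the fix is roughly
the sketch below. The pool name and size are made up for illustration;
this is not the actual code from my patch.

#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE (48 * 1024)

/* backing store for all the decoder's allocations, forced to 4-byte alignment */
static uint8_t pool[POOL_SIZE] __attribute__((aligned(4)));
static size_t pool_used;

void *static_malloc(size_t size)
{
    void *p;

    /* round the request up to a multiple of 4 so the next block stays aligned */
    size = (size + 3u) & ~(size_t)3u;
    if (pool_used + size > POOL_SIZE)
        return NULL;                /* pool exhausted */
    p = &pool[pool_used];
    pool_used += size;
    return p;
}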

So, I was able to verify it works by comparing the checksum it generated
against the one I generated on my workstation. Unfortunately, the
performance figures were terrible. On this 55MHz ARM7 CPU with
single-cycle RAM access and two-cycle flash access, it took 8.5 seconds
(467 million cycles) to decode a two-second, 44.1kHz 16-bit stereo Ogg
Vorbis file (not including the time to open and close the file). That's
less than 25% of real time, which is a lot worse than I expected,
especially since I am compiling with -O3.
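
(For the record, the arithmetic behind those figures: 467 million cycles
/ 55 MHz is about 8.5 s of CPU time to produce 2 s of audio, i.e. 2 /
8.5 is roughly 23.5% of real time, so the decoder needs to get a bit
more than 4x faster just to break even.)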

So, I decided to enable the ARM assembly code. This led to my next
little adventure. To do this I have to compile at least the files
floor0.c, floor1.c and mdct.c without "thumb". GCC and binutils
theoretically support this when I enable "thumb interworking" -
i.e. the ability for thumb functions to call non-thumb functions and
vice versa. The problem, I discovered, is that interworking generates
stub functions, and by default they are linked before the rest of the
code (the .text section). Normally the startup code is the first thing
in memory, but with the stubs placed first the CPU obviously cannot
boot properly.

The solution was to make the following changes to the linker scripts,
startup code and objcopy execution.


In AT91SAM7S256-RAM.ld, I changed:

.text : { *cstartup.o (.text) }>DATA =0

to:

.startup : { *cstartup.o (.startup) }>DATA =0


In AT91SAM7S256-ROM.ld I changed:


.text : { *cstartup.o (.text) }>FLASH =0

to:

.startup : { *cstartup.o (.startup) }>FLASH =0


In Cstartup.S I changed:

.text

to:

.section .startup

In the makefile, I changed:

$(OBJCOPY) -O $(FORMAT) $< $@

(which creates the .bin file from the .elf file) to:

$(OBJCOPY) --set-section-flags .startup=alloc,load,readonly,code -O $(FORMAT) $< $@


This crazy set of changes forces the startup code to go before the stub
code. Now I can mix thumb and non-thumb code.

So, here is the amazing bit. Just compiling floor0.c, floor1.c and
mdct.c in non-thumb mode doubled performance. It took 4.25s to decode
the two seconds of data. Still nowhere near good enough, but a lot closer.

Why is this amazing? Well, because the flash on this CPU runs at a
maximum of 30MHz. That means at 55MHz core speed, it takes two cycles to
read 32 bits from flash. In thumb mode, instructions are 16 bits wide,
so if there are no branches, the CPU can execute one thumb instruction
per cycle. In non-thumb mode, instructions are 32 bits wide, so it can
only execute one instruction every two cycles when running from flash.
So I would have thought thumb mode would
improve performance, due to the greater instruction throughput. I guess
not, though.

I then enabled the assembly code. It helped, but not as much as I had
hoped. Decoding two seconds' worth of audio data now takes 2.75s, or
about 152 million cycles.

So I am now in the position where I need to work out how I can get this
decoding faster than real time. I chose this CPU because of this post I
read from the Tremor archives:

http://lists.xiph.org/pipermail/tremor/2003-January/000303.html

I am at a loss to understand why Segher thinks a 40MHz ARM should be
fast enough to play back an Ogg Vorbis file. Is an ARM4 faster than an
ARM7 clock for clock? (I wouldn't have thought so.) Perhaps he meant
with the _LOW_PRECISION_ macro defined.

I have several options to achieve the required level of performance. I
would love some feedback on the best options.

1) Compile more files without thumb. I will try this to see what happens.
2) Use _LOW_PRECISION_. I don't want to lose audio quality but I need to
get this to run in real time!
3) Overclock the CPU and/or the flash.
4) Load some tables into RAM. RAM is very tight, but it may be possible.
Can someone point to which table(s) would have the most benefit? The
sine table? I guess I'll try them and see (there's a rough sketch of the
idea just after this list).
5) Implement the FFT replacement for the MDCT mentioned earlier. This
could be fruitful, but will not be trivial.
6) More than one of the above in combination.
7) Anything else?
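
For option 4, what I have in mind is roughly the sketch below.
"sin_lookup" and its size are stand-ins for whichever Tremor table turns
out to matter most; this is not code from the library, just the general
shape of the idea.

#include <stdint.h>
#include <string.h>

#define LOOKUP_LEN 1024

extern const int32_t sin_lookup[LOOKUP_LEN];   /* the original table, in flash */

static int32_t sin_lookup_ram[LOOKUP_LEN];     /* RAM shadow copy, ~4KB */
const int32_t *sin_lookup_ptr = sin_lookup;    /* decode code indexes through this */

void tables_to_ram(void)
{
    /* one-time copy at boot; afterwards table lookups hit single-cycle RAM
       instead of two-cycle flash */
    memcpy(sin_lookup_ram, sin_lookup, sizeof(sin_lookup_ram));
    sin_lookup_ptr = sin_lookup_ram;
}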

I hope this is useful for someone. Any help I can get would be appreciated!


Thanks.


Nicholas


