[Tremor] Notes on Implementing Tremor on an ARM7TDMI CPU
Nicholas Vinen
hb at x256.org
Sat Dec 6 16:44:55 PST 2008
Segher Boessenkool wrote:
>> Why is this amazing? Well, because the flash on this CPU runs at a
>> maximum of 30MHz. That means at 55MHz core speed, it takes two cycles to
>> read 32 bits from flash. In thumb mode, instructions are 16 bit, so if
>> there are no branches, it can execute one thumb instruction per cycle.
>> In non-thumb mode, instructions are 32 bit, so it can only execute one
>> instruction every two cycles. So I would have thought thumb mode would
>> improve performance, due to the greater instruction throughput. I guess
>> not, though.
>
> ARM instructions do more work per instruction than Thumb insns; and
> they can access more registers more freely. Thumb is also harder for
> GCC to generate good code for. Thumb2 is better, but you don't have
> that.
I guessed that it was the fact it can do more work per instruction that
was making up for the slower code fetches. Your other comments are
probably part of it too.
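
The fetch arithmetic quoted above can be written out as a small model. This is only a sketch: the function names are made up for illustration, and it covers straight-line instruction fetch from flash only, ignoring branches, data accesses, multi-cycle instructions and how much work each instruction does.

```c
/* Core cycles per 32-bit flash read: the smallest whole number of cycles
 * that keeps the flash at or below its rated speed. */
static unsigned flash_read_cycles(double core_mhz, double flash_max_mhz)
{
    unsigned cycles = 1;
    while (core_mhz / cycles > flash_max_mhz)
        cycles++;
    return cycles;
}

/* Peak straight-line fetch rate in millions of instructions per second.
 * insn_bits is 16 for Thumb and 32 for ARM; the core cannot issue more
 * than one instruction per cycle, so the result is capped at core_mhz. */
static double peak_fetch_mips(double core_mhz, double flash_max_mhz,
                              unsigned insn_bits)
{
    double insns_per_read = 32.0 / insn_bits;
    double rate = core_mhz * insns_per_read
                  / flash_read_cycles(core_mhz, flash_max_mhz);
    return rate > core_mhz ? core_mhz : rate;
}
```

With the figures from the thread (55MHz core, 30MHz flash) the model gives 55 million Thumb fetches per second against 27.5 million ARM fetches, which is the 2:1 gap described above. It does not capture the per-instruction work that ends up making ARM code faster for the heavy routines.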
>>
>> I am at a loss to understand why Segher thinks a 40MHz ARM should be
>> fast enough to play back an Ogg Vorbis file.
> http://lists.xiph.org/pipermail/tremor/2003-January/000303.html
>
> That was an estimation, as should be obvious. Your 55MHz device with
> slow memory can almost do it, so it was a pretty good estimation if I
> say so myself :-)
Well, it turns out you are right. I wouldn't call it "slow memory", but
when I chose this processor, to be honest, I didn't notice that flash
didn't run at full speed. There are still a lot of things I don't know
about the processor even after having read the data sheet quite
extensively. Luckily, the critical functions are small enough that I can
fit them in RAM. I'm not using much more than 40MHz to decode the file
now, and with some tweaking will probably meet that figure :) It just
took a lot more work than I expected based on your comment, to get it to
this performance level. Still, it *was* a good guess.
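
One way to arrive at a figure like "40MHz to decode the file" is to time the decode of a clip of known length and scale the clock by the fraction of real time used. The helper below is a hypothetical sketch of that arithmetic. It assumes decode time scales linearly with core clock, which is only approximately true when flash wait states change with the clock.

```c
/* Estimate the core clock needed for real-time decoding.
 * core_mhz:       clock the measurement was taken at
 * decode_seconds: wall-clock time spent decoding the clip
 * audio_seconds:  playing time of the clip
 * Assumes decode time is inversely proportional to core clock. */
static double required_core_mhz(double core_mhz, double decode_seconds,
                                double audio_seconds)
{
    return core_mhz * decode_seconds / audio_seconds;
}
```

For example, with made-up numbers, a 10 second clip that takes 8 seconds to decode at 55MHz would need about 44MHz for real-time playback.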
>> I have several options to achieve the required level of performance. I
>> would love some feedback on the best options.
>>
>> 1) Compile more files without thumb. I will try this to see what
>> happens.
>
> This probably only helps for the computationally heavy routines.
Indeed. Compiling vorbisfile without thumb slows it down. However, things
like mdct and floor0/floor1 clearly go faster without thumb. My next set
of performance tweaks will probably involve finer granularity of
thumb/non-thumb compilation in order to determine the optimal
instruction set for each major function. I think functions which mostly
shuffle data around will benefit from thumb, but those that do
computations will not.
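
A sketch of per-function instruction set selection follows. It assumes a GCC new enough to accept `target("arm")` and `target("thumb")` function attributes on ARM (GCC 6 or later, as far as I know). Toolchains from the time of this thread only offered the choice per source file, via `-mthumb` or `-marm` together with `-mthumb-interwork`. The function names are hypothetical stand-ins for a data-shuffling routine and a computational one, and the macros expand to nothing on other compilers so the code still builds on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: target("arm") / target("thumb") need GCC >= 6 on ARM. */
#if defined(__GNUC__) && defined(__arm__) && (__GNUC__ >= 6)
#define ARM_CODE   __attribute__((target("arm")))
#define THUMB_CODE __attribute__((target("thumb")))
#else
#define ARM_CODE
#define THUMB_CODE
#endif

/* Mostly moves data around: a candidate for Thumb. */
THUMB_CODE void interleave_stereo(const int16_t *left, const int16_t *right,
                                  int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = left[i];
        out[2 * i + 1] = right[i];
    }
}

/* Multiply-heavy: a candidate for ARM mode. Sums the top 32 bits of each
 * 64-bit product, a common fixed-point pattern. */
ARM_CODE int32_t mac_hi32(const int32_t *a, const int32_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)(((int64_t)a[i] * b[i]) >> 32);
    return acc;
}
```

Which functions actually benefit from which mode still has to be found by experiment, as described above.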
>> 2) Use _LOW_PRECISION_. I don't want to lose audio quality but I need to
>> get this to run in real time!
>
> Try it out and see how bad it really is.
I meant _LOW_ACCURACY_ of course. I tried it and it got slower.
>> 3) Overclock the CPU and/or the flash.
>
> Bad plan. Is this an external flash though? You should be able to
> get faster flash than that 33MHz.
It's a system-on-a-chip. The DAC is external but that's about it. From
what I've read, people have gotten the flash up to 48MHz. I really truly
think Atmel are being conservative here. They have to take account of
poor power supplies, inadequate bypassing, high and low temperatures,
EMI-rich environments, etc. I know that the power supply is good, and
that temperatures are not going to reach extremes, etc. Still, as I said
in an earlier post, right now performance is good enough that
overclocking isn't necessary but I've no doubt that there's a
significant amount of room to do so with this processor should I run
into a performance wall.
Note that processors built on the same core are available at 80 or
90MHz. This one is tweaked for low power consumption, so it's
understandable to imagine that this is why they've given it a much more
conservative rating, but I honestly think it's capable of stable
operation at higher frequencies with a well designed board. Pushing the
core speed up 10% doesn't run the flash any faster than it's rated, and
I bet it'll run at 80MHz OK as long as the environment is benign.
>> 4) Load some tables into RAM. RAM is very tight, but it may be possible.
>> Can someone point to which table(s) would have the most benefit? The
>> sine table? I guess I'll try them and see.
>
> The FFT/MDCT twiddles and window are a good place to start.
You're absolutely right. As I said in my previous message, loading
mdct_backwards, decode_packed_entry_number and friends into RAM was a
massive improvement.
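
A minimal sketch of how a hot function and a table can be placed in RAM with GCC on a bare-metal ARM target. The section name `.ramfunc` is an assumption: it must match whatever the project's linker script and startup code copy from flash to SRAM. The function and table here are hypothetical placeholders, not Tremor code.

```c
#include <stdint.h>

/* Assumption: the linker script defines a ".ramfunc" section that the
 * startup code copies from flash to SRAM. long_call lets flash code reach
 * the function even if it is out of normal branch range. */
#if defined(__GNUC__) && defined(__arm__)
#define RAMFUNC __attribute__((section(".ramfunc"), long_call, noinline))
#else
#define RAMFUNC /* host build: ordinary function, so the sketch can be tested */
#endif

/* Deliberately not const: with a typical bare-metal linker script,
 * initialised non-const data goes in .data and is copied to SRAM at
 * startup, while const tables stay in flash. Values are placeholders. */
static int32_t demo_window[4] = {
    0x10000000, 0x30000000, 0x50000000, 0x70000000
};

/* Hypothetical hot loop: scale samples by a window, keeping the top
 * 32 bits of each 64-bit product. */
RAMFUNC void apply_window(int32_t *x, const int32_t *w, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = (int32_t)(((int64_t)x[i] * w[i]) >> 32);
}
```

Both the code and the table compete for the same scarce SRAM, so only the routines and tables touched in the inner loops are worth moving.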
>
>> 5) Implement the FFT replacement for the MDCT mentioned earlier. This
>> could be fruitful, but will not be trivial.
>> 6) More than one of the above in combination.
>> 7) Anything else?
>
> Measure. You cannot solve a performance problem (or any other problem)
> if you don't know what the problem _is_.
Luckily for me Robin Watts already did the measurements :)
We all solve problems differently. I measure when I can, but if it's too
difficult - and on an embedded system it isn't easy - I choose to
experiment, and pick the collective minds on this mailing list. I think
it paid off. I'm sure it's possible to do even better with more
information, but look, I spent 3-4 days just building the hardware and
software to flash the chip. I'm in no mood to spend more days building
tools or setting up emulators, even if that is in fact the sensible
thing to do.
>
> You really should get a development board with external RAM, so you can
> run bigger code during development than you would for deployment, and
> so your turnaround time is a few seconds instead of 20 minutes. You do
> need to experiment a lot to get the best performance (or, very good
> performance, anyway).
Yeah, that'd be extremely nice. I'll probably eventually design a
development board. It depends on how well this emulator (SkyEye) works.
I haven't worked out how to configure it yet but eventually I will.
There's the AT91SAM7S-EK evaluation kit from Atmel. If I were less of a
lazy cheapskate I would have bought one already.
Next time I design something with an ARM I'll probably use the LPC2387.
It likely uses more power than the AT91SAM7S but has 98K of built-in
SRAM, 512K flash, runs at 72MHz and only costs US$10 each. The biggest
benefit will be the additional RAM. However, I suspect the 64K I have
now will just do.
>
> Good luck,
>
>
> Segher
>
Thanks for all your help!
Nicholas