<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Timmy Brolin wrote:
<blockquote cite="mid:493A9FEB.4030507@home.se" type="cite">
<blockquote type="cite">
<pre wrap="">According to WikiPedia, the Nintendo DS has the same ARM7TDMI core as
what I am using. I just overclocked mine to very close to 66MHz and so
far it seems to be working fine. In fact, from what I can tell, aside
from your CPU likely reading code out of RAM rather than flash, they
seem identical. No cache, same core, etc. So, it may be that executing
out of RAM makes all the difference.
</pre>
</blockquote>
<pre wrap="">Note that the Nintendo DS has two processors. One ARM9 and one ARM7TDMI.
</pre>
</blockquote>
Ah, OK, interesting.<br>
<blockquote cite="mid:493A9FEB.4030507@home.se" type="cite">
<blockquote type="cite">
<pre wrap="">So far this CPU seems stable @ 66MHz (w/ flash @ 33MHz). I may be able
to push it more. I'd rather not rely on that if possible, though.
I really think that if it's 75% real time, I can get it to be faster
than real time, but without a better idea of which routines are the most
critical I agree it's going to be tough. I'll try to get profiling
working again.
</pre>
</blockquote>
<pre wrap="">Since your flash memory is single cycle for thumb, and double cycle for
ARM, I would suggest you put your ARM code in RAM and keep the thumb
code in flash. That is the typical arrangement on the Nintendo gameboy
advance which has a 16bit flash, and 32bit single cycle RAM.
The ARM assembly optimized routines should get a nice performance boost
if you move them from flash to single cycle RAM.
Timmy Brolin
</pre>
</blockquote>
You are *spot on* with this comment. The final tweak I made to the
code, which gave a massive performance boost, was to put the following
functions in RAM by moving them into the .data section:<br>
<br>
decode_packed_entry_number<br>
decode_map<br>
vorbis_book_decodevv_add<br>
_checksum<br>
mdct_backwards<br>
mdct_shift_right<br>
mdct_unroll_*<br>
<br>
This cost me about 4 KB of RAM, which is acceptable given the 30-40%
reduction in cycles it buys.<br>
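<br>
For C code, a minimal sketch of placing a routine in RAM might look like
the following. It assumes GCC and startup code that copies .data from
flash to RAM before main() runs. The RAMFUNC macro and the add_block
function are hypothetical names used only for illustration; they are not
part of Tremor or Tremolo. For routines written in assembly, the
equivalent would be a section directive in the assembly source.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC targeting ARM, with startup code that copies .data
   from flash to RAM. On any other target the macro expands to nothing,
   so the code still builds and behaves the same, just without the move. */
#if defined(__GNUC__) && defined(__arm__)
#define RAMFUNC __attribute__((section(".data")))
#else
#define RAMFUNC
#endif

/* Hypothetical hot routine: add each source sample into the destination. */
RAMFUNC void add_block(int32_t *dst, const int32_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] += src[i];
}
```

Depending on the memory map, calls from flash into RAM may be out of
direct branch range, in which case the toolchain may need a long-call
option or attribute; that depends on the part and the linker script.<br>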
<br>
How did I know to move these? Well, I'm using Tremolo now (i.e. a
version of Tremor with more ARM assembly) and the author - Robin Watts -
very nicely provided a profile of the code. It shows that these
functions combined account for something like 75% of CPU time - at least
in this version of the code running on an ARM processor.<br>
<br>
The decoder now uses 87.5% of the CPU to decode a 44.1 kHz, 16-bit
stereo file with the processor running at stock speed :)<br>
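<br>
To put that figure in terms of headroom, here is a small sketch of the
arithmetic. The function names are made up for illustration, and the
load is passed in tenths of a percent to avoid floating point.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Milliseconds of CPU time left over per second of audio played,
   given the decoder load in tenths of a percent (875 means 87.5%). */
uint32_t spare_ms_per_audio_second(uint32_t load_permille)
{
    assert(load_permille <= 1000);
    return 1000u - load_permille;
}

/* Decode speed relative to real time, in thousandths, rounded down
   (1000 means exactly real time). */
uint32_t realtime_ratio_permille(uint32_t load_permille)
{
    assert(load_permille > 0 && load_permille <= 1000);
    return (1000u * 1000u) / load_permille;
}
```

At 87.5% load this works out to 125 ms of spare CPU time per second of
audio, or decoding at roughly 1.14 times real time.<br>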
<br>
I'd like to improve on this a bit, to increase the chances of being
able to do "seamless playback", which requires that I can close a file,
open the next one and decode its first packet before the audio buffer
runs out. I don't feel bad about overclocking the CPU a little, since I
built the power supply and know that it will provide better than the
minimum voltage requirements, but the more cycles I can shave off by
tuning the code, the less I have to push the clock.<br>
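<br>
The time available for switching files is set by how much audio the
buffer holds. A rough sketch of that calculation follows; the function
names and any buffer size passed in are hypothetical, since the actual
buffer size is not stated here.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Playback time, in whole milliseconds (rounded down), held by a
   buffer of buffer_bytes bytes of interleaved PCM. */
uint32_t buffer_duration_ms(uint32_t buffer_bytes, uint32_t sample_rate_hz,
                            uint32_t channels, uint32_t bytes_per_sample)
{
    uint32_t bytes_per_second = sample_rate_hz * channels * bytes_per_sample;
    assert(bytes_per_second > 0);
    return (uint32_t)(((uint64_t)buffer_bytes * 1000u) / bytes_per_second);
}

/* Nonzero if closing one file, opening the next and decoding its first
   packet (switch_ms in total) finishes before a full buffer drains. */
int switch_fits(uint32_t switch_ms, uint32_t buffer_ms)
{
    return switch_ms < buffer_ms;
}
```

For 44.1 kHz 16-bit stereo the data rate is 176400 bytes per second, so
an 8192-byte buffer, for example, would hold about 46 ms of audio.<br>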
<br>
<br>
Thanks to all for your help. I still have a lot of tweaking to do, but
when it's all working nicely I'll upload a tar.gz somewhere and post a
message here so that anybody who wants to run Tremor on this processor
can get a head start. It's quite a nice little chip: it has just enough
power and memory to do Ogg Vorbis decoding at common bit rates and
sample rates, combined with low power usage and very useful peripherals,
e.g. SPI w/DMA for interfacing with MMC/SD cards and I2S w/DMA for
interfacing with a DAC.<br>
<br>
<br>
Nicholas<br>
<br>
</body>
</html>