[Tremor] TI55xx implementation: stuck

Sat Oct 23 18:09:39 PDT 2004

On Sat, Oct 23, 2004 at 08:59:01PM +0200, Roland Wintersteller wrote:
> Let me briefly sum up (and give some additional) facts of this
> discussion:
> 
> - You have had problems on porting Tremor to C5x DSP, but now it works.

It always worked; it just wasn't very fast before writing a lot of
assembly, yes :-)

> - You were the first guy who achieved to make Ogg Vorbis run on the C5x
> DSP,

First that I know of, there may have been others.  To be pedantic, I
participated in the port effort; I was not the only engineer working
on it.

> but in the meanwhile several others (including Johannes Sandvall
> and me) got Ogg Vorbis work on C5x too.

Yes, and the other Neuros Audio engineers took over maintaining the
port for the Neuros after the initial port, and have apparently made
substantial improvements to it since I let go of it.

> - An ARM uC achieves better (faster) results than the C5x. We should not
> forget that the ARM is a 32bit micro processor.

The ARM is also running at a fraction of the clock speed.  There are
three things that make the ARM somewhat better suited to Tremor:

1) 32 bit math-- not actually as big a deal as it seems, but it makes
things easier on the compiler and developer.

2) Shifter on ALU inputs (not outputs).  This is more useful than
you'd think; it makes implementing true floating point relatively easy
and also eases fixed point math substantially.  It also comes in handy
during bit-slicing operations during packet decode.

3) ARM typically uses the on-core SRAM as a zero-wait cache for slower
(7-14 wait) off-board memory.  I expect some 5xxx can do this too, but
it's a less common arrangement if so.  The TI chips tend to have much
more SRAM on the core and designers tend to choose such a chip with
the intent of putting everything in on-core storage.  An ARM is
usually paired with a small (where small is 8-16 megabit) offboard
DRAM.
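
To make point 2 concrete, here is a rough sketch of the kind of
bit-slicing packet decode does.  This is not Tremor's actual bitpacker;
read_bits and its arguments are made up for illustration.  It assumes
LSb-first packing (as Vorbis uses), a field of 1 to 24 bits, and at
least four readable bytes from the starting byte.  The shift-then-mask
at the end is the sort of operation a shifter on the ALU input folds
into a single instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read `bits` (1..24) bits starting at absolute
 * bit offset `pos`, LSb-first.  Assumes buf has at least 4 readable
 * bytes starting at byte (pos >> 3). */
static uint32_t read_bits(const uint8_t *buf, size_t pos, unsigned bits)
{
    size_t   byte  = pos >> 3;           /* byte holding the first bit */
    unsigned shift = (unsigned)(pos & 7); /* bit offset inside it       */

    /* Assemble a 32-bit little-endian window over the field. */
    uint32_t window = (uint32_t)buf[byte]
                    | ((uint32_t)buf[byte + 1] << 8)
                    | ((uint32_t)buf[byte + 2] << 16)
                    | ((uint32_t)buf[byte + 3] << 24);

    /* Shift the field down to bit 0, then mask off the rest. */
    return (window >> shift) & ((1u << bits) - 1u);
}
```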

> As the most frequently
> used multiply operations I have seen in the sources are 32bit x 32bit =
> 64bit>>32 = 32bit

Yes, although really only 24x24->48 >> 24 bit depth is needed.
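
For reference, the multiply under discussion can be sketched in
portable C roughly as below.  This is an illustration, not a copy of
Tremor's source; the names mult32 and mult24 are made up here, and the
code assumes the compiler does an arithmetic right shift on negative
64-bit values (true of GCC, but implementation-defined in C).

```c
#include <stdint.h>

/* 32x32 -> 64, keep the high 32 bits.
 * Assumes arithmetic right shift of negative values. */
static int32_t mult32(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* The reduced-depth variant: operands with at most 24 significant
 * bits, 24x24 -> 48, shifted down by 24.  The caller is assumed to
 * keep the operands within 24 bits so the result fits in 32. */
static int32_t mult24(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 24);
}
```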

> and the C5x DSP only supports 16bit x 16bit =
> 32bit>>16 = 16bit, that means, that the ARM is expected to be 4 times
> (400%) faster (compared to C5x assembler code). You only saw a
> performance gain of 12%. 

Don't forget the statements in between needed as glue; you have
only... two? real registers on TI as well.  Memory is fuzzy here.  ARM
addressing is more flexible as well, don't discount that.  OTOH, most
ARMs don't do 1 cycle multiplies.  So, the comparison is complicated.

The TI chips can do some vectorized math, but I've not been able to
arrange it in a way to make use of the available vectorization.  A TI
guru could probably manage it though.

> - TI C5x compiler is not able to implement a 32bit x 32bit multiply in 4
> cycles which is able in assembler. 

Right.  It is also missing intrinsics for doing so.  Actually, you
only need 3 multiplies if you know you're throwing away low bits.
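
A sketch of the 3-multiply idea, written in C rather than C5x
assembly: split each operand into a signed high half and an unsigned
low half, and skip the low x low partial product, since it only
affects bits that get thrown away.  The function names are made up for
this example, and arithmetic right shift of negative values is
assumed.  Under that assumption the result is never above the exact
value and is at most 2 LSBs below it.

```c
#include <stdint.h>

/* Exact reference: full 64-bit product, keep the high 32 bits. */
static int32_t mult32_exact(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Approximate high word from three 16x16 partial products.
 * Assumes arithmetic right shift of negative values. */
static int32_t mult32_3mul(int32_t a, int32_t b)
{
    int32_t ah = a >> 16;      /* signed high halves   */
    int32_t bh = b >> 16;
    int32_t al = a & 0xFFFF;   /* unsigned low halves  */
    int32_t bl = b & 0xFFFF;

    int32_t hh = ah * bh;      /* weight 2^32: lands in the result    */
    int32_t hl = ah * bl;      /* weight 2^16: only its top bits count */
    int32_t lh = al * bh;      /* weight 2^16: likewise                */
    /* al * bl (weight 2^0) is skipped: that is the fourth multiply. */

    return hh + (hl >> 16) + (lh >> 16);
}
```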

> On the other side the ARM compiler is
> probably not able to do a 64bit by 64bit multiply without a call to
> stdlib, which is the same considering the bit depth. 

I can only comment on GCC, but GCC can in fact do a 32x32->64 in one
insn.  The ARM Consortium paid Cygnus to write a first-class GCC
backend for ARM and I can attest to it being pretty well tuned.  GCC's
way of doing inline assembly also adds substantial convenience.

> - TI C5x is currently the only 16bit CPU which is able to decode Ogg
> Vorbis. 

That I am uncertain of, but it could well be true.

Monty