[Speex-dev] Major internal changes, TI DSP build change

Sat Apr 22 19:50:49 PDT 2006

Jean-Marc,

>> >I fixed it in svn. Could you check that?
>>
>> Now all platforms match again.  Note that the measured SNR for this test
>> sample is lower than with the broken code (10.87 vs 11.10), but of course
>> this is no way to judge the real quality.
>
> SNR, especially on a single sample, can be very misleading. Yet, could
> you just check that the DSP results match what you get on a PC?

I do not have a build environment for a PC.  For my simulations I have been
using the 6-second test file male.wav from the Speex site, in case someone
else wants to run the same audio through the encoder and decoder at 8 kbps,
complexity 1.  I might be able to get a coworker to do this, but not any
time soon.

>> >Does the C55 have a 32x16 multiplier or do you mean it handles my
>> >emulation of it well?
>>
>> It has two ALUs with 17x17 bit MACs, and it has an instruction that does
>> this:
>> ACy = M40(rnd((ACx >> #16) + (uns(Xmem) * uns(Ymem))))
>>
>> I never quite understood this, so I went off and looked at the manuals.
>> It can multiply the low half in one cycle, then shift and add it to the
>> high half in a second cycle.  And, in a tight loop the parallel ALUs
>> would allow one 32x16 multiply per cycle.
>
> Just one thing I'd like to understand. Did you do some tricks and/or
> assembly to implement the MULT16_32_Q* routines with these instructions
> or does the compiler figure them out by itself?

No, I have done no assembly work on any of these DSPs.  It has been a few 
years since I did assembly work on any DSP, and it does not look like I will 
need to for my applications.  I just found the above instruction in the 
instruction set reference manual, and it seems perfect for 16x32 multiplies. 
When I look at the assembler output for filter.c, I do not see this 
instruction used, probably because there is always some shift in the result 
(like MULT16_32_Q15, which takes 6 instructions to implement: two 
multiplies, two adds, a shift, and a store).  So, never mind.
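
For anyone following along, here is a minimal sketch of how a Q15 16x32
multiply can be built from two 16x16 multiplies, in the spirit of the generic
macros in fixed_generic.h.  The function names are hypothetical (they are not
the Speex macros), and the sketch assumes that >> on a negative value is an
arithmetic shift and that b lies in [-2^30, 2^30) so its upper part fits in
16 bits.  The shift applied to the low partial product is what keeps a plain
multiply-accumulate from covering the whole operation.

```c
#include <stdint.h>

/* Hypothetical helper names for illustration; not the Speex macros. */

/* 16x16 -> 32 multiply, the primitive a 16-bit DSP multiplier provides. */
int32_t mult16_16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/*
 * (a * b) >> 15 built from two 16x16 multiplies.
 * b is split at bit 15: b = hi * 32768 + lo, with 0 <= lo < 32768.
 * Assumption: >> on a negative value is an arithmetic shift, and
 * b is in [-2^30, 2^30) so that hi fits in 16 bits.
 */
int32_t mult16_32_q15(int16_t a, int32_t b)
{
    int16_t hi = (int16_t)(b >> 15);     /* signed upper part   */
    int16_t lo = (int16_t)(b & 0x7fff);  /* unsigned lower 15 bits */
    return mult16_16(a, hi) + (mult16_16(a, lo) >> 15);
}

/* Reference using a 64-bit intermediate, for comparison only. */
int32_t mult16_32_q15_ref(int16_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 15);
}
```

With a = 16384 (0.5 in Q15) and b = 100000, both functions return 50000.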

>> The C54x cannot do this, and uses library calls for 32x16 multiplies.
>
> Why is that? By default all the 32x16 multiplies are computed using only
> 16x16 multiplies (see fixed_generic.h).

Once again, I spoke too soon.  I saw the library calls when I first tested 
the C54x last year, but I do not see them now.  I am using a later version 
of the TI compiler, and there could be some different compile options.

>> The changes that you have made since 1.1.8 are most dramatic for the
>> 54x, which dropped from 184 (unusable in real time, the fastest parts
>> are 160 MHz) to 79 MIPs.  The C55x dropped from 41.5 to 29.4 MIPs (mixed
>> 16/32 bit capability), and the C6x dropped slightly from 36 to 34.5 MIPs
>> (32bit machine).
>
> Glad it makes such a difference. I'm just surprised that the C6x
> complexity is that high.

There was a post from Jerry Trantow on 4-Feb saying that he had cut the C6x
MIPs about in half with some assembly optimization (do you know if he
planned to submit it?).  Because this is a very parallel machine, it is not an 
assembly language for the faint of heart.

- Jim