[Speex-dev] Resampler saturation, blackfin performance

Sun Jun 14 16:16:18 PDT 2009

Stephane Lesage wrote:
>> -----Original Message-----
>> From: Jean-Marc Valin [mailto:jean-marc.valin at usherbrooke.ca]
>> Sent: Sunday, 14 June 2009 20:46
>> To: Stephane Lesage
>> Cc: speex-dev at xiph.org
>> Subject: Re: [Speex-dev] Resampler saturation
>>
>> Just to make sure I understand, the two patches you sent are 
>> two different ways to fix the problem, with the only 
>> difference being that resample.patch converts the "unrolled 
>> by four" loop into a plain one that's easier on DSPs, right?
> 
> Yes exactly, plus a little explanation in comments.
> I really have no idea of the performance difference on x86. But I think gcc/msvc can unroll.
> Up to you. Anyway I can OVERRIDE_INNER_PRODUCT_SINGLE.

OK, I guess I'll apply resample.patch considering that we already have
an SSE version (the split in four was for SSE anyway).
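
For illustration only, here is a minimal sketch of the two loop shapes being
discussed. It is not the code from resample.patch: the function names and the
plain int32_t accumulator are assumptions made for the example, and it says
nothing about how the patch handles saturation.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the resampler's inner product. */

/* Plain loop: one multiply-accumulate per iteration, which a DSP
   compiler can usually map onto a hardware loop. */
static int32_t inner_product_plain(const int16_t *a, const int16_t *b,
                                   unsigned len)
{
    int32_t sum = 0;
    unsigned i;
    for (i = 0; i < len; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* Unrolled by four: len is assumed to be a multiple of 4. */
static int32_t inner_product_unrolled4(const int16_t *a, const int16_t *b,
                                       unsigned len)
{
    int32_t sum = 0;
    unsigned i;
    for (i = 0; i < len; i += 4) {
        int32_t part = 0;
        part += (int32_t)a[i]     * b[i];
        part += (int32_t)a[i + 1] * b[i + 1];
        part += (int32_t)a[i + 2] * b[i + 2];
        part += (int32_t)a[i + 3] * b[i + 3];
        sum += part;
    }
    return sum;
}
```

Both functions return the same value as long as nothing overflows, so the
choice between them is purely about what the target compiler schedules best.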

> Talking about performance (still using generic version with VDSP compiler):

You'll likely see a difference just by optimising the MULT16_32_Q15() macro.
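
As a rough sketch of what that macro has to compute (a 16-bit value times a
32-bit value, with the result shifted right by 15), the example below compares
a 64-bit reference against a split form that only needs 16x16 multiplies. The
function names are made up for the example, and it assumes that right-shifting
a negative signed integer is an arithmetic shift, which holds on the usual
compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Reference: full 64-bit product, then drop 15 fractional bits. */
static int32_t mult16_32_q15_ref(int16_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 15);
}

/* Split form: b = hi * 65536 + lo, so
   (a * b) >> 15 == a * hi * 2 + ((a * lo) >> 15).
   Each product is a 16x16 multiply that fits in 32 bits.
   The extreme case a == -32768 with hi == -32768 overflows and is
   not handled here. */
static int32_t mult16_32_q15_split(int16_t a, int32_t b)
{
    int32_t hi = b >> 16;      /* signed upper 16 bits */
    int32_t lo = b & 0xffff;   /* unsigned lower 16 bits */
    return (int32_t)a * hi * 2 + (((int32_t)a * lo) >> 15);
}
```

On a DSP the interesting part is mapping the two 16x16 products onto the
native multiplier instead of going through a generic 32-bit multiply.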

> 1. I got a pretty good boost by using a scratch buffer in SRAM.

Normally, all the data should end up in L1 cache, so it's surprising
that you see a difference with using SRAM. Are you sure your cache isn't
configured as write-through?

> 2. Wideband Encode+Decode takes 79.1 + 7.2 MIPS on my BF536 400/133 MHz
> 3. Profiler says:
> vq_nbest                  33.05%
> vq_nbest_sign             11.12%

You should be able to get a big boost in performance just by optimising
the N=1 case for vq_nbest() and vq_nbest_sign().
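
To show the idea, here is a minimal sketch of an N=1 search. It is not the
actual Speex function: the name vq_best1_sketch and the simplified signature
are invented for the example, and the distance (half the codeword energy minus
the correlation with the input) is an assumption about what the generic search
minimises. With N=1 there is no sorted list of candidates to maintain, only a
running minimum.

```c
#include <stdint.h>

/* Hypothetical N=1 codebook search.
   codebook holds `entries` vectors of `len` values each, stored
   back to back; E[i] is the precomputed energy of entry i. */
static void vq_best1_sketch(const int16_t *in, const int16_t *codebook,
                            int len, int entries, const int32_t *E,
                            int *best_index, int32_t *best_dist)
{
    int i, j;
    for (i = 0; i < entries; i++) {
        int32_t corr = 0;
        int32_t dist;
        for (j = 0; j < len; j++)
            corr += (int32_t)in[j] * codebook[i * len + j];
        dist = (E[i] >> 1) - corr;
        /* Keep only the single best entry: no insertion sort needed. */
        if (i == 0 || dist < *best_dist) {
            *best_dist = dist;
            *best_index = i;
        }
    }
}
```

The inner loop is a plain inner product, so it can reuse whatever optimised
inner-product routine the platform already has.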

> filter_mem16               4.14%

If you look at the Blackfin-optimised version, it actually uses a
different algorithm (that does 2 MACs/cycle) for this one (assuming you
place the arrays in two banks, which I don't do yet).

> inner_prod                 4.07%

Again, the Blackfin-optimised version does it with 2 MACs/cycle.

> iir_mem16                  2.75%
> qmf_synth                  2.32%
> lsp_to_lpc                 2.32%
> open_loop_nbest_pitch      1.41%
> compute_impulse_response   1.37%
> qmf_decomp                 1.28%
> lpc_to_lsp                 1.26%
> fir_mem16                  1.16%
> speex_bits_pack            1.07%
> speex_bits_unpack_unsigned 0.86%
> compute_rms16              0.79%
> 
> 4. I'm using the echo-canceller + preprocessor,
> I'd really like to improve performance here:
> - I would like to use ADI's FFT, but it's limited to powers of 2.
> Is it safe to enable "Round ps_size down to the nearest power of two" in the preproc?

It should be (unless I broke it!). Otherwise, nothing prevents you from
doing all that processing on power-of-two frames and then doing a bit of
buffering for the codec.
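
For reference, rounding a size down to a power of two can be sketched as
below. This is an illustration rather than the preprocessor's code, and the
helper name round_down_pow2 is made up; it assumes n is at least 1.

```c
/* Largest power of two that is less than or equal to n (n >= 1). */
static int round_down_pow2(int n)
{
    int p = 1;
    while (p <= n / 2)
        p *= 2;
    return p;
}
```

For example, a size of 160 rounds down to 128, and a size that is already a
power of two is returned unchanged.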

> Can we do the same trick with the echo-canceller for window_size?

If you want to use the echo canceller with a power-of-two FFT, the frame
size really needs to be a power of two.

> - Are there buffers that could be placed in scratch memory?
> (I don't see any speex_scratch_alloc in there)

I don't understand your question.

	Jean-Marc