[Speex-dev] Resampler saturation, blackfin performance
jean-marc.valin at usherbrooke.ca
Sun Jun 14 16:16:18 PDT 2009
Stephane Lesage wrote:
>> -----Original Message-----
>> From: Jean-Marc Valin [mailto:jean-marc.valin at usherbrooke.ca]
>> Sent: Sunday, 14 June 2009 20:46
>> To: Stephane Lesage
>> Cc: speex-dev at xiph.org
>> Subject: Re: [Speex-dev] Resampler saturation
>> Just to make sure I understand, the two patches you sent are
>> two different ways to fix the problem, with the only
>> difference being that resample.patch converts the "unrolled
>> by four" loop into a plain one that's easier on DSPs, right?
> Yes exactly, plus a little explanation in comments.
> I really have no idea what the performance difference is on x86, but I think gcc/msvc can unroll the loop themselves.
> Up to you. Either way, I can define OVERRIDE_INNER_PRODUCT_SINGLE.
OK, I guess I'll apply resample.patch considering that we already have
an SSE version (the split in four was for SSE anyway).
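
To illustrate the difference between the two loop shapes, here is a minimal sketch. The function names are made up for this example and this is not the code from resample.patch (in particular, it does not show the saturation fix). It assumes the products are small enough that the 32-bit sum does not overflow, and that len is a multiple of 4 for the unrolled variant.

```c
#include <stdint.h>

/* Plain loop: one multiply-accumulate per iteration.
 * A DSP compiler can usually map this onto a hardware loop. */
int32_t inner_product_plain(const int16_t *a, const int16_t *b, int len)
{
    int32_t sum = 0;
    int i;
    for (i = 0; i < len; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* Unrolled by four: the shape that suits 4-wide SIMD such as SSE.
 * Assumes len is a multiple of 4 (an assumption of this sketch). */
int32_t inner_product_unrolled4(const int16_t *a, const int16_t *b, int len)
{
    int32_t sum = 0;
    int i;
    for (i = 0; i < len; i += 4) {
        int32_t part = 0;
        part += (int32_t)a[i]     * b[i];
        part += (int32_t)a[i + 1] * b[i + 1];
        part += (int32_t)a[i + 2] * b[i + 2];
        part += (int32_t)a[i + 3] * b[i + 3];
        sum += part;
    }
    return sum;
}
```

Both functions return the same value for the same input as long as nothing overflows, so the choice is purely about what the target compiler schedules best.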
> Talking about performance (still using generic version with VDSP compiler):
You'll likely see a difference just by optimising the MULT16_32_Q15() macro.
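
For reference, here is a rough sketch of what that operation computes: a 16-bit value times a 32-bit value, with the result shifted right by 15. The function names are hypothetical and this is not the macro from the Speex sources. The first version uses a 64-bit intermediate; the second splits the 32-bit operand into 16-bit halves, which is the kind of form a 16x16 multiplier handles well. Both assume that right-shifting a negative value is an arithmetic shift, which is implementation-defined in C but true on the usual compilers.

```c
#include <stdint.h>

/* Reference: (a * b) >> 15 using a 64-bit intermediate. */
int32_t mult16_32_q15_ref(int16_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 15);
}

/* Same result built from two 16x16 multiplies.
 * b is split as hi * 65536 + lo, with hi signed and lo unsigned.
 * Not valid for a == -32768 together with b < -2147418112,
 * where the first term overflows 32 bits. */
int32_t mult16_32_q15_split(int16_t a, int32_t b)
{
    int32_t hi = b >> 16;       /* signed upper half (floor) */
    int32_t lo = b & 0xffff;    /* unsigned lower half, 0..65535 */
    return (int32_t)a * hi * 2 + (((int32_t)a * lo) >> 15);
}
```

On a DSP, the split form is where a hand-written or intrinsic-based version usually pays off, since each half maps onto a native multiply.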
> 1. I got a pretty good boost by using a scratch buffer in SRAM.
Normally, all the data should end up in L1 cache, so it's surprising
that you see a difference with using SRAM. Are you sure your cache isn't
configured as write-through?
> 2. Wideband Encode+Decode takes 79.1 + 7.2 MIPS on my BF536 400/133 MHz
> 3. Profiler says:
> vq_nbest 33.05%
> vq_nbest_sign 11.12%
You should be able to get a big boost in performance just by optimising
the N=1 case for vq_nbest() and vq_nbest_sign().
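
As a sketch of what I mean (simplified, with made-up names and plain integer types, not the actual Speex source): a general N-best search has to keep a sorted list of candidates, while the N=1 case only needs a running minimum. The distance used here is half the entry energy minus the correlation with the input, and E[] is assumed to hold precomputed, non-negative entry energies.

```c
#include <stdint.h>

/* Correlation between the input and one codebook entry. */
int32_t dot_sketch(const int16_t *x, const int16_t *y, int len)
{
    int32_t sum = 0;
    int j;
    for (j = 0; j < len; j++)
        sum += (int32_t)x[j] * y[j];
    return sum;
}

/* General case: keeps the N best entries sorted by distance. */
void vq_nbest_sketch(const int16_t *in, const int16_t *codebook, int len,
                     int entries, const int32_t *E, int N,
                     int *nbest, int32_t *best_dist)
{
    int i, k, used = 0;
    for (i = 0; i < entries; i++) {
        int32_t dist = (E[i] >> 1) - dot_sketch(in, codebook + i * len, len);
        if (i < N || dist < best_dist[N - 1]) {
            /* Shift worse candidates down to make room (insertion sort). */
            for (k = N - 1; k >= 1 && (k > used || dist < best_dist[k - 1]); k--) {
                best_dist[k] = best_dist[k - 1];
                nbest[k] = nbest[k - 1];
            }
            best_dist[k] = dist;
            nbest[k] = i;
            used++;
        }
    }
}

/* N=1 special case: no list to maintain, just a running minimum. */
int vq_best1_sketch(const int16_t *in, const int16_t *codebook, int len,
                    int entries, const int32_t *E, int32_t *best_dist)
{
    int i, best = 0;
    int32_t min_dist = INT32_MAX;
    for (i = 0; i < entries; i++) {
        int32_t dist = (E[i] >> 1) - dot_sketch(in, codebook + i * len, len);
        if (dist < min_dist) {
            min_dist = dist;
            best = i;
        }
    }
    *best_dist = min_dist;
    return best;
}
```

The N=1 version drops the inner shifting loop and its branches entirely, which is where a hand-optimised inner loop becomes practical.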
> filter_mem16 4.14%
If you look at the Blackfin-optimised version, it actually uses a
different algorithm (that does 2 MACs/cycle) for this one (assuming you
place the arrays in two banks, which I don't do yet).
> inner_prod 4.07%
Again, the Blackfin-optimised version does it with 2 MACs/cycle.
> iir_mem16 2.75%
> qmf_synth 2.32%
> lsp_to_lpc 2.32%
> open_loop_nbest_pitch 1.41%
> compute_impulse_response 1.37%
> qmf_decomp 1.28%
> lpc_to_lsp 1.26%
> fir_mem16 1.16%
> speex_bits_pack 1.07%
> speex_bits_unpack_unsigned 0.86%
> compute_rms16 0.79%
> 4. I'm using the echo-canceller + preprocessor,
> I'd really like to improve performance here:
> - I would like to use ADI's FFT, but it's limited to powers of 2.
> Is it safe to enable "Round ps_size down to the nearest power of two" in the preprocessor?
It should be (unless I broke it!). Otherwise, nothing prevents you from
doing all that processing on power-of-two frames and then doing a bit of
buffering for the codec.
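
A minimal sketch of that buffering, assuming 16-bit samples: the processing stage pushes power-of-two blocks into a small FIFO, and the codec pops whole frames out of it whenever enough samples have accumulated (a Speex wideband frame is 320 samples). The type name, function names and capacity below are hypothetical, chosen just for this example.

```c
#include <stdint.h>
#include <string.h>

#define FIFO_CAPACITY 1024   /* hypothetical; must hold at least one block plus one frame */

typedef struct {
    int16_t data[FIFO_CAPACITY];
    int count;               /* number of samples currently stored */
} frame_fifo;

void fifo_init(frame_fifo *f)
{
    f->count = 0;
}

/* Append n samples. Returns 0 on success, -1 if they do not fit. */
int fifo_push(frame_fifo *f, const int16_t *in, int n)
{
    if (n < 0 || f->count + n > FIFO_CAPACITY)
        return -1;
    memcpy(f->data + f->count, in, (size_t)n * sizeof(int16_t));
    f->count += n;
    return 0;
}

/* Remove n samples from the front into out.
 * Returns 1 if a full frame was produced, 0 if not enough data yet. */
int fifo_pop(frame_fifo *f, int16_t *out, int n)
{
    if (n <= 0 || f->count < n)
        return 0;
    memcpy(out, f->data, (size_t)n * sizeof(int16_t));
    /* Slide the remaining samples to the front of the buffer. */
    memmove(f->data, f->data + n, (size_t)(f->count - n) * sizeof(int16_t));
    f->count -= n;
    return 1;
}
```

With 256-sample processing blocks and 320-sample codec frames, the first block yields no frame, the second yields one, and so on. The memmove keeps the code simple; a ring buffer would avoid the copy if it ever shows up in the profile.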
> Can we do the same trick with the echo canceller for window_size?
If you want to use the echo canceller with a power-of-two FFT, the frame
size really needs to be a power of two.
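
For picking such a frame size, a small helper along these lines may be handy. It is only a sketch with hypothetical names, not something from the Speex API. For example, a 160-sample frame is not a power of two, and the nearest one below it is 128.

```c
/* Returns 1 if n is a power of two, 0 otherwise (0 is not a power of two). */
int is_power_of_two(unsigned n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Largest power of two that is <= n; returns 0 for n == 0. */
unsigned round_down_pow2(unsigned n)
{
    unsigned p = 1;
    if (n == 0)
        return 0;
    while (p <= n / 2)
        p *= 2;
    return p;
}
```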
> - Are there buffers that could be placed in scratch memory?
> (I don't see any speex_scratch_alloc in there)
I don't understand your question.