preprocessor performance (was Re: [speex-dev] Memory leak in denoiser + a few questions)

Wed Mar 31 11:49:04 PST 2004

Jean-Marc Valin wrote:

>OK, so the problem doesn't seem to be the VAD specifically. Can you tell
>me how much audio you had in the test? It may be that nothing's wrong
>and the code just isn't so fast that you can do 100 channels. Or maybe
>it just needs a bit of optimization...

In my test, I have a buffer of 1024x1024 samples (about 1 million, or 65 
seconds), which I zero and then fill with 537760 samples (about 500K, or 
30 seconds) of sampled audio.  The rest of the buffer stays zeroed.

Then, I run the preprocessor over it 5 times.  This simulates about 5 
minutes of preprocessing, consisting of alternating 30-second segments 
of speech and silence.
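
The test setup described above could be sketched roughly like this.  The 
16 kHz sample rate is an assumption inferred from the quoted durations 
(1024x1024 samples in about 65 seconds); FRAME_SIZE, frame_fn, 
count_frame and the helper names are hypothetical stand-ins, not Speex 
API.  The real test would pass a callback that calls the preprocessor.

```c
#include <stdlib.h>
#include <string.h>

#define BUF_SAMPLES   (1024 * 1024)  /* total buffer length, from the post */
#define AUDIO_SAMPLES 537760         /* sampled audio length, from the post */
#define PASSES        5              /* runs over the whole buffer */
#define SAMPLE_RATE   16000          /* ASSUMPTION: implied by "65 seconds" */
#define FRAME_SIZE    160            /* ASSUMPTION: hypothetical frame length */

/* Hypothetical stand-in for the per-frame preprocessor call. */
typedef void (*frame_fn)(short *frame, int n, void *user);

/* Placeholder callback: only counts frames, does no processing. */
static void count_frame(short *frame, int n, void *user)
{
    (void)frame; (void)n;
    ++*(long *)user;
}

/* Zero the buffer, copy the audio to the front, leave the rest silent. */
static short *make_test_buffer(const short *audio, size_t audio_len)
{
    short *buf = calloc(BUF_SAMPLES, sizeof *buf);
    if (buf && audio_len <= BUF_SAMPLES)
        memcpy(buf, audio, audio_len * sizeof *buf);
    return buf;
}

/* Run fn over every whole frame, PASSES times; returns frames processed. */
static long run_passes(short *buf, frame_fn fn, void *user)
{
    long frames = 0;
    int pass;
    size_t off;
    for (pass = 0; pass < PASSES; pass++)
        for (off = 0; off + FRAME_SIZE <= BUF_SAMPLES; off += FRAME_SIZE) {
            fn(buf + off, FRAME_SIZE, user);
            frames++;
        }
    return frames;
}

/* Seconds of audio represented by a sample count at SAMPLE_RATE. */
static double seconds_of(long samples)
{
    return (double)samples / SAMPLE_RATE;
}
```

Under the 16 kHz assumption, seconds_of(BUF_SAMPLES) is about 65.5 and 
seconds_of(AUDIO_SAMPLES) is about 33.6, so five passes cover roughly 
five and a half minutes, about half of it silence.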

I sent (off-list) some oprofile output, but I'm not sure what to make of 
it.  Some operations that don't look any more complicated than others 
seem to take a long time.  I also tried getting samples on 
DATA_CACHE_MISSES.  Here's an example of the hotspots I found (in 
preprocessor.c, code modified a bit to include local pointers to arrays 
in the st struct):

The first four columns are the sample counts and percentages for 
CPU_CLK_UNHALTED events (cycles outside of halt state, unit mask 0x00, 
count 10000) and DATA_CACHE_MISSES events (data cache misses, unit mask 
0x00, count 1000), respectively.  The hits attributed to inc %ebx might 
really belong to the previous instruction, but clearly this loop itself 
is taking almost 7% of the time, which doesn't make sense.

                               :   for (i=1;i<N;i++)
                               : 804a340:       mov    $0x1,%ebx
    18  0.0012     0 0.0e+00   : 804a345:       cmp    %edi,%ebx
                               : 804a347:       jge    804a377 <speex_preprocess+0x3c7>
                               : 804a349:       fldl   0x804d810
    11 7.2e-04     0 0.0e+00   : 804a34f:       fldl   0x804d818
                            :      zeta[i] = .7*zeta[i] + .3*prior[i];
                               : 804a355:       mov    0xffffffb4(%ebp),%ecx
  1494  0.0979     1  0.0695   : 804a358:       mov    0xffffffac(%ebp),%eax
    22  0.0014     0 0.0e+00   : 804a35b:       fld    %st(1)
                               : 804a35d:       fld    %st(1)
  1546  0.1013     1  0.0695   : 804a35f:       fxch   %st(1)
  1532  0.1004     0 0.0e+00   : 804a361:       fmuls  (%ecx,%ebx,4)
     1 6.6e-05     0 0.0e+00   : 804a364:       fxch   %st(1)
     8 5.2e-04     0 0.0e+00   : 804a366:       fmuls  (%eax,%ebx,4)
  1416  0.0928     9  0.6254   : 804a369:       faddp  %st,%st(1)
     1 6.6e-05     0 0.0e+00   : 804a36b:       fstps  (%ecx,%ebx,4)
102158  6.6924    15  1.0424   : 804a36e:       inc    %ebx
  5864  0.3842     0 0.0e+00   : 804a36f:       cmp    %edi,%ebx
   1564  0.1025     0 0.0e+00   : 804a371:       jl     804a355 <speex_preprocess+0x3a5>
                               : 804a373:       fstp   %st(0)
   144  0.0094     0 0.0e+00   : 804a375:       fstp   %st(0)
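
One way to check whether this loop is really that expensive by itself is 
to lift it out and time it in isolation.  This is only a sketch: the 
loop body is the one quoted in the listing, but smooth_zeta, 
time_smooth_zeta and the choice of clock() for timing are my own 
hypothetical scaffolding, not code from preprocessor.c.

```c
#include <time.h>

/* The loop from the listing, lifted into a standalone function. */
static void smooth_zeta(float *zeta, const float *prior, int N)
{
    int i;
    for (i = 1; i < N; i++)
        zeta[i] = .7*zeta[i] + .3*prior[i];
}

/* Call the loop `reps` times and return the CPU seconds used. */
static double time_smooth_zeta(float *zeta, const float *prior,
                               int N, int reps)
{
    clock_t start = clock();
    int r;
    for (r = 0; r < reps; r++)
        smooth_zeta(zeta, prior, N);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Feeding it arrays captured during the speech and the silent parts of the 
test separately would show whether the cost depends on the data or only 
on the instruction sequence.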

Here, this area of the code is taking (in this example) about 13% of the 
execution time:

                               :         zeta1 = zeta[i];
                               :      else
                               :         zeta1 = .25*zeta[i-1] + .5*zeta[i] + .25*zeta[i+1];
                               : 804a490:       mov    0xffffffb4(%ebp),%edx
  4292  0.2812     0 0.0e+00   : 804a493:       fldl   0x804d868
   287  0.0188     0 0.0e+00   : 804a499:       flds   (%edx,%ebx,4)
146543  9.6001    26  1.8068   : 804a49c:       fxch   %st(1)
  28942  1.8960     3  0.2085   : 804a49e:       fmuls  0xfffffffc(%edx,%ebx,4)
  9996  0.6548     1  0.0695   : 804a4a2:       fxch   %st(1)
                               : 804a4a4:       fmuls  0x804d708
  1655  0.1084     0 0.0e+00   : 804a4aa:       faddp  %st,%st(1)
  1030  0.0675     1  0.0695   : 804a4ac:       fldl   0x804d868
   657  0.0430     0 0.0e+00   : 804a4b2:       fmuls  0x4(%edx,%ebx,4)
   553  0.0362     0 0.0e+00   : 804a4b6:       faddp  %st,%st(1)
  1129  0.0740     0 0.0e+00   : 804a4b8:       fstps  0xffffffe4(%ebp)
 53350  3.4950     3  0.2085   : 804a4bb:       flds   0xffffffe4(%ebp)
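
For reference, the source statement behind this listing is a three-point 
smoothing of zeta with weights .25, .5, .25.  A standalone sketch is 
below.  The condition of the if is not visible in the listing, so the 
endpoint test used here is an assumption, and smoothed_zeta is a 
hypothetical name.

```c
/* Smoothed value of zeta at bin i, for 1 <= i < N. */
static float smoothed_zeta(const float *zeta, int i, int N)
{
    /* ASSUMPTION: the real condition is not shown in the listing.
       Here the first and last bins of the loop are left unsmoothed,
       which also keeps zeta[i+1] inside the array. */
    if (i == 1 || i == N - 1)
        return zeta[i];
    /* Weighted average of the bin and its two neighbours. */
    return .25*zeta[i-1] + .5*zeta[i] + .25*zeta[i+1];
}
```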

I see that there are probably some optimizations that could be made when 
using the preprocessor only for VAD; the reverse FFT, writing back 
results, etc. could certainly be skipped, since if only VAD is enabled 
there's no point in modifying the samples.  But that isn't the bulk of 
the consumption, assuming that what oprofile is telling me is even close 
to correct.
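
The VAD-only shortcut could look something like the sketch below.  
Everything in it is hypothetical: SketchState, its flags, and the 
placeholder VAD and synthesis functions are made up for illustration and 
are not the real Speex preprocessor state or code.

```c
typedef struct {
    int denoise_enabled;  /* hypothetical flags, not the real st struct */
    int agc_enabled;
    int vad_enabled;
    long synth_calls;     /* counts how often the costly step ran */
} SketchState;

/* The samples only need rewriting if some feature changes them. */
static int output_modified(const SketchState *st)
{
    return st->denoise_enabled || st->agc_enabled;
}

/* Placeholder for the reverse FFT and writing back the results. */
static void synthesize(SketchState *st, short *frame, int n)
{
    (void)frame; (void)n;
    st->synth_calls++;
}

/* Placeholder VAD: any nonzero sample counts as speech. */
static int frame_has_speech(const short *frame, int n)
{
    int i;
    for (i = 0; i < n; i++)
        if (frame[i] != 0)
            return 1;
    return 0;
}

/* Returns the VAD decision; skips synthesis when only VAD is enabled. */
static int process_frame_sketch(SketchState *st, short *frame, int n)
{
    int speech = st->vad_enabled ? frame_has_speech(frame, n) : 1;
    if (output_modified(st))
        synthesize(st, frame, n);
    return speech;
}
```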
