[Speex-dev] mdf -- better adaption of W?

Mon Dec 5 18:28:19 PST 2005

Hi,

I'm still working on visualizing the echo canceller, but I discovered 
something that might be interesting.

During testing, I did this:

Generate a test signal (10+x sine waves per frame), where x increases by
one for each iteration, and wraps around at 100.

Set the speaker signal for the frame to the test signal.
Add 0.5*test signal to the mic signal.

When watching the power graph (visualized from ps in the preprocessor), I 
see a large spike starting at 10 sines and moving up, then wrapping 
around. It slowly diminishes, but never goes away. It is also diminished 
much more while "moving" (slowly increasing frequency), and much less so 
at the wraparound point.

This was with a tail of 5*framesize (M=5).

However, if I set the tail to M=1, the filter seems to adapt much more 
quickly, and also gives much better results; the moving sine is now almost 
gone. Odd.

Next test, I delayed the signal added to the mic by one frame and set M=2. 
It still adapts, but much more slowly. Good.

Next test, delay the signal 3 frames, keep M=2. Complete deterioration of 
state; output is just noise, and the preprocessor starts spitting out NaN 
values for loudness and Zlast.

Repeat with M=5 (mic still delayed 3 frames). Adapts, but does not 
completely cancel as it did earlier, and has very little cancellation for 
the "edges" (when the sine wraps from 110 sines/frame back to 10 
sines/frame).

Repeat with M=5 and mic delayed 8 frames. No cancellation, as expected.

So... Next step, I skimmed through the "Multidelay Block Frequency Domain 
Adaptive Filter" paper, which I understand mdf.c is based on. If I 
understand this correctly:
  - It keeps the frequency domain of the last M frames (in the X array).
  - The "output" (the signal to cancel?) is computed by taking the
    last M frequency domains, multiplying each frequency band by a weight,
    summing them together and taking the inverse FFT. The weights are
    stored in W.
  - W is updated through some magic.
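
To make the first two points concrete, here is a rough sketch of a 
block-partitioned frequency-domain filter. It is the textbook overlap-save 
form (2N-point transforms, last N output samples kept), which is an 
assumption on my part and not copied from mdf.c. The names dft, 
block_weights and filter_frames are made up for the illustration, the 
transform is a naive O(n^2) DFT to keep it dependency-free, and the 
adaptation of W is left out entirely.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT; good enough for a small illustration."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def block_weights(taps, n):
    """Split a time-domain filter into blocks of n taps, zero-pad to 2n, transform."""
    blocks = [taps[i:i + n] for i in range(0, len(taps), n)]
    return [dft(list(b) + [0.0] * (2 * n - len(b))) for b in blocks]


def filter_frames(frames, W, n):
    """Per frame: sum over blocks of X[j] * W[j], then inverse DFT."""
    m = len(W)
    X = [[0j] * (2 * n) for _ in range(m)]  # X[0] is newest, X[m-1] is oldest
    prev = [0.0] * n
    out = []
    for frame in frames:
        X.pop()                               # drop the oldest spectrum
        X.insert(0, dft(prev + list(frame)))  # spectrum of [previous, current]
        Y = [sum(X[j][k] * W[j][k] for j in range(m)) for k in range(2 * n)]
        y = dft(Y, inverse=True)
        out.append([v.real for v in y[n:]])   # overlap-save: keep the last n samples
        prev = list(frame)
    return out


if __name__ == "__main__":
    n, m = 4, 5
    taps = [0.0] * (n * m)
    taps[3 * n] = 0.5                         # gain 0.5, delayed by exactly 3 frames
    frames = [[float(i + j) for j in range(n)] for i in range(6)]
    for row in filter_frames(frames, block_weights(taps, n), n):
        print([round(v, 6) for v in row])
```

In this sketch X and W are indexed by delay (index 0 is the newest frame), 
so a filter that is a pure delay of 3 frames lives entirely in W[3].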

If I got that right, then for the 'mic delay by 3 frames' case, I'd expect 
W[0] to W[3*N] to be 0 (or close to it), then W[3*N] to W[4*N] to be 0.5, 
and the rest 0.
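
As a small sanity check on that expectation: a gain of 0.5 with a delay of 
exactly 3 frames is an impulse of height 0.5 at the start of block 3, and 
the transform of such an impulse is 0.5 (purely real) in every bin. The 
sketch below builds that layout. It assumes weights indexed by delay 
(block 0 is the newest frame) and zero-padded 2N-point blocks; the tiny 
N, the naive DFT and the name ideal_weights are illustrative only.

```python
import cmath


def dft(x):
    """Naive O(n^2) forward DFT; good enough for a small illustration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def ideal_weights(n, m, delay_frames, gain):
    """Weights for a pure whole-frame delay, one list of 2n bins per block."""
    W = []
    for j in range(m):
        taps = [0.0] * (2 * n)   # n filter taps followed by n zeros of padding
        if j == delay_frames:
            taps[0] = gain       # impulse at the start of the delayed block
        W.append(dft(taps))
    return W


if __name__ == "__main__":
    W = ideal_weights(n=4, m=5, delay_frames=3, gain=0.5)
    for j, block in enumerate(W):
        print(j, [round(abs(w), 3) for w in block])
```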

First off, it seems W is stored 'backwards'. The first values are for the 
oldest frame, ok :)

However, when peeking at the values, it seems that the weights for 
frame 0 (newest) are very low.
For frame 1, they are slightly positive.
For frame 2, they are fairly low, except in the specific 
range of my test signal, where they range from somewhat positive (around 
0.25) to somewhat negative (-0.25).
For frame 3, they are positive all around, around the 0.5 area, but higher 
in the frequency bands of my test signal.
For frame 4, they are very low, except in the range of the test signal, 
where they are slightly negative.
For frame 5, they are low, but positive.
For the rest of the frames, the weights switch from "slightly positive" to 
"slightly negative" -- odd index frames are positive, even index are 
negative.

If I delay the signal by 4 frames instead, it wants to use 
indexes 2, 4 and 6 (with emphasis on 4), with the negatives in 
frames 3 and 5 (and less so in all other odd-index frames).

Looking at the negative weights closest in time to the actual echo, I see 
they are more negative near the "edges" of my test signal, so it seems 
they're an artifact of trying to cope with the fact that my signal jumps 
in frequency every 2 seconds.

If I manually force W to be 0 all over, and 0.5 for the real parts of the 
4th delayed frame, echo cancellation is perfect.

If I initialize W to the "perfect" value, it stays more or less at that 
level, though it does adapt away from it ever so slightly in the 
frequency bands where there are no components at all in the "speaker" 
signal.

So my question is: why doesn't W adapt to the perfect values? Is there 
something that can be done to tune the adaptation?