[Speex-dev] mdf -- better adaption of W?
speex at natvig.com
Mon Dec 5 18:28:19 PST 2005
I'm still working on visualizing the echo canceller, but I discovered
something that might be interesting.
During testing, I did this:
Generate a test signal (10+x sine waves per frame), where x increases by
one for each iteration, and wraps around at 100.
Set the speaker signal for the frame to the test signal.
Add 0.5*test signal to the mic signal.
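A minimal sketch of the test setup above, for anyone who wants to reproduce it. The frame size, the function names, and reading "10+x sine waves per frame" as one sine with 10+x full cycles per frame are all assumptions; the post does not specify them.

```python
import math

FRAME_SIZE = 256  # hypothetical frame size; the post does not state one


def make_test_frame(x, frame_size=FRAME_SIZE, wrap=100):
    """One frame holding a sine with (10 + x) full cycles, x wrapping at `wrap`.

    Assumption: "wraps around at 100" means x restarts at 0 once it reaches
    100. The post later mentions 110 sines/frame, so the exact wrap point
    may differ by one.
    """
    cycles = 10 + (x % wrap)
    return [math.sin(2.0 * math.pi * cycles * n / frame_size)
            for n in range(frame_size)]


def make_frames(iteration, echo_gain=0.5):
    """Return (speaker, mic) for one iteration; mic is a scaled copy of speaker."""
    speaker = make_test_frame(iteration)
    mic = [echo_gain * s for s in speaker]
    return speaker, mic


speaker, mic = make_frames(0)
```

Because the cycle count is an integer, each frame's energy lands on a single FFT bin, which would explain a narrow spike that moves up by one bin per frame in the power graph.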
When watching the power graph (visualized from ps in the preprocessor), I
see a large spike starting at 10 sines and moving up, then wrapping
around. It slowly diminishes, but never goes away. It is also diminished
much more while "moving" (slowly increasing in frequency), and much
less so at the wraparound point.
This was with a tail of 5*framesize (M=5).
However, if I set the tail to M=1, the filter seems to adapt much more
quickly, and also gives much better results; the moving sine is now almost
Next test, I delayed the signal added to the mic by one frame and set M=2.
Still adapts, but does so much slower. Good.
Next test, delay the signal 3 frames, keep M=2. Complete deterioration of
state; output is just noise, and the preprocessor starts spitting out NaN
values for loudness and Zlast.
Repeat with M=5 (mic still delayed 3 frames). Adapts, but does not
completely cancel as it did earlier, and has very little cancellation for
the "edges" (when the sine wraps from 110 sines/frame back to 10).
Repeat with M=5 and mic delayed 8 frames. No cancellation, as expected.
So... Next step, I skimmed through the "Multidelay Block Frequency Domain
Adaptive Filter" paper, which I understand mdf.c is based on. If I
understand this correctly:
- It keeps the frequency-domain representation of the last M frames (in
the X array).
- The "output" (the signal to cancel?) is computed by taking the
last M spectra, multiplying each frequency band by a weight,
summing them together and taking the inverse FFT. The weights are stored in W.
- Update W through some magic.
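The output step described in the list above can be sketched as follows. This is an illustration of the general partitioned overlap-save idea under stated assumptions, not the code in mdf.c: it uses a naive DFT, complex weights per bin, spectra taken over two consecutive frames (2N samples), and hypothetical names (`dft`, `block_spectrum`, `mdf_output`). The adaptation of W is left out.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT; slow, but enough for a small illustration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def block_spectrum(prev_frame, cur_frame):
    """Spectrum of two consecutive frames (2N samples), as overlap-save needs."""
    return dft(list(prev_frame) + list(cur_frame))


def mdf_output(X_hist, W):
    """Estimate one frame of echo from the last M spectra.

    X_hist[k] is the spectrum delayed by k frames (k = 0 is the newest);
    W[k] holds one complex weight per bin for that delay.
    """
    bins = len(X_hist[0])
    Y = [sum(X_hist[k][b] * W[k][b] for k in range(len(W)))
         for b in range(bins)]
    y = dft(Y, inverse=True)
    return [v.real for v in y[bins // 2:]]  # keep only the last N samples


# Usage: a pure 3-frame delay with gain 0.5, as in the tests described above.
N, M, DELAY = 8, 5, 3
frames = [[float((7 * j + 3 * n) % 11) - 5.0 for n in range(N)]
          for j in range(M + 1)]
X_hist = [block_spectrum(frames[M - k - 1], frames[M - k]) for k in range(M)]
W = [[0j] * (2 * N) for _ in range(M)]
W[DELAY] = [0.5 + 0j] * (2 * N)  # flat, purely real weights on one block

echo = mdf_output(X_hist, W)
expected = [0.5 * s for s in frames[M - DELAY]]
max_error = max(abs(a - b) for a, b in zip(echo, expected))
```

In this sketch, flat real weights of 0.5 on the block for delay 3 reproduce half of the frame played three frames earlier, up to rounding error.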
If I got that right, then for the 'mic delay by 3 frames', I'd expect
W[0] to W[3*N] to be 0 (or close to it), then W[3*N] to W[4*N] to be 0.5,
and the rest 0.
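That expectation can be written down as a small helper. It is only a sketch: it assumes one real weight per bin, a flat list with the newest frame first, and hypothetical names (`ideal_weights`, `frame_weights`); the actual buffer in mdf.c may be ordered and packed differently.

```python
def ideal_weights(num_frames, bins_per_frame, delay_frames, gain=0.5):
    """Flat weight list for a pure delay: `gain` on one frame, 0 elsewhere.

    Index 0 is the newest frame (the layout expected above).
    """
    if not 0 <= delay_frames < num_frames:
        raise ValueError("a filter of %d frames cannot represent a delay of %d"
                         % (num_frames, delay_frames))
    w = [0.0] * (num_frames * bins_per_frame)
    start = delay_frames * bins_per_frame
    for i in range(start, start + bins_per_frame):
        w[i] = gain
    return w


def frame_weights(w, bins_per_frame, frame_index):
    """Slice out the weights that belong to one delayed frame."""
    start = frame_index * bins_per_frame
    return w[start:start + bins_per_frame]


M, N = 5, 4  # small hypothetical sizes, chosen for readability
W = ideal_weights(M, N, delay_frames=3)
```

The range check mirrors the earlier tests: with M=2 a 3-frame delay has no block to land in, and with M=5 neither does an 8-frame delay.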
First off, it seems W is stored 'backwards'. The first values are for the
oldest frame, ok :)
However, when peeking at the values, it seems that the weights for
frame 0 (newest) are very low.
For frame 1, they are slightly positive.
For frame 2, they are fairly low, except in the specific
range of my test signal, where they range from somewhat positive (around
0.25) to somewhat negative (-0.25).
For frame 3, they are positive all around, around the 0.5 area, but higher
in the frequency bands of my test signal.
For frame 4, they are very low, except in the range of the test signal,
where they are slightly negative.
For frame 5, they are low, but positive.
For the rest of the frames, the weights switch from "slightly positive" to
"slightly negative" -- odd index frames are positive, even index are
negative.
If I delay the signal by 4 frames instead, it wants to use
indexes 2, 4 and 6 (with emphasis on 4), with the negatives in
frames 3 and 5 (and less so in all other odd-index frames).
Looking at the negative weights closest in time to the actual echo, I see
they are more negative near the "edges" of my test signal, so it seems
they're an artifact of trying to cope with the fact that my signal jumps
in frequency every 2 seconds.
If I manually force W to be 0 all over, and 0.5 for the real parts of the
4th delayed frame, echo cancellation is perfect.
If I initialize W to the "perfect" value, it stays more or less at that
level, though it does adapt away from it ever so slightly in the
frequency bands where there are no components at all in the "speaker"
signal.
So my question is, why doesn't W adapt to the perfect values? Is there
something that can be done to tune the adaption?