[speex-dev] Re: Preprocessing and Echo Cancellation Notes.
Jean-Marc.Valin at USherbrooke.ca
Sat Nov 8 22:00:05 PST 2003
> 1) AGC: This seems to work pretty well in all cases. I had previously
> hacked-in the "compander" filter from sox for a similar effect. What
> I've noticed is that speex_preprocess's AGC has no "knobs", and it
> seems to use an attack/decay that is a lot faster than what I had
> chosen from the sox compander, but it works pretty well nonetheless. I
> think your choices may have been better. It's amazing how little
> difference I can hear now regardless of how I have my microphone gain
> set, from about 10% to 90% gain.
Well, good to know that it works.
> 2) VAD: I never had a good VAD implementation in the library; I had a
> user-configurable audio energy threshold that did this, plus, I had a
> hokey algorithm where I did a pretty naive estimate of the noise floor,
> and then considered anything 5dB above that to be speech. This worked
> OK, but since I never updated my "noise floor" estimate, it was easily
> broken if there was additional noise added at any time (i.e. the user
> raised their microphone gain). I have gone in and adjusted some
> knobs here:
> /* if (st->speech_prob> .35 || (st->last_speech < 20 &&
> st->speech_prob>.1)) */
> if (st->speech_prob> .30 || (st->last_speech < 20 &&
Well, the tuning always depends on what you're trying to achieve.
Currently, the VAD is mostly tuned to make sure it doesn't start
> to make it more sensitive, because I was getting some missed speech,
> and some dropouts. The dropouts were especially troubling, because
> they caused a big degradation in speech in some cases. The second
> parameter helped a bit in this case, but I think there might be a
> smarter implementation yet -- like immediately lowering the threshold
> once speech is detected, and then raising it gradually based on the
> previous probabilities?
There's probably lots of improvements that can be done...
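
The suggestion in the quoted text (lower the threshold as soon as speech is detected, then raise it gradually) could be sketched roughly as follows. This is not the speex_preprocess code: the struct, the function names and the recovery step are hypothetical, and only the 0.30 and 0.10 values in the usage note echo the thresholds quoted above.

```c
#include <assert.h>

/* Hypothetical hysteresis state; NOT part of speex_preprocess. */
typedef struct {
    float threshold; /* current decision threshold on the speech probability */
    float high;      /* threshold after a long silence (e.g. 0.30) */
    float low;       /* threshold right after detected speech (e.g. 0.10) */
    float recover;   /* per-frame step by which the threshold rises back */
} vad_hyst;

static void vad_hyst_init(vad_hyst *v, float high, float low, float recover)
{
    assert(low <= high && recover > 0.0f);
    v->threshold = high;
    v->high = high;
    v->low = low;
    v->recover = recover;
}

/* Returns 1 if the frame is classified as speech, 0 otherwise. */
static int vad_hyst_update(vad_hyst *v, float speech_prob)
{
    if (speech_prob > v->threshold) {
        /* Speech detected: drop the threshold immediately. */
        v->threshold = v->low;
        return 1;
    }
    /* No speech: let the threshold creep back up, capped at 'high'. */
    v->threshold += v->recover;
    if (v->threshold > v->high)
        v->threshold = v->high;
    return 0;
}
```

With vad_hyst_init(&v, 0.30f, 0.10f, 0.05f), a frame with probability 0.20 is rejected after silence but accepted right after a speech frame, which is the behaviour that should reduce mid-word dropouts. Whether it actually helps would need listening tests.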
> I had also experimented with the 3GPP AMR VAD code (which is, of
> course, copyrighted) to see how it compares, and it was still better
> than speex, but speex is still pretty good.
Well, if this VAD was able to beat the 3GPP VAD, then some people would
probably lose their job :)
> a) The most interesting thing it does is sometimes it also de-voices
> speech. I.e. if you say "aaaaaaa" into the filter, after about 3
> seconds, your voice just disappears :). I thought this was
> interesting, and I wanted to see how smart it was, so instead of a
> single vowel sound, I tried repeating vowel-consonant pairs, like
> "badumpbadumpbadump", and if I was consistent enough with that, I could
> make them mostly disappear as well. This was lots of fun. What it
> points out, though, is that denoising and, say, singing, won't get along
> very well at all! I'm also wondering if it could be used to cancel out
> a boring speaker :)
Well, what you observe is the effect of noise adaptation. If (in
general) a signal is stationary, there's no real way to differentiate it
from noise... One easy way to solve the problem though is simply to
increase the time over which the signal needs to be stationary to be
considered as noise.
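
As a rough illustration of that idea, the sketch below only reports a signal as stationary once its frame energy has stayed within a tolerance for a minimum number of consecutive frames. It is a simplification under stated assumptions: a real denoiser would track this per frequency band rather than on total frame energy, and all names and parameters here are hypothetical, not taken from the Speex source.

```c
#include <assert.h>

/* Hypothetical stationarity tracker; NOT part of the Speex preprocessor. */
typedef struct {
    float ref_energy; /* energy of the frame that started the current run */
    int   run;        /* consecutive frames that stayed close to ref_energy */
    int   min_frames; /* run length required before calling it noise */
    float tol;        /* allowed relative deviation, e.g. 0.1 for 10% */
} stationarity;

static void stationarity_init(stationarity *s, int min_frames, float tol)
{
    assert(min_frames > 0 && tol >= 0.0f);
    s->ref_energy = 0.0f;
    s->run = 0;
    s->min_frames = min_frames;
    s->tol = tol;
}

/* Feed one frame energy; returns 1 once the signal has been stationary
 * for at least min_frames consecutive frames, 0 otherwise. */
static int stationarity_update(stationarity *s, float energy)
{
    float diff = energy - s->ref_energy;
    if (diff < 0.0f)
        diff = -diff;

    if (s->run > 0 && diff <= s->tol * s->ref_energy) {
        s->run++;               /* still close to the reference level */
    } else {
        s->ref_energy = energy; /* level changed: start a new run */
        s->run = 1;
    }
    return s->run >= s->min_frames;
}
```

Raising min_frames is the knob described above: the larger it is, the longer a sustained vowel survives before being treated as noise, at the cost of slower adaptation to real noise changes.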
> b) There are some "musical" artifacts left over. They're not huge,
> but I did notice them as voices faded out, etc. I'm guessing this is
> de-noising, but I was using denoise + AGC at the time, so I'm not sure;
> if AGC is just scaling, then I guess it must be the denoise. I'll
> probably add options to my UI to individually control the different
> filters, which will make evaluation easier.
Musical noise is something that most (all?) denoisers have at different
> Finally, echo cancellation. I haven't actually been able to get the
> echo canceller to do anything really useful for me. I'm currently
> using it something like this:
> ec = speex_echo_state_init(160, 500); /* in ms */
Actually, the second parameter is in samples, since there's no way to
tell the sampling rate.
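
Since the length is given in samples, a tail length expressed in milliseconds has to be converted by the caller. A minimal sketch, where ms_to_samples is a hypothetical helper and not a Speex function:

```c
#include <assert.h>

/* Hypothetical helper: convert a duration in milliseconds to a sample
 * count at the given sampling rate (in Hz). */
static int ms_to_samples(int ms, int sample_rate)
{
    assert(ms >= 0 && sample_rate > 0);
    return (int)(((long long)ms * sample_rate) / 1000);
}
```

Assuming an 8000 Hz sampling rate (the rate is not stated in this thread), a 500 ms tail would be requested as speex_echo_state_init(160, ms_to_samples(500, 8000)), i.e. 4000 samples, whereas a literal 500 covers only 62.5 ms at that rate.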
> I've also tried to use it the same way, but scaling my short samples
> into the range -1< n < 1 (dividing/multiplying by 32767).
The right range is +- 32768. Actually in the CVS version, all inputs and
outputs are now short, so it solves the problem.
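
For a version that still takes float buffers, that means the 16-bit samples should be copied as-is rather than normalized to +-1. A minimal sketch of such a conversion, with hypothetical helper names that are not part of Speex:

```c
#include <assert.h>

/* Copy 16-bit samples to float WITHOUT rescaling, so values stay
 * within +-32768. Hypothetical helper, not a Speex function. */
static void short_to_float(const short *in, float *out, int n)
{
    int i;
    assert(in && out && n >= 0);
    for (i = 0; i < n; i++)
        out[i] = (float)in[i];
}

/* Convert back, rounding to nearest and clipping to the 16-bit range. */
static void float_to_short(const float *in, short *out, int n)
{
    int i;
    assert(in && out && n >= 0);
    for (i = 0; i < n; i++) {
        float x = in[i];
        if (x > 32767.0f)
            x = 32767.0f;
        else if (x < -32768.0f)
            x = -32768.0f;
        out[i] = (short)(x < 0.0f ? x - 0.5f : x + 0.5f);
    }
}
```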
> 1) How should I call the echo canceller with frames of short samples?
Not sure I understand the question?
> 2) Could the apparent "no effect" be due to also later using the
> preprocessor on the frames? I.e. if the echo canceller is only
> reducing the echo by 20 dB or something, the AGC will later bring it
> right back. Is this the reason for the noise array? Should it work at
> all without that code (that I've read isn't quite complete yet?). [I
> haven't tried to use that yet, because the library architecture
> currently has the echo canceller down in the audio driver, where it
> gets well-correlated input/output buffers, and the preprocessing is
> much higher, in the audio-device independent layer, where it only has
> input buffers -- so it will be a bit of work to try this out].
I'm not sure what the problem is. First, you need to know that the echo
canceller is still in an experimental state. The theory of echo
cancellation is rather simple, but the implementation is not. For
example, in order to get good results, you need a good crosstalk
detector. The current one kind of sucks. One more thing: in your example,
you have a 500 sample (not ms) filter length. However, if the
input/output offset introduced by your card is larger than that (or of
the same order), then you won't have any cancellation at all.
Jean-Marc Valin, M.Sc.A., ing. jr.
Université de Sherbrooke, Québec, Canada