[speex-dev] Preprocessing and Echo Cancellation Notes.
stevek at stevek.com
Sat Nov 8 18:55:24 PST 2003
First, I'd just like to thank the Speex community, and Jean-Marc
especially, for their great work.
I'm developing a VoIP library that uses IAX, the Asterisk protocol,
as the network protocol. I've been putting off integrating Speex for a
while, as things have been working pretty well so far with GSM (for
those interested, the code is at iaxclient.sourceforge.net).
However, as Google has recently picked up the new signal processing
stuff that has been added to Speex, that has been very interesting to
me. I've been working with the speex-preprocessing stuff so far, with
relatively impressive results. It's been a lot of fun so far, and I'd
like to give my (subjective) feedback here:
1) AGC: This seems to work pretty well in all cases. I had previously
hacked-in the "compander" filter from sox for a similar effect. What
I've noticed is that speex_preprocess's AGC has no "knobs", and it
seems to use an attack/decay that is a lot faster than what I had
chosen from the sox compander, but it works pretty well nonetheless. I
think your choices may have been better. It's amazing how little
difference I can hear now regardless of how I have my microphone gain
set, from about 10% to 90% gain.
2) VAD: I never had a good VAD implementation in the library; I had a
user-configurable audio energy threshold that did this, plus, I had a
hokey algorithm where I did a pretty naive estimate of the noise floor,
and then considered anything 5dB above that to be speech. This worked
OK, but since I never updated my "noise floor" estimate, it was easily
broken if there was additional noise added at any time (i.e. the user
raised their microphone gain). Here, I have gone in and adjusted some
/* if (st->speech_prob> .35 || (st->last_speech < 20 &&
if (st->speech_prob> .30 || (st->last_speech < 20 &&
to make it more sensitive, because I was getting some missed speech,
and some dropouts. The dropouts were especially troubling, because
they caused a big degradation in speech in some cases. The second
parameter helped a bit in this case, but I think there might be a
smarter implementation yet -- like immediately lowering the threshold
once speech is detected, and then raising it gradually based on the
I had also experimented with the 3GPP AMR VAD code (which is, of
course, copyrighted) to see how it compares, and it was still better
than speex, but speex is still pretty good.
3) denoising: This option was the most interesting. Previously, using
an omnidirectional microphone, like that in a notebook or whatnot, to
pick up speech gave a really poor SNR; conversation was possible, but
it was quite annoying. With the speex denoising filter, it comes
through really clear, pretty much as good as if one were using a
headset and a directional mic right next to the speaker. Combined with
AGC, this was very effective. It does have lots of "interesting"
things that it does, however:
a) The most interesting thing it does is sometimes it also de-voices
speech. I.e. if you say "aaaaaaa" into the filter, after about 3
seconds, your voice just disappears :). I thought this was
interesting, and I wanted to see how smart it was, so instead of a
single vowel sound, I tried repeating vowel-consonant pairs, like
"badumpbadumpbadump", and if I was consistent enough with that, I could
make them mostly disappear as well. This was lots of fun. What it
points out, though, is that denoising and, say, singing, won't get along
very well at all! I'm also wondering if it could be used to cancel out
a boring speaker :)
b) There are some "musical" artifacts left over. They're not huge,
but I did notice them as voices faded out, etc. I'm guessing this is
de-noising, but I was using denoise + AGC at the time, so I'm not sure;
if AGC is just scaling, then I guess it must be the denoise. I'll
probably add options to my UI to individually control the different
filters, which will make evaluation easier.
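For reference, the naive noise-floor VAD described under 2) above, with the fix it was missing (a floor that keeps adapting), might look roughly like the sketch below. This is not the Speex VAD and not the iaxclient code; the EnergyVad struct, the function names and the adaptation rates are all made up for illustration. The 5 dB margin is expressed as a power ratio (10^(5/10), about 3.16) so that no log() call is needed.

```c
#include <stddef.h>

/* Hypothetical energy VAD; names and constants are illustrative only. */
typedef struct {
    double noise_power;  /* running noise-floor estimate (mean squared sample) */
    double ratio;        /* speech threshold as a power ratio; 5 dB is ~3.16 */
    int    initialized;  /* 0 until the first frame has seeded the floor */
} EnergyVad;

/* Mean squared value of one frame of 16-bit samples. */
static double frame_power(const short *pcm, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)pcm[i] * (double)pcm[i];
    return (n > 0) ? sum / (double)n : 0.0;
}

static void energy_vad_init(EnergyVad *v, double ratio)
{
    v->noise_power = 1.0;
    v->ratio = ratio;
    v->initialized = 0;
}

/* Returns 1 if the frame looks like speech, 0 otherwise. */
static int energy_vad_process(EnergyVad *v, const short *pcm, size_t n)
{
    double p = frame_power(pcm, n);

    if (!v->initialized) {
        /* Seed the floor from the first frame and call it non-speech. */
        v->noise_power = (p < 1.0) ? 1.0 : p;
        v->initialized = 1;
        return 0;
    }

    int speech = p > v->noise_power * v->ratio;

    if (p < v->noise_power) {
        /* Quieter than the current floor: follow it down at once. */
        v->noise_power = p;
    } else {
        /* Track upward quickly in non-speech, very slowly during speech,
           so a permanent rise in noise is eventually absorbed. */
        double rate = speech ? 0.002 : 0.05;
        v->noise_power += rate * (p - v->noise_power);
    }
    if (v->noise_power < 1.0)
        v->noise_power = 1.0;  /* avoid a zero floor on digital silence */

    return speech;
}
```

Calling energy_vad_init(&v, 3.16) once and then energy_vad_process() per frame gives the "5 dB above the floor" behaviour, while the slow upward creep during speech is what lets the floor recover after something like a raised microphone gain.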
Finally, echo cancellation. I haven't actually been able to get the
echo canceller to do anything really useful for me. I'm currently
using it something like this:
ec = speex_echo_state_init(160, 500); /* in ms */
/* convert buffers to float, echo cancel, convert back */
float finBuffer, foutBuffer, fcancBuffer;
finBuffer[i] = virtualInBuffer[i];
foutBuffer[i] = ((short *)outputBuffer)[i];
//fprintf(stderr, "echo cancelling virtual mono frame\n");
speex_echo_cancel(ec, finBuffer, foutBuffer, fcancBuffer,
virtualInBuffer[i] = (short)(fcancBuffer[i]);
I've also tried to use it the same way, but scaling my short samples
into the range -1 < n < 1 (dividing/multiplying by 32767).
When I scaled, the echo canceller seemed to have no effect at all.
When I don't scale, all kinds of strange things happen :).
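Whichever range the canceller turns out to want, the two conversions at least need to be consistent with each other and must not wrap on overflow. A minimal sketch of such helpers follows; shorts_to_floats and floats_to_shorts are hypothetical names, not part of Speex, and the sketch deliberately says nothing about which scale speex_echo_cancel expects.

```c
#include <stddef.h>

/* Copy 16-bit samples into a float buffer, multiplying by `scale`.
   scale = 1.0f keeps the 16-bit range; 1.0f / 32768.0f maps to [-1, 1). */
static void shorts_to_floats(const short *in, float *out, size_t n, float scale)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (float)in[i] * scale;
}

/* Convert back: multiply by `scale`, round to nearest, and clip to the
   16-bit range instead of letting the cast wrap around. */
static void floats_to_shorts(const float *in, short *out, size_t n, float scale)
{
    for (size_t i = 0; i < n; i++) {
        float v = in[i] * scale;
        v += (v >= 0.0f) ? 0.5f : -0.5f;   /* round half away from zero */
        if (v > 32767.0f)  v = 32767.0f;
        if (v < -32768.0f) v = -32768.0f;
        out[i] = (short)v;
    }
}
```

With scale = 1.0f in both directions the samples stay in the 16-bit range; with 1.0f / 32768.0f going in and 32768.0f coming out they pass through the unit range and come back unchanged.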
[As I write this, I've been trying some more things. The first thing I
realized is that [duh] my "frames" in this audio driver are 10 ms, not
20 ms, so they're only 80 samples long. So, it's no wonder the echo
canceller didn't do anything, because each frame it was given was 10 ms
of real stuff, and 10 ms of garbage :). After I fixed that blunder, I
got the echo canceller to do _things_ but not actually cancel echo.
Mostly, it introduced additional echo.]
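The frame-size mix-up is easy to check with a little arithmetic. Assuming an 8 kHz sample rate (implied by a 160-sample frame being 20 ms, but not stated outright above), a helper like the hypothetical samples_per_frame below gives the count to compare against whatever frame size the canceller was initialised with:

```c
/* Number of samples in one frame of `frame_ms` milliseconds at `rate_hz`.
   Hypothetical helper, not part of Speex. */
static int samples_per_frame(int rate_hz, int frame_ms)
{
    return rate_hz * frame_ms / 1000;
}

/* Returns 1 if the driver's frames fill exactly one canceller frame. */
static int frame_sizes_match(int rate_hz, int driver_frame_ms, int ec_frame_size)
{
    return samples_per_frame(rate_hz, driver_frame_ms) == ec_frame_size;
}
```

At 8 kHz, a 10 ms driver frame is half of a 160-sample canceller frame, so either the canceller has to be set up for the smaller frame or two driver frames have to be collected before each call.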
So, my questions are:
1) How should I call the echo canceller with frames of short samples?
2) Could the apparent "no effect" be due to also later using the
preprocessor on the frames? I.e. if the echo canceller is only
reducing the echo by 20 dB or something, the AGC will later bring it
right back. Is this the reason for the noise array? Should it work at
all without that code (that I've read isn't quite complete yet?). [I
haven't tried to use that yet, because the library architecture
currently has the echo canceller down in the audio driver, where it
gets well-correlated input/output buffers, and the preprocessing is
much higher, in the audio-device independent layer, where it only has
input buffers -- so it will be a bit of work to try this out].
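To make the worry in question 2) concrete: a gain stage that normalises every frame to a fixed level will undo any plain attenuation applied before it. The toy below is only an illustration of that effect, using a crude peak normaliser as a stand-in for an AGC; it is not how the Speex AGC works, and the function names are invented.

```c
#include <stddef.h>

/* Largest absolute sample value in the frame. */
static float peak_abs(const float *x, size_t n)
{
    float p = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = (x[i] < 0.0f) ? -x[i] : x[i];
        if (a > p)
            p = a;
    }
    return p;
}

/* Toy "AGC": scale the frame so that its peak equals `target`.
   Silent frames are left untouched. */
static void normalize_peak(float *x, size_t n, float target)
{
    float p = peak_abs(x, n);
    if (p > 0.0f) {
        float gain = target / p;
        for (size_t i = 0; i < n; i++)
            x[i] *= gain;
    }
}
```

If a frame containing only echo is attenuated by a factor of 10 (20 dB) and then normalised, it comes out at the same level as the unattenuated frame would have. A real AGC reacts more slowly than this, but the tendency is the same when nothing but residual echo is present.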
For echo cancellation, there's a couple of situations where users might
1) stupid Windows audio driver/card setups, where it is really
difficult and non-obvious, or in some cases seemingly impossible to
cause them to _not_ capture playout sound. This should be a relatively
easy echo to cancel, but it's quite annoying to have to do that.
2) acoustic echo. The normal cases, where people are using open-air
loudspeakers and microphones, as well as the degenerate case, which is
Apple PowerBooks, where the microphone is actually embedded _in_ the
left speaker enclosure. This is what I've been testing with so far,
actually. Apple's iChat AV kinda cheats in this regard a bit; they
seem to only play outgoing audio from the right speaker on PowerBooks.
I think they also have some API to tell if the user has put in a set of
headphones, so they play through both Left and Right in that case.
Since I expect that one of my primary use cases for the library will
involve using the application in a multi-user conference, echo will be
a killer. For now, this can be alleviated by using "push to talk", but
it would be nice if it were feasible to have a completely automatic
setup, with VAD and echo cancellation.
Thanks again for your great work, and any comments out there on my
experiences and problems.