[Speex-dev] VAD Questions

Fri Jun 8 08:13:09 PDT 2007

Hello Jean-Marc et al:

On 07/06/07, Jean-Marc Valin <jean-marc.valin at usherbrooke.ca> wrote:
> > - Is there a reference somewhere (other than the source itself) that
> > explains how the latest VAD algorithm works?
>
> Read the source, Luke :-) (sorry)

Okay. I had to ask :-)

>
> > - Is it possible to obtain the VAD status of a Speex stream
> > asynchronously? The current API seems to imply that some kind of
> > polling is required to determine the voice/non-voice status.
>
> Don't understand your question. Also which VAD are you talking about?
> The one in the encoder or the one in the preprocessor?

Either one. The question is: if we treat the software as a black
box and feed in PCM audio, we get Speex-encoded data out. Where is
the information that indicates whether a given encoded frame contains
speech or not? The API has a "get VAD status" call, but it seems like
that might only indicate whether VAD is currently enabled. Perhaps the
VAD status is contained somewhere in the data frames?

>
> > - Does the VAD algorithm implement syllabic/sonorant rate detection,
> > as has been implemented many times in analog circuitry, and is
> > described in this (and other) papers?
> > http://people.csail.mit.edu/jrg/2005/IS05_schutte.pdf
>
> As far as I understand, the paper you reference above isn't applicable
> to the problem here. Basically, we have to decide whether we have speech
> or silence based only on 20 ms of audio (and the past). If we could
> "look into the future" of the signals, things would be much easier.
>
> > - Over what time period is VAD done? Is it done on a frame by frame
> > basis or over some longer period?
>
> It *has* to be done frame by frame, otherwise you add latency, which
> isn't acceptable.

Okay. What I was trying to determine was whether the speech
detection uses anything more sophisticated than frame energy. As you
said above, I'll have to look at the sources. In many systems,
sonorant energy rate detection is used to detect voice, even under
very poor SNR conditions.
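
To make "frame energy" concrete, here is a minimal sketch of the kind
of baseline detector I have in mind. It is my own illustration and is
not taken from the Speex sources: the names (EnergyVad,
energy_vad_init, energy_vad_process) are hypothetical, and the
threshold, adaptation rate and hangover length are arbitrary
assumptions. It decides frame by frame, using only the current 20 ms
frame and the past, so it adds no latency.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_SIZE 160  /* 20 ms at 8 kHz (assumed sample rate) */

/* Causal energy-based detector: current frame plus past state only. */
typedef struct {
    double noise_floor;  /* running estimate of background energy */
    int hangover;        /* frames left to keep reporting speech */
} EnergyVad;

void energy_vad_init(EnergyVad *v, double initial_noise_floor)
{
    v->noise_floor = initial_noise_floor;
    v->hangover = 0;
}

/* Mean squared amplitude of one frame. */
static double frame_energy(const int16_t *pcm, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)pcm[i] * (double)pcm[i];
    return n ? sum / (double)n : 0.0;
}

/* Returns 1 if the frame is classified as speech, 0 otherwise. */
int energy_vad_process(EnergyVad *v, const int16_t *pcm, size_t n)
{
    const double threshold_ratio = 4.0;  /* about 6 dB above the floor */
    const double adapt = 0.05;           /* noise floor smoothing factor */
    const int hangover_frames = 5;       /* 100 ms of 20 ms frames */

    double e = frame_energy(pcm, n);

    if (e > threshold_ratio * v->noise_floor) {
        /* Loud frame: report speech and re-arm the hangover counter. */
        v->hangover = hangover_frames;
        return 1;
    }

    /* Quiet frame: let the noise floor follow the background level. */
    v->noise_floor = (1.0 - adapt) * v->noise_floor + adapt * e;

    /* Keep reporting speech briefly so word endings are not clipped. */
    if (v->hangover > 0) {
        v->hangover--;
        return 1;
    }
    return 0;
}
```

The caller would initialise the state once with a guess at the
background energy and then call energy_vad_process() on each 20 ms
frame of 16-bit PCM. A detector this simple is exactly what breaks
down at poor SNR, which is why I was asking whether Speex does more.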

Cheers,
-- 
Larry Gadallah, VE6VQ/W7                          lgadallah AT gmail DOT com
PGP Sig: 616D 4E52 CF1F 3FEC FFFB  F11B 7DB9 C79A EA7E B25B