[Speex-dev] How does the jitter buffer "catch up"?
speex at natvig.com
Thu Sep 22 16:27:12 PDT 2005
First off, could you try to set your email client to break long lines before
transmitting? In my (somewhat outdated) pine, the lines appear as VERY long
lines when I try to reply, making it hard to read :)
Minor detail though, I should probably fix pine. Some day.
> The way you describe how the jitter buffer should be implemented makes me
> wonder: How does the jitter buffer work when there is no transmission?
> Let's say my "output" thread gets a speex frame from the jitter buffer every
> 20ms. What happens when there is no frame that arrived on the socket? No
> frames at all for a pretty long time (i.e. many seconds).
> This is my case because I chose not to transmit any sound data when speech
> was not recognized (This speech probability from the preprocessor is so
> sweet! Thanks Jean-marc!). Yes, I know, I'm cheap on bandwidth, but that's on
> purpose... :(
What happens is this:
On the first _get where there are no valid frames (because you stopped
transmitting from the other end), the jitter buffer will tell the decoder to
just decode the last frame again. On the next one, it tells the decoder to
extrapolate from the last frame, and on the next one after that to extrapolate
even more. This goes on until 25 packets are missed, at which point the jitter
buffer resets the decoder and stops extrapolating.
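
To make that sequence concrete, here is a rough sketch of the per-tick
decision as a small state machine. It is not the Speex jitter buffer code;
the class name, the action strings and the exact boundary handling around
the 25th miss are assumptions made for illustration.

```python
MAX_CONCEALED = 25  # missed packets before giving up, as described above


class ConcealmentState:
    """Hypothetical tracker of consecutive missing frames (not the Speex API)."""

    def __init__(self):
        self.missed = 0  # consecutive playback ticks without a valid frame

    def on_get(self, frame_available):
        """Return the action to take for one 20 ms playback tick."""
        if frame_available:
            self.missed = 0
            return "decode"
        self.missed += 1
        if self.missed == 1:
            return "repeat_last"   # decode the last frame again
        if self.missed < MAX_CONCEALED:
            return "extrapolate"   # keep extrapolating from the last frame
        if self.missed == MAX_CONCEALED:
            return "reset"         # reset the decoder, stop extrapolating
        return "silence"           # assumed behaviour after the reset
```

With 20ms frames, 25 misses is half a second of made-up audio before the
sketch goes quiet.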
> I read the Mumble source code (v0.3.2) to see how you do it, and I found this:
> // Ideally, we'd like to go DTX (discontinous transmission)
> // if we didn't detect speech. Unfortunately, the jitter
> // buffer on the receiving end doesn't cope with that
> // very well.
Ah, this is a completely outdated comment, as I found a way to make it work.
What I do is append one bit to each Speex packet which indicates whether this
is an "end of transmission". If it is, I manually tell the jitter buffer to
reset immediately and stop extrapolating, because I know no more packets will
be arriving.
If this "end of transmission" packet should be lost, no harm is done, because
all that happens is that the codec extrapolates a bit, meaning you get a few
hundred ms of alien sounds :)
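
A minimal sketch of the end-of-transmission idea, assuming the flag travels
in one extra trailer byte rather than a single bit packed into the Speex
bitstream. The function names, the Receiver class and the flag value are
all hypothetical; this is not Mumble's actual wire format.

```python
END_OF_TRANSMISSION = 0x01  # hypothetical flag bit in the trailer byte


def wrap_packet(speex_payload: bytes, last: bool) -> bytes:
    """Append a trailer byte whose low bit marks the final packet."""
    flags = END_OF_TRANSMISSION if last else 0
    return speex_payload + bytes([flags])


def unwrap_packet(packet: bytes):
    """Split a packet into (speex_payload, is_last)."""
    if not packet:
        raise ValueError("empty packet")
    return packet[:-1], bool(packet[-1] & END_OF_TRANSMISSION)


class Receiver:
    """Stand-in for the receiving side; it only tracks whether more is expected."""

    def __init__(self):
        self.frames = []
        self.talking = False  # True while more packets are expected

    def on_packet(self, packet: bytes):
        payload, last = unwrap_packet(packet)
        self.frames.append(payload)
        self.talking = not last  # the real code would reset the jitter buffer here

    def on_tick(self):
        """One playback tick: decode, conceal a loss, or stay silent."""
        if self.frames:
            return "decode", self.frames.pop(0)
        return ("conceal" if self.talking else "silence"), None
```

If the flagged packet is lost, `talking` stays set and the receiver falls
back to concealment, which is the "alien sounds" case.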
In an ideal world, you'd like to use Speex DTX mode, which puts the decoder in
"generate comfort noise" mode and also transfers one packet every 400ms (I
think) to update the noise profile, but if you use the denoiser of the
preprocessor then comfort noise == silence.
> I have not implemented the jitter buffer yet, but I wonder if I should.
> I was thinking about holding the first few sound frames before playing them.
> That way, I introduce a delay, which should remove the jitter. Moreover,
> since I'm not transmitting when not speaking, the delay does not sum up to
> get pretty long in the end.
This will work, but will introduce latency in your transmission. This sort of
buffering is very common in streaming media, such as shoutcasts and
videostreams, as they are unidirectional and it doesn't matter if there's a 2
second delay between sending and receiving time. For bidirectional speech, you
want latency at an absolute minimum.
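
For reference, the hold-the-first-few-frames scheme from the question could
look roughly like this. It is a sketch under assumptions: the class and
method names are made up, frames are 20ms as in the question, and a talk
spurt shorter than the hold count would never start playing, which a real
implementation would need to handle.

```python
from collections import deque

FRAME_MS = 20  # frame duration quoted in the question


class Prebuffer:
    """Hypothetical fixed prebuffer: hold `hold_frames` frames, then play."""

    def __init__(self, hold_frames):
        self.hold_frames = hold_frames
        self.queue = deque()
        self.playing = False

    def put(self, frame):
        """Called when a frame arrives from the network."""
        self.queue.append(frame)
        if len(self.queue) >= self.hold_frames:
            self.playing = True  # enough frames held back, start playback

    def get(self):
        """Called every FRAME_MS; returns a frame, or None to play silence."""
        if self.playing and self.queue:
            return self.queue.popleft()
        if not self.queue:
            self.playing = False  # ran dry: hold frames again on the next spurt
        return None

    def approx_latency_ms(self):
        # Rough figure: the first frame waits for the others to arrive.
        return self.hold_frames * FRAME_MS
```

The latency cost is visible directly: holding 3 frames adds about 60ms,
holding 10 adds about 200ms, on top of network and sound card delay.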
Humans start speaking when the other side isn't speaking. Let's take the
extreme case and say there's 10 seconds of delay. If you both start talking at
the same time, it'll be 10 seconds before you hear the other end is also
talking, 10 more seconds to notice that he stopped, and then 10 seconds before
he hears you say "go ahead". 10 sec is extreme, but this effect is quite
noticeable even at 500ms total latency.