[Speex-dev] How does the jitter buffer "catch up"?

Thu Sep 22 16:27:12 PDT 2005

> Hello,

Hi :)

First off, could you try to set your email client to break long lines before 
transmitting? In my (somewhat outdated) pine, the lines appear as VERY long 
lines when I try to reply, making it hard to read :)

Minor detail though, I should probably fix pine. Some day.

> The way you describe how the jitter buffer should be implemented makes me 
> wonder: How does the jitter buffer work when there is no transmission?
> Let's say my "output" thread gets a speex frame from the jitter buffer every 
> 20ms. What happens when no frame has arrived on the socket? No 
> frames at all for a pretty long time (i.e. many seconds).
> This is my case because I chose not to transmit any sound data when speech 
> was not recognized (This speech probability from the preprocessor is so 
> sweet! Thanks Jean-marc!). Yes, I know, I'm cheap on bandwidth, but that's on 
> purpose... :(

What happens is this:

On the first _get where there are no valid frames (because you stopped 
transmitting from the other end), the jitter buffer will tell the decoder to 
just decode the last frame again. On the next one, it tells the decoder to 
extrapolate from the last frame, and on the next one after that to extrapolate 
even more. This goes on until 25 packets are missed, at which point the jitter 
buffer resets the decoder and stops extrapolating.
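
To make that sequence concrete, here is a rough sketch in Python of the 
bookkeeping described above. It is not the actual Speex code: the class name, 
the action strings and the exact handling of the 25th miss are my own 
assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Taken from the description above; the constant in the real code may differ.
MAX_MISSED = 25


@dataclass
class ConcealmentState:
    """Hypothetical model of what the jitter buffer asks the decoder to do."""
    missed: int = 0       # consecutive ticks without a valid frame
    active: bool = False  # True once a frame was decoded since the last reset

    def on_get(self, frame: Optional[bytes]) -> str:
        """Return the decoder action for one 20 ms output tick."""
        if frame is not None:
            self.missed = 0
            self.active = True
            return "decode"
        if not self.active:
            return "silence"      # nothing to extrapolate from
        self.missed += 1
        if self.missed >= MAX_MISSED:
            self.missed = 0
            self.active = False
            return "reset"        # reset the decoder, stop extrapolating
        if self.missed == 1:
            return "repeat"       # decode the last frame again
        return "extrapolate"      # guess further from the last frame
```

With a 20ms frame, 25 misses means roughly half a second of extrapolated 
audio before the stream goes quiet.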

> I read the Mumble source code (v0.3.2) to see how you do it. And I found this 
> comment:
> 	// Ideally, we'd like to go DTX (discontinous transmission)
> 	// if we didn't detect speech. Unfortunately, the jitter
> 	// buffer on the receiving end doesn't cope with that
> 	// very well.

Ah, this is a completely outdated comment, as I found a way to make it work 
well :)

What I do is append one bit to each speex packet which indicates whether this 
is an "end of transmission". If it is, I manually tell the jitter buffer to reset 
immediately and stop extrapolating, because I know no more packets will be 
forthcoming.

If this "end of transmission" packet should be lost, no harm is done, because 
all that happens is that the codec extrapolates a bit, meaning you get a few 
hundred ms of alien sounds :)
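
A minimal sketch of the idea in Python, assuming a whole trailing byte for the 
flag instead of a single bit to keep it simple. This is not the Mumble wire 
format; pack, unpack, on_packet and the reset_jitter_buffer callback are all 
hypothetical names made up for this example.

```python
from typing import Callable, Tuple

END_FLAG = 0x01  # assumed: lowest bit of the trailing byte marks the last packet


def pack(payload: bytes, last: bool) -> bytes:
    """Append the end-of-transmission flag to an encoded frame."""
    return payload + bytes([END_FLAG if last else 0])


def unpack(packet: bytes) -> Tuple[bytes, bool]:
    """Split a packet into the encoded frame and the flag."""
    if not packet:
        raise ValueError("empty packet")
    return packet[:-1], bool(packet[-1] & END_FLAG)


def on_packet(packet: bytes, reset_jitter_buffer: Callable[[], None]) -> bytes:
    """Receiver side: reset right away when the sender says it is done."""
    payload, last = unpack(packet)
    if last:
        # No more packets are coming, so there is no point in extrapolating.
        reset_jitter_buffer()
    return payload
```

If the flagged packet never arrives, on_packet is simply never called with 
last set, and the receiver falls back to the normal missed-packet behaviour.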

In an ideal world, you'd like to use Speex DTX mode, which puts the decoder in 
"generate comfort noise" mode and also transfers one packet every 400ms (I 
think) to update the noise profile, but if you use the denoiser of the 
preprocessor then comfort noise == silence.

> I have not implemented the jitter buffer yet, but I wonder if I should?
> I was thinking about holding the first few sound frames before playing them. 
> That way, I introduce a delay, which should remove the jitter. Moreover, 
> since I'm not transmitting when not speaking, the delay does not sum up to 
> get pretty long in the end.

This will work, but will introduce latency in your transmission. This sort of 
buffering is very common in streaming media, such as shoutcasts and 
videostreams, as they are unidirectional and it doesn't matter if there's a 2 
second delay between sending and receiving time. For bidirectional speech, you 
want latency at an absolute minimum.

Why?

Humans start speaking when the other side isn't speaking. Let's take the 
extreme case and say there's 10 seconds of delay. If you both start talking at 
the same time, it'll be 10 seconds before you hear the other end is also 
talking, 10 more seconds to notice that he stopped, and then 10 seconds before 
he hears you say "go ahead". 10 sec is extreme, but this effect is quite 
noticeable even at 500ms total latency.
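
To put rough numbers on it, here is a back-of-the-envelope sketch in Python. 
The function names are made up, the 20ms frame size comes from your 
description, and I'm treating the latency figure as a one-way delay, which is 
an assumption.

```python
FRAME_MS = 20  # frame duration from the original question


def prebuffer_delay_ms(frames_held: int, frame_ms: int = FRAME_MS) -> int:
    """Latency added by holding frames_held frames before playback starts."""
    return frames_held * frame_ms


def collision_recovery_ms(one_way_delay_ms: int) -> int:
    """Time lost when both sides start talking at once, per the example above:
    one delay to hear the other end talking, one more to notice he stopped,
    and one before he hears you say "go ahead"."""
    return 3 * one_way_delay_ms


if __name__ == "__main__":
    print(prebuffer_delay_ms(5))         # holding 5 frames
    print(collision_recovery_ms(10000))  # the 10 second extreme case
    print(collision_recovery_ms(500))    # a more realistic delay
```

Whatever you hold back in the prebuffer gets added to the network delay, so 
it feeds straight into the second number.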