[Speex-dev] How does the jitter buffer "catch up"?

Sun Sep 18 18:44:32 PDT 2005

>> FYI: The below is just my interpretation of the code, I might be wrong.
>
> Most of it is right. Actually, would you mind if I use part of your
> email for documenting the jitter buffer in the manual?

It would be my pleasure :)

>> early_ratio_XX is the sum of all the positive bins.
>> late_ratio_XX is the sum of all the negative bins.
>
> Right. And only the packets that are "just in time" don't get counted in
> any ratio.

Well.. they're counted in the ontime_ratio_long and _short, right?

One thing that might be worth mentioning: the sum of all the margins will 
never be higher than 1.0, so a test for early_ratio_short > 0.7 means 
(roughly) that 70% or more of the packets in the last short-term time 
period were early.
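
A minimal sketch of how the three ratios relate, as I read the code. The bin count, the Ratios struct and compute_ratios() are names made up for this illustration and are not the libspeex API; the histogram is assumed to hold fractions that sum to at most 1.0.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MARGIN 2                      /* assumed: bins cover -2..+2 frames */
#define NUM_BINS   (2 * MAX_MARGIN + 1)   /* centre bin = "just in time" */

/* Hypothetical result type for this sketch; not a libspeex struct. */
typedef struct {
    float late;    /* sum of the negative bins */
    float ontime;  /* the centre bin */
    float early;   /* sum of the positive bins */
} Ratios;

/* margin[] is assumed to hold fractions of recent packets, summing to <= 1.0 */
static Ratios compute_ratios(const float margin[NUM_BINS])
{
    Ratios r = { 0.0f, 0.0f, 0.0f };
    assert(margin != NULL);
    for (int i = 0; i < NUM_BINS; i++) {
        if (i < MAX_MARGIN)
            r.late += margin[i];      /* bins below the centre */
        else if (i > MAX_MARGIN)
            r.early += margin[i];     /* bins above the centre */
        else
            r.ontime = margin[i];     /* the centre bin itself */
    }
    return r;
}
```

With bins { 0, 0.125, 0.125, 0.25, 0.5 } the early ratio comes out at 0.75, so a test for early > 0.7 passes: three quarters of the recent packets were early.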

>> Depending on your chosen transmission method, during network hiccups
>> you'll either have lost packets or they'll come in a burst when the
>> network conditions restore themselves. In either case, after missing 20
>> packets or so the jitter buffer will prepare to "reset", and its new
>> current timestamp will be the timestamp on whatever packet arrives. It
>> will also hold decoding until at least buffer_size frames have arrived.
>
> Right, except it will only actually reset when receiving the first new
> packet.

That's what I meant by "will be the timestamp on whatever packet 
arrives". It could be clearer though, I totally agree.

>> Since it sounds like you're using reliable transmission (packets are not
>> lost), what will happen is that there's a whole stream of packets suddenly
>> arriving, and they'll fill up the buffer much much faster than it's
>> emptied. In fact, you're likely to fill it so fast the buffer runs out of
>> room, meaning the first few packets get dropped to make room for the
>> later ones. However, as the current timestamp was set to the first
>> arriving packet, the decoder won't find the packet it's looking for,
>> meaning the jitter buffer will soon reset again.
>
> I'm not sure here what will happen. Normally, you'd want to make the
> buffer larger than what you expect to have in it. In that case, the
> jitter buffer would likely drop frames until it catches up.

There's a problem with increasing the buffer size, btw: you need to change 
the header, which means you need to recompile both speex and your 
application. So changing the maximum number of buffered packets means you 
can't share libspeex.dll/.so with other applications.

>> To achieve the effect you're describing, you'd need to increase
>> SPEEX_JITTER_MAX_BUFFER_SIZE to the longest delay you're expecting, and
>> then inside the block on line 231 (which says)
>>     if (late_ratio_short + ontime_ratio_short < .005 &&
>>         late_ratio_long + ontime_ratio_long < .01 &&
>>         early_ratio_short > .8)
>> .. add something that multiplies all the margins by 0.75 or so at the
>> end. This will force the jitter buffer to only skip 1 frame at a time and
>> wait a bit before it skips the next one.
>
> Don't think it's necessary since there's already some code that shifts
> the histogram whenever I skip or interpolate a packet. This means that
> if the packets are on average 20 ms in advance when we drop a frame,
> then they will be considered all "on time" (0 ms) after that.

Yes, but assume that after a long steady period, your network latency 
suddenly drops by 100ms. (100ms is excessive, but I quite frequently see 
60ms from users on DSL/Cable connections who also do a bit of P2P on the 
same line.)

What happens now is that the +100ms bin starts increasing steadily,
and suddenly it's enough to skip a frame.

A frame is skipped, and the histogram gets shifted.

On the next call to _get(), it's now the +80ms bin that has that high 
value, and the ratio is still more than high enough to skip a frame.

A frame is skipped, and the histogram gets shifted.

Repeat for +60, +40 and +20. In short, in the time it takes to decode 5 
frames we also skip 5 frames, which means you get 100ms of audio that 
sounds weird.
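
The walk-through above can be reproduced with a toy model. This is only a sketch under my own assumptions: 20ms bins, a short-term-only version of the skip test, no new packets arriving between calls, and helper names (shift_after_skip, should_skip, get_once) that are invented here and are not libspeex functions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MARGIN 5                      /* assumed: +/-5 bins of 20ms each */
#define NUM_BINS   (2 * MAX_MARGIN + 1)

/* After a skipped frame every packet is one frame less early, so move the
 * histogram one bin towards "late". Mass pushed past the latest bin is
 * accumulated there. */
static void shift_after_skip(float margin[NUM_BINS])
{
    margin[0] += margin[1];
    for (int i = 1; i < NUM_BINS - 1; i++)
        margin[i] = margin[i + 1];
    margin[NUM_BINS - 1] = 0.0f;
}

/* Short-term part of the skip test: (almost) nothing late or on time,
 * and more than 80% of the packets early. */
static int should_skip(const float margin[NUM_BINS])
{
    float late_ontime = 0.0f, early = 0.0f;
    for (int i = 0; i <= MAX_MARGIN; i++)
        late_ontime += margin[i];
    for (int i = MAX_MARGIN + 1; i < NUM_BINS; i++)
        early += margin[i];
    return late_ontime < 0.005f && early > 0.8f;
}

/* One simulated _get(): skips at most one frame. Returns 1 if it skipped. */
static int get_once(float margin[NUM_BINS])
{
    assert(margin != NULL);
    if (!should_skip(margin))
        return 0;
    shift_after_skip(margin);
    return 1;
}

/* Count how many consecutive simulated _get() calls skip a frame. */
static int count_consecutive_skips(float margin[NUM_BINS], int max_calls)
{
    int skips = 0;
    while (skips < max_calls && get_once(margin))
        skips++;
    return skips;
}
```

Starting with all the weight in the +100ms bin (index MAX_MARGIN + 5), five calls in a row skip a frame, and the sixth does not because the weight has reached the on-time bin.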

It works well for me though: I prefer that sudden network jumps result in 
an audible "jump" in dialogue rather than users not being sure that 
latency is at an absolute minimum.

Come to think of it, it might actually be better if it just skipped 5 
frames at once. Might be doable by shifting the histogram, and if it still 
meets the criteria, keep skipping and shifting it until it doesn't meet 
the criteria anymore. More work though, and less clear code.
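
A rough sketch of that "keep skipping and shifting" idea, using a toy histogram rather than the real jitter buffer state. The bin layout, the short-term-only criteria and the function names are all assumptions for illustration; none of this is existing libspeex code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MARGIN 5                      /* assumed: +/-5 bins of 20ms each */
#define NUM_BINS   (2 * MAX_MARGIN + 1)

/* Short-term skip criteria: (almost) nothing late or on time, >80% early. */
static int meets_skip_criteria(const float margin[NUM_BINS])
{
    float late_ontime = 0.0f, early = 0.0f;
    for (int i = 0; i <= MAX_MARGIN; i++)
        late_ontime += margin[i];
    for (int i = MAX_MARGIN + 1; i < NUM_BINS; i++)
        early += margin[i];
    return late_ontime < 0.005f && early > 0.8f;
}

/* Move the histogram one bin towards "late", as after one skipped frame. */
static void shift_histogram(float margin[NUM_BINS])
{
    margin[0] += margin[1];
    for (int i = 1; i < NUM_BINS - 1; i++)
        margin[i] = margin[i + 1];
    margin[NUM_BINS - 1] = 0.0f;
}

/* Shift until the criteria no longer hold and return how many frames the
 * caller should drop in this single call. Capped at MAX_MARGIN frames. */
static int skip_all_at_once(float margin[NUM_BINS])
{
    int skipped = 0;
    assert(margin != NULL);
    while (skipped < MAX_MARGIN && meets_skip_criteria(margin)) {
        shift_histogram(margin);
        skipped++;
    }
    return skipped;
}
```

With all the weight in the +60ms bin this returns 3 in one call instead of skipping one frame on each of three calls. If a quarter of the packets are only 20ms early, it stops after a single frame, because those packets would be on time after the first shift.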