[Speex-dev] How does the jitter buffer "catch up"?

Sun Sep 18 09:25:21 PDT 2005

> 
> Is it possible to give a short hint about how the jitter buffer would
> "catch up" when network conditions have been bad and then get better?
>
> I'm using the jitter buffer with success now, but sometimes I have a
> long delay that's caused by bad network conditions and then later when
> the conditions get better, I would think we would want the audio to
> gradually catch up with real-time to minimize the latency in the voice?
>
> Is it not realistic to expect the jitter buffer to do this sort of
> "catching up" (of course doing so by "skipping" some of the older
> received audio I guess)?
>
> I understand the basic idea of the jitter.c code but am apparently not
> bright enough to get the whole point of the short- and long-term margin
> values etc. Just wonder if it's possible to get a short description of
> each of these variables, their purpose and how they apply to the whole
> jitter buffer functionality?
>
> Thank you very much.
>
> Baldvin
>
>

FYI: The below is just my interpretation of the code, I might be wrong.

Each time a new packet arrives, the jitter buffer calculates how far ahead 
or behind the "current" timestamp it is; this is called arrival_margin. 
The "current" timestamp is simply that of the last frame successfully 
decoded.

It maintains two lists of bins for these margins: the short-term and the 
long-term margin.

Think of the bins like this:

-60ms -40ms -20ms 0ms +20ms +40ms +60ms

When a packet arrives, the bin matching its arrival_margin is 
increased, so if this packet was 40ms after the current timestamp, the 
+40ms bin would be increased. If this packet arrived 60ms too late (and 
hence is useless), the -60ms bin would increase.

early_ratio_XX is the sum of all the positive bins.
late_ratio_XX is the sum of all the negative bins.

The difference between _long and _short is just how fast they change.

If a packet has timestamp outside the bins, it's not used for calculation.
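
To make the bins and ratios a bit more concrete, here is a minimal sketch 
of the idea. It is not the actual jitter.c code: the names MarginHistogram, 
margin_to_bin, histogram_update and the three *_ratio helpers are made up 
for illustration, the 7 bins of 20ms match the example above rather than 
the real buffer, and the exponential decay is only my assumption about how 
the _short and _long versions differ in "how fast they change". Treating 
the 0ms bin as the "ontime" ratio is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BINS     7    /* -60ms .. +60ms, as in the example above */
#define BIN_WIDTH_MS 20   /* assumed: one bin per 20ms frame */
#define CENTER_BIN   3    /* index of the 0ms bin */

/* Hypothetical histogram of arrival margins. */
typedef struct {
    float bins[NUM_BINS];
} MarginHistogram;

/* Map a margin (a multiple of BIN_WIDTH_MS) to a bin, or -1 if outside. */
static int margin_to_bin(int margin_ms)
{
    int idx;
    assert(margin_ms % BIN_WIDTH_MS == 0);
    idx = margin_ms / BIN_WIDTH_MS + CENTER_BIN;
    return (idx >= 0 && idx < NUM_BINS) ? idx : -1;
}

/* Fade all bins by `decay`, then credit the bin this packet landed in.
 * A decay close to 1 changes slowly (long-term); a smaller one changes
 * quickly (short-term). Packets outside the bins are ignored. */
static void histogram_update(MarginHistogram *h, int margin_ms, float decay)
{
    size_t i;
    int idx = margin_to_bin(margin_ms);
    if (idx < 0)
        return;
    for (i = 0; i < NUM_BINS; i++)
        h->bins[i] *= decay;
    h->bins[idx] += 1.0f - decay;
}

/* Sum of the negative bins: packets that arrived too late. */
static float late_ratio(const MarginHistogram *h)
{
    float sum = 0.0f;
    size_t i;
    for (i = 0; i < CENTER_BIN; i++)
        sum += h->bins[i];
    return sum;
}

/* The 0ms bin: packets that arrived just in time (assumed definition). */
static float ontime_ratio(const MarginHistogram *h)
{
    return h->bins[CENTER_BIN];
}

/* Sum of the positive bins: packets that arrived with time to spare. */
static float early_ratio(const MarginHistogram *h)
{
    float sum = 0.0f;
    size_t i;
    for (i = CENTER_BIN + 1; i < NUM_BINS; i++)
        sum += h->bins[i];
    return sum;
}
```

You would keep two of these, one updated with a small decay and one with a 
decay near 1, and feed both the same margin for every arriving packet.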

Now, clearly, if early_ratio is high and late_ratio is very low, the 
buffer is buffering more than it needs to, so it will skip a frame to 
reduce latency. Conversely, if late_ratio is even marginally above 0, more 
buffering is needed, and it duplicates a frame. This decision is made when 
decoding.
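
Roughly, the decision looks like the sketch below. The skip condition uses 
the thresholds from the jitter.c line quoted further down in this mail; 
everything else is made up for illustration: the Ratios struct, the 
JitterAction enum, choose_action, and especially late_threshold, which 
stands in for whatever "marginally above 0" means in the real code.

```c
#include <assert.h>

/* Hypothetical result of the per-decode decision. */
typedef enum {
    ACTION_NONE,
    ACTION_SKIP_FRAME,       /* buffering too much: drop a frame */
    ACTION_DUPLICATE_FRAME   /* buffering too little: repeat a frame */
} JitterAction;

/* Hypothetical bundle of the three ratios for one time scale. */
typedef struct {
    float late;
    float ontime;
    float early;
} Ratios;

/* Decide what to do before decoding the next frame.
 * late_threshold is an assumed parameter, not a value from jitter.c. */
static JitterAction choose_action(Ratios s, Ratios l, float late_threshold)
{
    assert(late_threshold >= 0.0f);

    /* Any noticeable share of late packets means we need more buffering. */
    if (s.late > late_threshold || l.late > late_threshold)
        return ACTION_DUPLICATE_FRAME;

    /* Almost everything arrives early: we can afford to drop latency. */
    if (s.late + s.ontime < .005f &&
        l.late + l.ontime < .01f &&
        s.early > .8f)
        return ACTION_SKIP_FRAME;

    return ACTION_NONE;
}
```

Here s holds the short-term ratios and l the long-term ones.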

Depending on your chosen transmission method, during network hiccups 
you'll either have lost packets or they'll come in a burst when the 
network conditions restore themselves. In either case, after missing 20 
packets or so the jitter buffer will prepare to "reset", and its new 
current timestamp will be the timestamp on whatever packet arrives. It 
will also hold decoding until at least buffer_size frames have arrived.
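
As I read it, the reset behaves something like the sketch below. This is 
only my mental model, not the jitter.c code: ResyncState and the three 
functions are invented names, RESET_AFTER_LOST pins "20 packets or so" to 
exactly 20, and counting buffered frames with a plain counter is a 
simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RESET_AFTER_LOST 20   /* "20 packets or so"; exact value assumed */

/* Hypothetical state needed to model the reset. */
typedef struct {
    int  current_timestamp;  /* timestamp the decoder is currently at */
    int  lost_count;         /* consecutive frames the decoder missed */
    int  buffered;           /* frames received since the last reset */
    int  buffer_size;        /* frames to collect before decoding resumes */
    bool reset_pending;      /* next arrival defines the new timestamp */
} ResyncState;

/* Called when the decoder asked for a frame that was not in the buffer. */
static void on_frame_missing(ResyncState *s)
{
    assert(s != NULL);
    if (++s->lost_count >= RESET_AFTER_LOST)
        s->reset_pending = true;
}

/* Called for every arriving packet. */
static void on_packet_arrival(ResyncState *s, int timestamp)
{
    assert(s != NULL);
    if (s->reset_pending) {
        /* Resynchronise on whatever packet shows up first. */
        s->current_timestamp = timestamp;
        s->lost_count = 0;
        s->buffered = 0;
        s->reset_pending = false;
    }
    s->buffered++;
}

/* Decoding is held until buffer_size frames have arrived. */
static bool can_decode(const ResyncState *s)
{
    assert(s != NULL);
    return !s->reset_pending && s->buffered >= s->buffer_size;
}
```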

Since it sounds like you're using reliable transmission (packets are not 
lost), what will happen is that there's a whole stream of packets suddenly 
arriving, and they'll fill up the buffer much faster than it's 
emptied. In fact, you're likely to fill it so fast the buffer runs out of 
room, meaning the first few packets get dropped to make room for the 
later ones. However, as the current timestamp was set to the first 
arriving packet, the decoder won't find the packet it's looking for, 
meaning the jitter buffer will soon reset again.

So no, it doesn't "catch up"; it tries to keep latency to an absolute 
minimum whatever the circumstances, so most of the late frames will be 
dropped.

To achieve the effect you're describing, you'd need to increase
SPEEX_JITTER_MAX_BUFFER_SIZE to the longest delay you're expecting, and 
then inside the block on line 231 (which says)
    if (late_ratio_short + ontime_ratio_short < .005 &&
        late_ratio_long + ontime_ratio_long < .01 &&
        early_ratio_short > .8)
.. add something that multiplies all the margins by 0.75 or so at the 
end. This will force the jitter buffer to skip only 1 frame at a time and 
wait a bit before it skips the next one.
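
Something along these lines, as a sketch only. scale_margins and 
sum_margins are hypothetical helpers, not functions from jitter.c, and you 
would call the scaling on both the short-term and the long-term margin 
arrays, whatever they are called in your version of the source.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply every bin by `factor` (e.g. 0.75f) after a frame was skipped,
 * so the early ratio falls back below the skip threshold and has to be
 * rebuilt by newly arriving packets before another skip can happen. */
static void scale_margins(float *margins, size_t count, float factor)
{
    size_t i;
    assert(margins != NULL);
    for (i = 0; i < count; i++)
        margins[i] *= factor;
}

/* Sum of a run of bins, e.g. the positive ones for the early ratio. */
static float sum_margins(const float *margins, size_t count)
{
    float sum = 0.0f;
    size_t i;
    assert(margins != NULL);
    for (i = 0; i < count; i++)
        sum += margins[i];
    return sum;
}
```

For example, if the early bins summed to 1.0 before the skip, they sum to 
0.75 afterwards, which is below the .8 needed by the condition above, so 
the next skip has to wait until enough new early packets have arrived.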