[Speex-dev] Jitter buffer
Steve Kann
stevek at stevek.com
Wed Nov 17 09:28:39 PST 2004
Jean-Marc Valin wrote:
>>In particular, (I'm not really sure, because I don't thoroughly
>>understand it yet) I don't think your jitterbuffer handles:
>>
>>DTX: discontinuous transmission.
>>
>>
>
>That is dealt with by the codec, at least for Speex. When it stops
>receiving packets, it already knows whether it's in DTX/CNG mode.
>
>
[skipping a bunch of implementation details of the speex_jb that I
haven't studied enough to respond to accurately; I'll get back to them]
I guess I have to look at it in more depth. So, if I send packets at 20,
40, 60, 80, then stop until 200, 220, 240, 260, won't this jb get confused?
Or is it relying on Speex's in-band signalling, so that it asks Speex to
predict what it thinks are lost frames, and Speex knows that the
interpolation should be silence (or CNG)?
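To make that concrete, here's roughly the receive loop I have in mind.
jb_get() and play() are made-up placeholder names, not the real speex_jb
calls; speex_decode_int() with NULL bits is Speex's actual "conceal a
missing frame" call:

#include <speex/speex.h>

#define FRAME_MS 20

/* jb_get() is a made-up placeholder for "is there a packet due at this
 * timestamp?" -- not the real speex_jb API. */
extern int  jb_get(void *jb, int timestamp, char *data);
extern void play(short *pcm, int nsamp);

void receive_loop(void *jb, void *dec, int frame_size)
{
   SpeexBits bits;
   short pcm[640];
   char data[1024];
   int ts;

   speex_bits_init(&bits);
   for (ts = 0; ; ts += FRAME_MS) {
      int len = jb_get(jb, ts, data);
      if (len > 0) {
         speex_bits_read_from(&bits, data, len);
         speex_decode_int(dec, &bits, pcm);
      } else {
         /* Nothing queued for ts=100..180 in my example: is that loss
          * (interpolate) or DTX (emit CNG)?  Only the decoder's in-band
          * state can tell the difference. */
         speex_decode_int(dec, NULL, pcm);
      }
      play(pcm, frame_size);
   }
}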
>>Because we need to synchronize multiple speakers in the conference:
>>On the incoming side, each incoming "stream" has its own timebase and
>>timestamps, and jitter. If we just passed that through (even if we
>>adjusted the timebases), the different jitter characteristics of each
>>speaker would create chaos for listeners, and they'd end up with
>>overlapping frames, etc..
>>
>>
>
>Assuming the clocks aren't synchronized (and skewed), I don't see what
>you're gaining in doing that on (presumably) a server in the middle
>instead of directly at the other end.
>
>
In the conference app, every frame time, I need to:
1) Determine who is presently speaking; for some clients, we use remote
VAD and DTX. For some clients, we do VAD locally.
2) Notify an external application about changes in who is speaking.
3) Send the appropriate frames to each participant, encoded properly for
each.
For the one-speaker case, all participants except the speaker get the
frame.
If the participant and speaker use the same codec, we just send the
same frame to them. If they don't, we transcode the frame for the new
codec type. (We reuse the transcoded frame for each participant with
the same codec.)
For the two- (or more-) speaker case, each speaker gets the other
speakers' frames (transcoded if needed), and we mix and re-encode the
sum of all speakers for everyone else.
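Roughly, the per-frame loop looks like the sketch below. All the names
and types are made up for illustration; the single-speaker pass-through
and the transcode cache described above are left out to keep it short:

#define FRAME_SAMPLES 160            /* 20 ms at 8 kHz */
#define MAX_MEMBERS   64

struct member {
   struct member *next;
   int   codec;                      /* codec id for this participant */
   int   speaking;                   /* from remote DTX or from local VAD */
   short pcm[FRAME_SAMPLES];         /* this member's decoded frame for this tick */
   void *jb;                         /* per-member jitter buffer (used further below) */
};

struct conf {
   struct member *members;
   struct member *speakers[MAX_MEMBERS];
   int nspeakers;
};

/* assumed to exist elsewhere; names are made up */
extern void notify_speaker_changes(struct conf *c);
extern void send_encoded(struct member *to, const short *pcm, int codec);

void conference_tick(struct conf *c)
{
   struct member *m;
   short mix[FRAME_SAMPLES];
   int i, n = 0;

   /* 1) figure out who is speaking this frame */
   for (m = c->members; m; m = m->next)
      if (m->speaking)
         c->speakers[n++] = m;
   c->nspeakers = n;

   /* 2) tell the external application about speaker changes */
   notify_speaker_changes(c);

   /* 3) mix all the current speakers... */
   for (i = 0; i < FRAME_SAMPLES; i++) {
      int j, sum = 0;
      for (j = 0; j < n; j++)
         sum += c->speakers[j]->pcm[i];
      mix[i] = (short)sum;           /* real code would saturate here */
   }

   /* ...and send every listener the mix minus their own voice, encoded
    * for their codec (re-encoding once per codec and reusing it, as
    * described above, is omitted here). */
   for (m = c->members; m; m = m->next) {
      short out[FRAME_SAMPLES];
      for (i = 0; i < FRAME_SAMPLES; i++)
         out[i] = m->speaking ? (short)(mix[i] - m->pcm[i]) : mix[i];
      send_encoded(m, out, m->codec);
   }
}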
In the application we're using, there can be a _lot_ of jitter (not just
the 200ms worth that your jitterbuffer seems to account for, but 1
second or more), and if we don't dejitter first, we can easily end up
with cases where:
a) We send out consecutive frames for different speakers with overlapping
timestamps.
b) Different speakers have different clock skews, and over time, these
become very significant. In that case, as speakers change, listeners
see this as a _huge_ jitter (i.e., many seconds' worth).
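So the approach is a jitter buffer per incoming stream, filled against
the sender's timestamps and drained on the conference's own clock. A
rough sketch, reusing the made-up types from the sketch above
(jb_put()/jb_get() are placeholders, not any particular jitter buffer
API):

/* Assumes struct member/struct conf from above, with m->jb being that
 * member's own jitter buffer. */
extern void jb_put(void *jb, const short *pcm, unsigned int sender_ts);
extern int  jb_get(void *jb, unsigned int conf_ts, short *pcm);  /* 1 = frame, 0 = gap */

/* Network side: buffer against the SENDER's timestamps, so any jitter
 * or clock skew on that stream is absorbed here. */
void on_incoming_frame(struct member *m, const short *pcm, unsigned int sender_ts)
{
   jb_put(m->jb, pcm, sender_ts);
}

/* Conference side: once per frame time, pull from every member's buffer
 * on the CONFERENCE clock, so listeners only ever see one consistent
 * timebase. */
void gather_inputs(struct conf *c, unsigned int conf_ts)
{
   struct member *m;
   for (m = c->members; m; m = m->next)
      m->speaking = jb_get(m->jb, conf_ts, m->pcm);
}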
-SteveK