[Speex-dev] Jitter buffer
Steve Kann
stevek at stevek.com
Wed Nov 17 09:28:39 PST 2004
Jean-Marc Valin wrote:
>>In particular, (I'm not really sure, because I don't thoroughly
>>understand it yet) I don't think your jitterbuffer handles:
>>
>>DTX: discontinuous transmission.
>>
>>
>
>That is dealt with by the codec, at least for Speex. When it stops
>receiving packets, it already knows whether it's in DTX/CNG mode.
>
>
[skipping a bunch of implementation details of the speex_jb that I
haven't studied enough to respond to accurately; I'll get back to them]
I guess I have to look at it in more depth. So, if I send packets at 20,
40, 60, 80, then stop until 200, 220, 240, 260, won't this jb get confused?
Or is it relying on Speex's in-band signalling, so that it asks Speex to
predict what it thinks are lost frames, and Speex knows that the
interpolation should be silence (or CNG)?
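To make that concrete, here's roughly the receive loop I have in mind.
jb_get() and play() are made-up placeholder names, not the real speex_jb
calls; speex_decode_int() with NULL bits is Speex's actual "conceal a
missing frame" call:

#include <speex/speex.h>

#define FRAME_MS 20

/* jb_get() is a made-up placeholder for "is there a packet due at this
 * timestamp?" -- not the real speex_jb API. */
extern int  jb_get(void *jb, int timestamp, char *data);
extern void play(short *pcm, int nsamp);

void receive_loop(void *jb, void *dec, int frame_size)
{
   SpeexBits bits;
   short pcm[640];
   char data[1024];
   int ts;

   speex_bits_init(&bits);
   for (ts = 0; ; ts += FRAME_MS) {
      int len = jb_get(jb, ts, data);
      if (len > 0) {
         speex_bits_read_from(&bits, data, len);
         speex_decode_int(dec, &bits, pcm);
      } else {
         /* Nothing queued for ts=100..180 in my example: is that loss
          * (interpolate) or DTX (emit CNG)?  Only the decoder's in-band
          * state can tell the difference. */
         speex_decode_int(dec, NULL, pcm);
      }
      play(pcm, frame_size);
   }
}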
>>Because we need to synchronize multiple speakers in the conference:
>>On the incoming side, each incoming "stream" has its own timebase and
>>timestamps, and jitter. If we just passed that through (even if we
>>adjusted the timebases), the different jitter characteristics of each
>>speaker would create chaos for listeners, and they'd end up with
>>overlapping frames, etc..
>>
>>
>
>Assuming the clocks aren't synchronized (and skewed), I don't see what
>you're gaining in doing that on (presumably) a server in the middle
>instead of directly at the other end.
>
>
In the conference app, every frame time, I need to:
1) Determine who is presently speaking; for some clients, we use remote
VAD and DTX. For some clients, we do VAD locally.
2) Notify an external application about changes in who is speaking.
3) Send the appropriate frames to each participant, encoded properly for
each.
For the one-speaker case, all participants except the speaker get the
frame.
If the participant and speaker use the same codec, we just send the
same frame to them. If they don't, we transcode the frame for the new
codec type. (We reuse the transcoded frame for each participant with
the same codec.)
For the two- (or more-) speaker case, each speaker gets the other
speakers' frames (transcoded if needed), and we mix and re-encode the
sum of all speakers for everyone else.
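Roughly, the per-frame loop looks like the sketch below. All the names
and types are made up for illustration; the single-speaker pass-through
and the transcode cache described above are left out to keep it short:

#define FRAME_SAMPLES 160            /* 20 ms at 8 kHz */
#define MAX_MEMBERS   64

struct member {
   struct member *next;
   int   codec;                      /* codec id for this participant */
   int   speaking;                   /* from remote DTX or from local VAD */
   short pcm[FRAME_SAMPLES];         /* this member's decoded frame for this tick */
   void *jb;                         /* per-member jitter buffer (used further below) */
};

struct conf {
   struct member *members;
   struct member *speakers[MAX_MEMBERS];
   int nspeakers;
};

/* assumed to exist elsewhere; names are made up */
extern void notify_speaker_changes(struct conf *c);
extern void send_encoded(struct member *to, const short *pcm, int codec);

void conference_tick(struct conf *c)
{
   struct member *m;
   short mix[FRAME_SAMPLES];
   int i, n = 0;

   /* 1) figure out who is speaking this frame */
   for (m = c->members; m; m = m->next)
      if (m->speaking)
         c->speakers[n++] = m;
   c->nspeakers = n;

   /* 2) tell the external application about speaker changes */
   notify_speaker_changes(c);

   /* 3) mix all the current speakers... */
   for (i = 0; i < FRAME_SAMPLES; i++) {
      int j, sum = 0;
      for (j = 0; j < n; j++)
         sum += c->speakers[j]->pcm[i];
      mix[i] = (short)sum;           /* real code would saturate here */
   }

   /* ...and send every listener the mix minus their own voice, encoded
    * for their codec (re-encoding once per codec and reusing it, as
    * described above, is omitted here). */
   for (m = c->members; m; m = m->next) {
      short out[FRAME_SAMPLES];
      for (i = 0; i < FRAME_SAMPLES; i++)
         out[i] = m->speaking ? (short)(mix[i] - m->pcm[i]) : mix[i];
      send_encoded(m, out, m->codec);
   }
}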
In the application we're using, there can be a _lot_ of jitter (not just
the 200ms worth that your jitterbuffer seems to account for, but 1
second or more), and if we don't dejitter first, we can easily end up
with cases where:
a) We send out consecutive frames for different speakers with overlapping
timestamps.
b) Different speakers have different clock skews, and over time, these
become very significant. In that case, as speakers change, listeners
see this as a _huge_ jitter (i.e., many seconds' worth).
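So the approach is a jitter buffer per incoming stream, filled against
the sender's timestamps and drained on the conference's own clock. A
rough sketch, reusing the made-up types from the sketch above
(jb_put()/jb_get() are placeholders, not any particular jitter buffer
API):

/* Assumes struct member/struct conf from above, with m->jb being that
 * member's own jitter buffer. */
extern void jb_put(void *jb, const short *pcm, unsigned int sender_ts);
extern int  jb_get(void *jb, unsigned int conf_ts, short *pcm);  /* 1 = frame, 0 = gap */

/* Network side: buffer against the SENDER's timestamps, so any jitter
 * or clock skew on that stream is absorbed here. */
void on_incoming_frame(struct member *m, const short *pcm, unsigned int sender_ts)
{
   jb_put(m->jb, pcm, sender_ts);
}

/* Conference side: once per frame time, pull from every member's buffer
 * on the CONFERENCE clock, so listeners only ever see one consistent
 * timebase. */
void gather_inputs(struct conf *c, unsigned int conf_ts)
{
   struct member *m;
   for (m = c->members; m; m = m->next)
      m->speaking = jb_get(m->jb, conf_ts, m->pcm);
}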
-SteveK