[Speex-dev] Jitter buffer
Jean-Marc Valin
jean-marc.valin at usherbrooke.ca
Wed Nov 17 09:36:42 PST 2004
> [skipping a bunch of implementation details of the speex_jb, that I
> haven't studied enough to respond to accurately; I'll get back to
> them]
> I guess I have to look in more depth. So, if I send packets at 20,
> 40, 60, 80, then stop until 200, 220, 240, 260, won't this jb get
> confused? Or, is it relying on in-band signalling of speex so it will
> ask speex to predict what it thinks are lost frames, and speex would
> know that the interpolation should be silence (or CNG).
If a frame is not there on time, the jitter buffer just tells Speex to
make up a frame. Speex will know whether it's a CNG frame or a missing
frame based on previously received frames.
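
As a rough illustration of that behaviour (a sketch only, not the actual
speex_jb code), the playout side can be pictured as a loop that ticks once
per frame time and either decodes the frame it finds or asks the decoder to
synthesize one. The function name `playout`, the 20 ms frame size and the
"decode"/"conceal" labels are assumptions made for this example; with
libspeex the "conceal" branch would correspond to calling speex_decode()
with a NULL bits pointer.

```python
FRAME_MS = 20  # assumed frame duration, matching the 20 ms spacing in the example


def playout(arrived, start_ms, end_ms):
    """Return the action taken at each playout tick.

    arrived: set of timestamps (ms) whose frames are in the buffer
             by the time they are due to be played.
    """
    actions = []
    for ts in range(start_ms, end_ms + 1, FRAME_MS):
        if ts in arrived:
            # Frame is present: hand it to the decoder as usual.
            actions.append((ts, "decode"))
        else:
            # Frame is missing: the decoder makes one up (concealment,
            # or comfort noise if the stream was in DTX).
            actions.append((ts, "conceal"))
    return actions


# The pattern from the question: a burst, a gap, then another burst.
arrived = {20, 40, 60, 80, 200, 220, 240, 260}
actions = playout(arrived, 20, 260)
concealed = [ts for ts, what in actions if what == "conceal"]
```

Here `concealed` holds the ticks that fall in the gap (100 through 180), so
the buffer itself never has to decide whether the gap was loss or silence.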
> In the conference app, every (frame time), I need to:
>
> 1) Determine who is presently speaking; for some clients, we use
> remote VAD and DTX. For some clients, we do VAD locally.
> 2) Notify an external application about changes in speaking
> 3) Send the appropriate frames to each participant, encoded properly
> for each
> For one-speaker case, all participants except the speaker get the
> frame.
I would have to see this in detail anyway...
> For the two-(or more) speaker case, each speaker gets the other
> speaker's frame (transcoded if needed), and we mix and recode the
> summation of each speaker for all others.
You mean you're actually encoding the sum of several voices... You may
have a quality problem here as no speech codec (outside of PCM and maybe
ADPCM) is designed to handle that. Speex at high bit-rate may work, but
it's not optimal.
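
For reference, the mixing scheme described in the quoted text (each speaker
hears everyone but themselves, listeners hear the full sum) can be sketched
on decoded PCM as below. This is only an illustration under assumptions:
frames are lists of 16-bit samples of equal length, and the names
`mix_minus` and `clip16` are made up for the example. The summed output is
what would then be re-encoded, which is where the quality concern applies.

```python
def clip16(sample):
    """Saturate a sample to the signed 16-bit range."""
    return max(-32768, min(32767, sample))


def mix_minus(frames):
    """Mix one frame time of decoded audio.

    frames: dict mapping speaker id -> list of 16-bit PCM samples.
    Returns (per_speaker, for_listeners):
      per_speaker[s] is the sum of all speakers except s,
      for_listeners is the sum of all speakers.
    """
    n = len(next(iter(frames.values())))
    total = [sum(f[i] for f in frames.values()) for i in range(n)]
    per_speaker = {
        who: [clip16(total[i] - own[i]) for i in range(n)]
        for who, own in frames.items()
    }
    for_listeners = [clip16(s) for s in total]
    return per_speaker, for_listeners


frames = {
    "a": [1000, -2000],
    "b": [30000, -30000],
    "c": [5000, 100],
}
per_speaker, for_listeners = mix_minus(frames)
```

With exactly two speakers, `per_speaker` reduces to each one receiving the
other's frame unchanged, so no re-encoding of a sum is needed for them.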
> In the application we're using, there can be a _lot_ of jitter (not
> just the 200ms worth that your jitterbuffer seems to account for, but
> 1 second or more), and if we don't dejitter first, we can easily end
> up with cases where:
Why are you saying 200ms only? If you mean the max buffer size, that can
(and should) be increased easily.
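
To give an idea of the sizes involved, here is a small sketch of how many
frame slots a buffer needs to absorb a given amount of jitter. It assumes
20 ms frames and a hypothetical helper named `slots_needed`; it does not
reflect the actual constants in the jitter buffer source.

```python
import math


def slots_needed(max_jitter_ms, frame_ms=20):
    """Number of frame slots needed to hold max_jitter_ms of audio."""
    return math.ceil(max_jitter_ms / frame_ms)


small = slots_needed(200)    # roughly the 200 ms mentioned in the question
large = slots_needed(1000)   # one second of jitter
```

Going from 200 ms to a full second of jitter only means growing the buffer
from 10 to 50 slots at this frame size.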
> a) We send out subsequent frames for different speakers with
> overlapping timestamps.
> b) Different speakers have different clock skews, and over time, these
> will be very significant. In this case, as speakers change, listeners
> will see this as a _huge_ jitter. (i.e. many seconds worth).
Not sure what you mean here.
Jean-Marc