[Speex-dev] Jitter buffer

Wed Nov 17 09:36:42 PST 2004

> [skipping a bunch of implementation details of the speex_jb, that I
> haven't studied enough to respond to accurately;  I'll get back to
> them]
> I guess I have to look in more depth.  So, if I send packets at 20,
> 40, 60, 80, then stop until 200, 220, 240, 260, won't this jb get
> confused?  Or, is it relying on in-band signalling of speex so it will
> ask speex to predict what it thinks are lost frames, and speex would
> know that the interpolation should be silence (or CNG).

If the frame is not there on time, the jitter buffer just tells Speex to
make up a frame. Speex will know whether it's a CNG frame or a missing
frame based on previously received frames.
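
As a rough illustration of the behaviour described above, here is a minimal
sketch that walks a playout clock over the timestamps from the question (20,
40, 60, 80, then 200, 220, 240, 260) and counts the ticks on which no frame
is available, i.e. the ticks on which the decoder would be asked to
synthesize one. The function names are hypothetical and this is not the
speex_jb code; with libspeex itself, a missing frame is normally signalled
by passing NULL as the bits argument to speex_decode().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a jitter buffer lookup: returns 1 if a packet
 * with timestamp `ts` has arrived, 0 otherwise. */
static int frame_present(const int *ts_list, size_t n, int ts)
{
    for (size_t i = 0; i < n; i++)
        if (ts_list[i] == ts)
            return 1;
    return 0;
}

/* Walk the playout clock from first_ts to last_ts (inclusive) in steps of
 * frame_ms and count the ticks with no frame available. On each such tick
 * a real player would ask the decoder to make up a frame. */
int count_concealed_frames(const int *ts_list, size_t n,
                           int first_ts, int last_ts, int frame_ms)
{
    assert(frame_ms > 0);
    int concealed = 0;
    for (int ts = first_ts; ts <= last_ts; ts += frame_ms)
        if (!frame_present(ts_list, n, ts))
            concealed++;
    return concealed;
}
```

For the timestamps in the question, the five ticks at 100 through 180 have
no frame, so the decoder alone decides whether each one becomes comfort
noise or concealment.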

> In the conference app, every (frame time), I need to:
> 
> 1) Determine who is presently speaking; for some clients, we use
> remote VAD and DTX.  For some clients, we do VAD locally.
> 2) Notify an external application about changes in speaking
> 3) Send the appropriate frames to each participant, encoded properly
> for each 
>      For one-speaker case, all participants except the speaker get the
> frame. 

I would have to see the details anyway...

>     For the two-(or more) speaker case, each speaker gets the other
> speaker's frame (transcoded if needed), and we mix and recode the
> summation of each speaker for all others.

You mean you're actually encoding the sum of several voices... You may
have a quality problem here, as no speech codec (outside of PCM and maybe
ADPCM) is designed to handle that. Speex at a high bit-rate may work, but
it's not optimal.
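
For reference, the mixing step being discussed (each listener receives the
sum of every speaker except themselves, before re-encoding) can be sketched
as below. This is only an assumed illustration on decoded 16-bit PCM; the
function names are hypothetical and nothing here comes from the conference
application or from Speex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit sum into the 16-bit PCM range instead of letting it wrap. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Sum the decoded frames of all speakers except `exclude` into `out`.
 * pcm[k] points to nsamples samples for speaker k. Pass exclude equal to
 * nspeakers (or larger) to mix everyone, e.g. for a pure listener. */
void mix_excluding(const int16_t *const *pcm, size_t nspeakers,
                   size_t nsamples, size_t exclude, int16_t *out)
{
    assert(pcm != NULL && out != NULL);
    for (size_t s = 0; s < nsamples; s++) {
        int32_t acc = 0;
        for (size_t k = 0; k < nspeakers; k++)
            if (k != exclude)
                acc += pcm[k][s];
        out[s] = saturate16(acc);
    }
}
```

The output of such a mix is what would then be handed to the encoder, which
is where the quality concern above applies.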

> In the application we're using, there can be a _lot_ of jitter (not
> just the 200ms worth that your jitterbuffer seems to account for, but
> 1 second or more), and if we don't dejitter first, we can easily end
> up with cases where:

Why do you say only 200 ms? If you mean the maximum buffer size, that can
easily be increased (and should be).
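
As a back-of-the-envelope sketch of what "increasing the buffer" means in
frames: the number of slots needed is the jitter to absorb divided by the
frame duration, rounded up. The helper below is hypothetical and does not
name the actual constant in the jitter buffer source.

```c
#include <assert.h>

/* Number of frame slots needed to absorb max_jitter_ms of jitter when each
 * frame covers frame_ms of audio (integer ceiling division). */
int slots_for_jitter(int max_jitter_ms, int frame_ms)
{
    assert(max_jitter_ms >= 0 && frame_ms > 0);
    return (max_jitter_ms + frame_ms - 1) / frame_ms;
}
```

With 20 ms frames, 200 ms of jitter corresponds to 10 slots, and the 1 second
mentioned in the question to 50.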

> a) We send out subsequent frames for different speakers with
> overlapping timestamps.
> b) Different speakers have different clock skews, and over time, these
> will be very significant.  In this case, as speakers change, listeners
> will see this as a _huge_  jitter.  (i.e. many seconds worth).

Not sure what you mean here.

	Jean-Marc