[Speex-dev] Jitter buffer

Jean-Marc Valin jean-marc.valin at usherbrooke.ca
Wed Nov 17 08:53:13 PST 2004


> In particular (I'm not really sure, because I don't thoroughly
> understand it yet), I don't think your jitterbuffer handles:
> 
> DTX: discontinuous transmission.

That is dealt with by the codec, at least for Speex. When it stops
receiving packets, it already knows whether it's in DTX/CNG mode.

> clock skew: (see discussion, though)

Clock skew is one of the main things I was trying to solve. Actually,
the way my jitter buffer is implemented, it's not even aware of the
difference between clock skew and a (linear) change in network
latency.

> shrink buffer length quickly during silence

Everything is there for that, but I'm not yet looking at
silence/non-silence.

> That may be OK when the jitterbuffer is only used right before the
> audio layer, but I'm still not sure how I can integrate this
> functionality in the places I want to put the jitterbuffer.

I guess we'll need to discuss that in further detail.

> I looked at nb_celp.c, and it seems that it would be pretty messy.
> I'd need to implement a lot of the actual codec just to be able to
> determine the number of frames in a packet.

No, it's one step above nb_celp.c: all you need to implement is 8
functions (init, destroy, process and ctl, for both encode and decode).
It can be done fairly easily. Look at modes.c perhaps. The only struct
that needs to be filled is SpeexMode. Even then, I'm willing to add an
even simpler layer if necessary.

> I think the easiest thing for me is to just stick to one frame per
> "thing" as far as the jitterbuffer is concerned, and then handle
> additional framing for packets at a higher level.

Right now, my jitter buffer assumes a fixed amount of time per frame
(but not per packet). I'm not sure if that's possible.

> Even if we use the "terminator" submode (i.e.
> speex_bits_pack(&encstate->bits, 15, 5); ), it seems hard to find that
> in the bitstream, no?

Well, you just have to know the number of bits for each mode (that's
already in the mode struct, since I use it to skip wideband in some
cases) and do some jumping.

> Because we need to synchronize multiple speakers in the conference:
> On the incoming side, each incoming "stream" has its own timebase and
> timestamps, and jitter.  If we just passed that through (even if we
> adjusted the timebases), the different jitter characteristics of each
> speaker would create chaos for listeners, and they'd end up with
> overlapping frames, etc..

Assuming the clocks aren't synchronized (and are skewed), I don't see
what you gain by doing that on (presumably) a server in the middle
instead of directly at the other end.

	Jean-Marc




More information about the Speex-dev mailing list