<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

Jean-Marc Valin wrote:

<blockquote cite="mid1100710393.3788.42.camel@localhost" type="cite">

  <blockquote type="cite">

    <pre wrap="">In particular, (I'm not really sure, because I don't thoroughly

understand it yet) I don't think your jitterbuffer handles:

DTX: discontinuous transmission.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

That is dealt with by the codec, at least for Speex. When it stops

receiving packets, it already knows whether it's in DTX/CNG mode.

  </pre>

</blockquote>

<br>

[Skipping a bunch of implementation details of the speex_jb that I

haven't studied enough to respond to accurately;&nbsp; I'll get back to them.]<br>

I guess I have to look in more depth.&nbsp; So, if I send packets at 20, 40,

60, 80, then stop until 200, 220, 240, 260, won't this jb get

confused?&nbsp; Or is it relying on Speex's in-band signalling, so that it

asks Speex to predict what it thinks are lost frames, and Speex knows

that the interpolation should be silence (or CNG)?<br>
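<br>
To make the question concrete, here is a rough sketch in Python. All of the names are made up for illustration; this is not the speex_jb API. It assumes 20 ms frames, and only shows that the timestamps alone tell the receiver which slots are empty, not whether they are lost packets or a DTX pause. That distinction has to come from the codec state, which is what you say Speex already tracks.<br>
<pre>
```python
FRAME_MS = 20  # assumed frame duration, matching the 20 ms spacing in the example


def missing_slots(timestamps, frame_ms=FRAME_MS):
    """Return the frame times for which no packet arrived."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        # Every multiple of frame_ms strictly between two arrivals is an
        # empty slot: either a lost packet or a DTX pause.
        gaps.extend(range(prev + frame_ms, cur, frame_ms))
    return gaps


def fill_action(in_dtx):
    """Hypothetical policy: what to play for a slot that has no packet."""
    return "comfort_noise" if in_dtx else "conceal_loss"


arrivals = [20, 40, 60, 80, 200, 220, 240, 260]
print(missing_slots(arrivals))     # [100, 120, 140, 160, 180]
print(fill_action(in_dtx=True))    # comfort_noise
```
</pre>
With the arrival times above, the slots from 100 to 180 have no packet, and the (hypothetical) in_dtx flag is the only thing that decides between comfort noise and loss concealment.<br>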

<br>

<blockquote cite="mid1100710393.3788.42.camel@localhost" type="cite">

  <blockquote type="cite">

    <pre wrap="">Because we need to synchronize multiple speakers in the conference:

On the incoming side, each incoming "stream" has its own timebase and

timestamps, and jitter.  If we just passed that through (even if we

adjusted the timebases), the different jitter characteristics of each

speaker would create chaos for listeners, and they'd end up with

overlapping frames, etc.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Assuming the clocks aren't synchronized (and skewed), I don't see what

you're gaining in doing that on (presumably) a server in the middle

instead of directly at the other end.

  </pre>

</blockquote>

<br>

In the conference app, once per frame time, I need to:<br>

<br>

1) Determine who is presently speaking.&nbsp; For some clients we use remote

VAD and DTX; for others we do VAD locally.<br>

2) Notify an external application about changes in who is speaking.<br>

3) Send the appropriate frames to each participant, encoded properly

for each:<br>

&nbsp;&nbsp;&nbsp;&nbsp; For the one-speaker case, all participants except the speaker get the

frame. <br>

&nbsp;&nbsp;&nbsp;&nbsp; If the participant and speaker use the same codec, we just send

them the same frame. If they don't, we transcode the frame to the

participant's codec (and reuse the transcoded frame for every participant

with the same codec).<br>

<br>

&nbsp;&nbsp;&nbsp;&nbsp; For the two-(or more) speaker case, each speaker gets the other

speaker's frame (transcoded if needed), and we mix and re-encode the

sum of the speakers for everyone else.<br>
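<br>
Step 3 can be sketched roughly as below (Python). This is only an illustration of the routing rules, not our actual code: the function names, the participant names and the codec labels are all placeholders, and for three or more speakers I am assuming each speaker simply gets a mix of the others.<br>
<pre>
```python
def plan_outputs(participants, speakers):
    """Map each participant to (source, mode, codec) for one frame time.

    participants: dict of name -> codec name
    speakers: set of names currently speaking (from remote or local VAD)
    """
    plan = {}
    for name, codec in participants.items():
        # A participant never gets their own audio back.
        heard = sorted(s for s in speakers if s != name)
        if not heard:
            continue  # nothing to send to this participant this frame time
        if len(heard) == 1:
            # Single source: forward that speaker's frame as is, and
            # transcode only if the codecs differ.
            src = heard[0]
            mode = "passthrough" if participants[src] == codec else "transcode"
        else:
            # Several sources: decode, mix, and re-encode (assumption for
            # the case where a speaker has more than one other speaker).
            src = "+".join(heard)
            mode = "mix"
        plan[name] = (src, mode, codec)
    return plan


def encodes_needed(plan):
    """Distinct (source, codec) encodes; identical ones are shared."""
    return {(src, codec) for src, mode, codec in plan.values()
            if mode != "passthrough"}


participants = {"alice": "speex", "bob": "speex", "carol": "gsm", "dave": "gsm"}
one = plan_outputs(participants, {"alice"})
print(one["bob"])            # ('alice', 'passthrough', 'speex')
print(encodes_needed(one))   # {('alice', 'gsm')}
```
</pre>
In the one-speaker example, carol and dave share a single transcode of alice's frame, which is the reuse described above.<br>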

<br>

In the application we're using, there can be a _lot_ of jitter (not

just the 200ms worth that your jitterbuffer seems to account for, but 1

second or more), and if we don't dejitter first, we can easily end up

with cases where:<br>

<br>

a) We send out subsequent frames for different speakers with

overlapping timestamps.<br>

b) Different speakers have different clock skews, and over time the

accumulated drift becomes very significant.&nbsp; In this case, as speakers

change, listeners will see this as a _huge_ jitter (i.e. many seconds' worth).<br>
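<br>
A toy illustration of case (b), in Python. The numbers are invented and the helper names are hypothetical. It assumes the frames have already been dejittered and put in order, so the mixer can re-stamp them on its own clock, one frame per tick, instead of forwarding each sender's timestamps.<br>
<pre>
```python
FRAME_MS = 20  # assumed frame duration


def passthrough_timestamps(frames):
    """Forward each frame with the sender's own timestamp."""
    return [ts for _, ts in frames]


def restamp(frames, frame_ms=FRAME_MS, start_ms=0):
    """Re-stamp dejittered frames on the mixer's clock, one per tick."""
    return [start_ms + i * frame_ms for i in range(len(frames))]


def largest_step(timestamps):
    """Largest jump between consecutive outgoing timestamps, in ms."""
    return max(abs(b - a) for a, b in zip(timestamps, timestamps[1:]))


# (speaker, sender timestamp in ms): speaker B's clock has drifted away
# from speaker A's by a few seconds (invented numbers).
frames = [("A", 1000), ("A", 1020), ("B", 4500), ("B", 4520)]
print(largest_step(passthrough_timestamps(frames)))  # 3480
print(largest_step(restamp(frames)))                 # 20
```
</pre>
With passthrough, the change of speaker shows up at the listener as a jump of several seconds; after re-stamping, every step is one frame time.<br>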

<br>

-SteveK<br>

</body>

</html>