[Speex-dev] Stream Synchronization for Echo Cancellation

Wed Nov 1 13:58:45 PST 2006

>>> In those cases, when you get, let's say, 1000 packets of 20ms from the mic,
>>> you may have only 990 packets of 20ms from the incoming RTP stream.
>>>
>>> Thus, before sending the outgoing mic/RTP stream, you would wait for 1000
>>> incoming packets, where the last packet in fact arrives 10*20ms = 200ms
>>> after it was supposed to. In my experience, I have already seen 4s
>>> of clock deviation each minute between one USB headset and another
>>> sound card....
>>>
>>> In this case, synchronisation is a nightmare. It seems to be a similar
>>> issue to the one described in your link, but the difference is really
>>> unpredictable and the resolution does not seem as simple...
>>>
>>> Does anybody wish to share experience on this?
>>
>> Actually, the jitter buffer in Speex tends to cope relatively well with
>> non-synchronised clocks.
>
> Can you explain why?
>
> My problem is not at all related to non-synchronised local input/output
> clocks: my problem is really the non-synchronised clocks between one
> PC and another...
>
>> The only thing that really doesn't like it is the echo canceller.
>
> In my above case, if I regularly add 10 extra packets to the incoming
> stream (the one that is missing 10 packets), the echo canceller works
> perfectly.
>
If your microphone and speaker clocks are locked, then the echo canceller 
will be happy.  However, the Speex decoder needs to run according to the 
local timing, not according to the RTP packet arrival rate.  Otherwise, the 
output sample stream will over/underrun, and that will kill the echo 
canceller.  That is one function of a jitter buffer.  If you monitor the 
fill level of the buffer, you can drop or duplicate frames when some 
threshold is reached (rather than doing this at fixed intervals based on a 
measured packet arrival rate).  This is less disruptive than having the 
jitter buffer delay rebuild when it overruns/underruns.  In the presence of 
jitter, the measurement gets more difficult, of course.  I do not know how 
the Speex jitter buffer works in this situation, since I use something 
different in my application.
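
As a rough illustration of the fill-level approach, here is a minimal Python
sketch. It is not how the Speex jitter buffer works, and it is not my own
implementation either. The class name, the thresholds (in frames), and the
exponential smoothing used to ride out jitter are all assumptions made for
the example.

```python
class FillLevelMonitor:
    """Decide when to drop or duplicate a frame from the buffer fill level.

    Hypothetical helper: thresholds and smoothing factor are illustrative
    values, not tuned recommendations.
    """

    def __init__(self, low=1.0, high=4.0, alpha=0.05, start=2.5):
        self.low = low          # smoothed fill (frames) below which we duplicate
        self.high = high        # smoothed fill (frames) above which we drop
        self.alpha = alpha      # smoothing factor; smaller = more jitter tolerance
        self.smoothed = start   # running estimate of the average fill level

    def update(self, fill_frames):
        """Feed the current fill level; returns "drop", "duplicate" or "none"."""
        # Exponential moving average, so short jitter bursts do not trigger
        # a correction but a steady clock drift eventually does.
        self.smoothed += self.alpha * (fill_frames - self.smoothed)
        if self.smoothed > self.high:
            self.smoothed -= 1.0   # the caller is about to remove one frame
            return "drop"
        if self.smoothed < self.low:
            self.smoothed += 1.0   # the caller is about to add one frame
            return "duplicate"
        return "none"
```

You would call update() once per playback tick with the number of buffered
frames and act on the result. The point of the smoothing is only that jitter
moves the level up and down around a mean, while a clock mismatch moves the
mean itself.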

> I was just trying to comment on the paper you linked to: my opinion is
> that the problem does not only come from local hardware (where
> non-synchronised clocks lead to problems with AEC). There are other
> problems with different clocks on two remote pieces of hardware (where the
> lack of synchronisation does not lead to an AEC issue, but leads to missing
> data (sometimes no data is played) or too much data (the application has
> to discard some, or else the voice delay grows because a buffer is growing)).
>
> The only way would be to extend or reduce frames, so my question was:
> has anybody here ever tried this in real time on audio streaming?
> Any simple idea to do this?

If there is no jitter or packet loss, then it is easy:  just buffer a couple 
of frames, and then repeat a frame if you run out, and drop a frame if the 
buffer gets too full.
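
Something along these lines, as a Python sketch only. The class and parameter
names are made up for the example, the prefill and maximum depth are
arbitrary, and it assumes frames arrive in order with no loss.

```python
from collections import deque


class SimpleFrameBuffer:
    """Repeat a frame on underrun, drop a frame on overrun (no jitter/loss)."""

    def __init__(self, prefill=2, max_frames=4):
        self.frames = deque()
        self.prefill = prefill        # frames to collect before playback starts
        self.max_frames = max_frames  # depth at which the oldest frame is dropped
        self.primed = False
        self.last = None              # most recently played frame

    def push(self, frame):
        """Call when a decoded frame arrives from the network."""
        if len(self.frames) >= self.max_frames:
            self.frames.popleft()     # buffer too full: drop the oldest frame
        self.frames.append(frame)
        if len(self.frames) >= self.prefill:
            self.primed = True

    def pop(self, silence=b""):
        """Call once per local playback tick (e.g. every 20 ms)."""
        if not self.primed:
            return silence            # still filling the initial couple of frames
        if self.frames:
            self.last = self.frames.popleft()
        # If the buffer was empty we fall through and repeat the last frame.
        return self.last
```

The important part is that pop() is driven by the local sound card clock and
push() by packet arrival, so the buffer absorbs the difference between the
two rates.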

You need some kind of jitter buffer, certainly, but I wonder if the USB 
problem is really the sample clock rate.  You are talking about a 1% error 
in frequency.  A typical spec for voiceband modems (e.g. V.32) is 0.01%.  To 
get a 1% error, the USB device would have to have a resistor/capacitor 
timing circuit instead of a crystal oscillator, and that is hard to believe. 
I suppose in a cheap headset, anything is possible.
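
To put numbers on that, here is the arithmetic as a small Python sketch. The
function names are just for illustration; the inputs are the 1000 vs. 990
packets of 20 ms from the example above and the 0.01% modem figure.

```python
def clock_error_percent(expected_packets, received_packets):
    """Relative rate mismatch between the two streams, in percent."""
    return 100.0 * (expected_packets - received_packets) / expected_packets


def drift_ms(expected_packets, received_packets, frame_ms=20):
    """Accumulated time offset once expected_packets have been produced."""
    return (expected_packets - received_packets) * frame_ms


def seconds_per_frame_slip(error_percent, frame_ms=20):
    """How often one whole frame must be dropped or repeated to keep up."""
    return (frame_ms / 1000.0) / (error_percent / 100.0)


print(clock_error_percent(1000, 990))   # mismatch in percent for the example
print(drift_ms(1000, 990))              # offset in ms after 1000 mic packets
print(seconds_per_frame_slip(1.0))      # correction interval at 1% error
print(seconds_per_frame_slip(0.01))     # correction interval at 0.01% error
```

At 1% you would be dropping or repeating a 20 ms frame about every 2 seconds;
at 0.01% it is about once every 200 seconds.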

Is it possible that there is a processing problem such that samples are 
being dropped on the USB interface, which creates the apparently low sample 
rate (only 99 of every 100 samples are encoded)?  You could still compensate 
for that with jitter buffer adjustments, as above, but the audio would 
certainly be degraded.

- Jim