[xiph-rtp] about theora-over-rtp draft

Mon Jul 24 03:31:02 PDT 2006

>
> Just a request: configure your client to break the lines at the 79th
> col, thank you
I'm sorry! It is apparently already configured that way, but perhaps it's 
buggy... I'm sending this one in plain-text mode in case that works better.

>
> The rfc assumes that people will read the normative refs, I'd just move
> the theora ref to the normative section.

I think anything you say in an RFC that helps implementors is really welcome, 
and helps avoid misinterpretations and buggy implementations.

>
> you lose something at the start of a fragment chain, what would you do?
>
> If it were just another single packet (marker bit not set) you just
> ignore this loss, if you had lost the start of a fragmented frame you'd
> feed the decoder with something more or less unexpected.
>
>          1         2 time
> 12345678901234567890
> NMNNNMMMMMMNNMMMNNNN bit Marked or Notmarked
>  ^lost      ^lost
> _||__|||||||_||||___ _fullframe |frags
> FSEFFSCCCCCEFSCCEFFF Fullframe, Startfrag, Contfrag, Endfrag
>
> in my case you know what's lost.

But you can do the same thing using the sequence number. The marker bit says: 
this packet carries the end of a frame. As a consequence a full frame has the 
marker bit set (which is not the case in your example). Sequence number 
discontinuities can be handled by the application by dropping packets until a 
marked packet is received, then restarting the accumulation algorithm I 
described.
The only advantage I see in your technique (I've just realized it) is that you 
can differentiate a full frame from an "end of frame with beginning lost". 
But anyway you could also decide to always pass a marked packet to the decoder 
(regardless of whether it is a full frame or just the end of a frame whose 
start may have been lost) and see what happens. If it is not a full frame the 
theora decoder won't be able to decode it.
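
The marker-bit approach described above can be sketched roughly as follows. 
This is only an illustration in Python, not code from the draft or from any 
existing RTP stack: the names RtpPacket and FrameAssembler are hypothetical, 
packets are assumed to arrive in order (no reordering), and each packet is 
assumed to carry at most one frame or one piece of a frame.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RtpPacket:
    """Hypothetical, simplified view of a received RTP packet."""
    seq: int        # 16-bit RTP sequence number
    marker: bool    # True when the packet carries the end of a frame
    payload: bytes  # one frame, or one piece of a frame


class FrameAssembler:
    """Accumulates payloads until a marked packet closes the frame."""

    def __init__(self) -> None:
        self._expected_seq: Optional[int] = None
        self._parts: List[bytes] = []
        self._resync = False  # True while skipping a damaged frame

    def push(self, pkt: RtpPacket) -> Optional[bytes]:
        """Return a complete frame, or None if nothing is ready yet."""
        if self._expected_seq is not None and pkt.seq != self._expected_seq:
            # Sequence discontinuity: throw away what was accumulated and
            # skip everything up to and including the next marked packet.
            self._parts.clear()
            self._resync = True
        self._expected_seq = (pkt.seq + 1) & 0xFFFF  # 16-bit wraparound

        if self._resync:
            if pkt.marker:
                self._resync = False  # the next packet starts a new frame
            return None

        self._parts.append(pkt.payload)
        if not pkt.marker:
            return None
        frame = b"".join(self._parts)
        self._parts.clear()
        return frame
```

Note that this sketch also drops a marked packet that directly follows a loss, 
even when it was in fact a full frame; that is exactly the ambiguity discussed 
above. The alternative is to hand such a packet to the decoder anyway and let 
it reject the data if it is not a full frame.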

>
>    At any time, either agent MAY generate a new offer that updates the
>    session.  However, it MUST NOT generate a new offer if it has
>    received an offer which it has not yet answered or rejected.
>    Furthermore, it MUST NOT generate a new offer if it has generated a
>    prior offer for which it has not yet received an answer or a
>    rejection.  If an agent receives an offer after having sent one, but
>    before receiving an answer to it, this is considered a "glare"
>    condition. [rfc3264 - 4 Protocol Operation]

This paragraph talks about renegotiation. This technique is actually handled 
by very few implementations, and it was not really designed for complex call 
setup but for changing call parameters during a call: for example port and 
codec changes when entering a conference after a simple call, or a client 
that decides to stop the video session but keep audio.

I think it is worth having an SDP scheme that allows the session to be set up 
efficiently and according to the network capabilities with a minimal message 
exchange, I mean just an offer and an answer. This is possible in RFC2429-bis 
and with all audio codecs, so why shouldn't it be possible with theora?

> But if you send _multiple_ offers the client can pick which one they
> like most, keep in mind that _nothing_ is preventing you to:
>
> - use adaptative techs like the ones supported by nemesi/fenice[1]
> - use something like codec-param (I'm thinking about adding it soon)
In the a=fmtp line?
> - keep the offer-answer ballet till you get to agree.
Too complex, and not the goal of RFC3264.

>
> > But how will the decoder side inform the encoder side of the config it
> > has chosen?
>
> The only remaining methods/configurations in the answer.
So this assumes a symmetric stream, since only the answerer gets to choose 
the format parameters... Not good.

> > Let's imagine the SDP offerer suggests 3 theora configs using various
> > bandwidth constraints.
> > The SDP answerer can eventually tell which one of the configs it chooses.
> > But as it also sends a theora stream, it will also suggest various
> > theora configs in the same way.
> > Unfortunately the SDP offerer has no way to indicate the config it
> > chooses, as there is no third message.
>
> why not?

It's an offer-answer model, not offer-answer-acknowledgment. A third message 
would mean a renegotiation, i.e. the answerer answers and then re-INVITEs with 
a new SDP offer, and the initial offerer then has to answer... a great moment 
of coding when implementing this...
Theora risks not becoming popular if you force implementors to do such things...

>
>
> > Lastly, I forgot to mention the possibility offered by the draft to
> > put several theora frames in a single rtp packet. My opinion is that this
> > is completely useless because RTP is made for real-time streaming, and
> > putting several theora frames in the same rtp packet means buffering
> > theora frames before sending them, thus it's no longer real-time.
> > Although this possibility is often offered by audio codecs to save
> > bandwidth by reducing rtp overhead, it has never been used for video
> > codecs, simply because the huge size of video packets compared to the rtp
> > header makes the bandwidth gain very limited. Thus it's simpler to rely
> > on the underlying protocol (UDP/RTP) to know the size of the video
> > packets, and assume that each RTP packet contains at most 1 video
> > frame.
>
> You forgot that:
>  - using rtp-theora for youtube-like applications is not so far from
> possibility (you want low bitrate with nice but not perfect quality)
>  - how many microseconds/milliseconds/seconds do you buffer in those not so
> real-time applications?
At 15 fps this would add about 66 milliseconds of buffering. As the webcam 
usually delivers frames with roughly 100 milliseconds of latency, the result 
is not very good. Since voice can usually be transmitted with around 60 
milliseconds of latency (soundcard + encoding), this would force the 
application to delay voice by about 100 additional milliseconds to keep audio 
and video synchronised. Then add the network transmission delay, around 50 
milliseconds on xDSL, and you get around 210 milliseconds; then add 50 more 
milliseconds for jitter compensation at the other end. That is a 260 
millisecond end-to-end delay, and users won't accept that. In classic or 
mobile telephony, it rarely exceeds 100 milliseconds.
In this scenario I agree that the video buffering is not the main cause of 
latency, but it's clear that an implementor who needs to reduce the latency 
of a telephony application will NEVER want to buffer video frames just to put 
them in a single RTP packet.
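
The latency budget above can be written out as a small calculation. The 
figures are the ones quoted in this message; the way they are grouped (voice 
latency + sync delay + network giving the ~210 ms subtotal) is my reading of 
the paragraph and should be taken as an assumption, as should the function 
and key names, which are purely illustrative.

```python
def latency_budget_ms(fps=15, webcam_ms=100, voice_ms=60,
                      av_sync_ms=100, network_ms=50, jitter_ms=50):
    """Rough end-to-end delay estimate, all values in milliseconds."""
    # Holding back one extra frame before sending costs one frame period.
    frame_buffer_ms = 1000 // fps
    # Video leaves the sender after capture plus the extra buffering.
    video_sender_ms = webcam_ms + frame_buffer_ms
    # Voice is delayed on purpose so that it stays in sync with video.
    audio_sender_ms = voice_ms + av_sync_ms
    after_network_ms = audio_sender_ms + network_ms
    end_to_end_ms = after_network_ms + jitter_ms
    return {
        "frame_buffer_ms": frame_buffer_ms,
        "video_sender_ms": video_sender_ms,
        "audio_sender_ms": audio_sender_ms,
        "after_network_ms": after_network_ms,
        "end_to_end_ms": end_to_end_ms,
    }


budget = latency_budget_ms()
for name, value in budget.items():
    print(f"{name}: {value}")
```

With the default values this reproduces the 66, 210 and 260 millisecond 
figures given above.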

What, in your view, are the benefits of being able to put several theora 
frames in a single RTP packet?

>
> the draft isn't preventing you to disregard collated packets at all.
The draft doesn't prevent me from only transmitting packets that contain at 
most one frame. But if I want to comply with the draft, I MUST be able to deal 
with packets that contain several frames, mustn't I? Otherwise my application 
will not be able to decode streams from another one that uses this 
functionality.
Relying on the packetisation already provided by UDP is more natural and makes 
this work without a single line of code.

To sum up this whole discussion, I'd like this draft to:
- clarify the packed-conf message (boundary between header and tables)
- explain an SDP operation that lets each side configure itself 
asymmetrically in a simple offer-answer scheme (only 2 messages). For me this 
implies NOT transmitting the inline encoder configuration in SDP, since that 
prevents the offerer from adapting to the other end.
- use fewer bits to indicate fragmentation (for me 1 bit is enough, 2 if you 
wish to indicate both beginning of frame and end of frame). Whether this bit 
is the RTP marker or not is not important.
- assume each RTP packet contains at most one frame.
The last two points aim at much simpler depacketisation code.

Currently the depacketisation code needed to implement this draft is more 
complex than that for the RFC2429bis or RFC3016 packetisation, which I think 
is not good if we want more and more people like me, or companies, to prefer 
open-source technology over heavily patented ones.

Simon

Note: I cc'd Aymeric Moizard (jack at atosc.org), the author of libosip/eXosip 
(SIP and SDP protocol implementations). He told me he was interested in this 
discussion.