[xiph-rtp] about theora-over-rtp draft

Thu Jul 20 14:33:51 PDT 2006

Hello,

I tried to implement the RTP payload packetisation for theora defined in the 
draft http://svn.xiph.org/trunk/theora/doc/draft-barbato-avt-rtp-theora-01.txt
(the most recent I've found).

I'm the author and maintainer of linphone, a free software SIP video phone 
(http://www.linphone.org). I was the first to implement speex over RTP, 
and I contributed a little to the speex-over-rtp draft with Jean Marc 
Valin and Greg Herlein (especially concerning the SDP usage specification).

While implementing theora support in linphone, I encountered several major 
problems:

1/ About the packed configuration header. This packed configuration header is 
supposed to be the theora header followed immediately by the theora tables. 
Unfortunately the current theora decoder is unable to decode such a packed 
configuration (it stops after the header and ignores the tables), and as far 
as I understand there's no way to find where the theora tables start when 
receiving such a packet.

-> as a consequence I've implemented it differently: the theora header and 
tables are sent in separate packets.

2/ About the fragment type. The draft defines 3 types: beginning of packet, 
continuation of packet, and end of packet. I think this is largely redundant 
information: the receiver only needs to know the boundary between video 
frames, nothing more. Setting the marker bit of the RTP header to 1 for the 
last packet of a video frame is enough and much simpler. RTP (RFC3550) says 
it's up to payload specifications to define the meaning of this marker bit, 
so there's no problem in using it. RFC2429-bis (payload spec for H263-1998) 
does that.
Furthermore, in the fragmentation algorithm it is painful to have to know 
whether a fragment is an end-of-packet or a continuation fragment. And what 
if a packet isn't fragmented at all, i.e. it is both the start and the end of 
a video frame?
Note that the sequence number of the RTP header lets the application detect 
incomplete frames.

-> I used the marker bit to indicate the last packet of a video frame.
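
To illustrate the receiver side of this scheme, here is a minimal sketch of 
frame-boundary detection using only the marker bit and the RTP sequence 
number. It is not taken from linphone; the names frame_tracker, frame_status 
and frame_tracker_push are hypothetical, and it assumes packets are handed 
over in arrival order with no reordering buffer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receiver state: tracks sequence-number continuity. */
typedef struct {
    bool have_seq;     /* false until the first packet has been seen */
    uint16_t last_seq; /* sequence number of the previous packet */
    bool damaged;      /* a gap was seen since the last frame boundary */
} frame_tracker;

typedef enum { FRAME_PENDING, FRAME_COMPLETE, FRAME_INCOMPLETE } frame_status;

/* Feed one RTP packet (sequence number and marker bit) to the tracker.
 * Returns FRAME_PENDING until a packet with the marker bit arrives, then
 * reports whether every packet since the previous boundary was received. */
static frame_status frame_tracker_push(frame_tracker *t, uint16_t seq,
                                       bool marker)
{
    /* The cast makes the comparison wrap correctly from 65535 to 0. */
    if (t->have_seq && (uint16_t)(t->last_seq + 1) != seq)
        t->damaged = true;
    t->have_seq = true;
    t->last_seq = seq;

    if (!marker)
        return FRAME_PENDING;

    frame_status status = t->damaged ? FRAME_INCOMPLETE : FRAME_COMPLETE;
    t->damaged = false; /* the next packet starts a new frame */
    return status;
}
```

If the lost packet was itself a marker packet, two frames get merged and the 
result is reported as incomplete, which errs on the safe side.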

3/ I used inband sending of configuration headers. The inline SDP method has a 
big problem for me: it forces the SDP offerer to configure its theora encoder 
before even knowing the bandwidth constraints of the remote side 
(expressed using the b=AS: field of SDP). 
The logical behaviour for me would be that each side expresses its receiving 
preferences (using SDP and possibly an a=fmtp line), for example

b=AS:64 
a=fmtp:99 QCIF=2 
(meaning:
limit to 64 kbit/s;
this device can only display QCIF pictures at a frame rate of 29.97/2, 
about 15 frames per second, as in RFC2429-bis)

Thus, by taking all those preferences into account, each theora encoder can 
be configured efficiently to fit the bandwidth requirements and the display 
constraints of the remote side. The theora packed configuration packets can 
then be sent inband (the method that I prefer), or through an alternate 
method (http, RTCP packet?), but ONLY AFTER the SDP messages have been 
exchanged.
For me it is very important to use bandwidth indications efficiently because, 
for example, with usual DSL connections the bandwidth is sometimes limited to 
128kbit/s (very often so in the upload direction). Doing CIF at 30 fps with 
high quality coding is not possible in this situation. I found the theora 
codec to be really efficient (CIF at 7 fps works with such DSL modems). But 
the prerequisite for this to work is that the phone be able to configure its 
theora encoder after receiving the SDP message from the remote side.
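
As an illustration of reading such receiving preferences, here is a minimal 
sketch that extracts the value of a picture-size parameter such as QCIF=2 
from the parameter part of an a=fmtp line and converts it to a frame rate, 
following the 29.97/n convention used in the example above. The function 
names fmtp_get_mpi and mpi_to_fps are hypothetical, and the sketch assumes 
parameters are separated by spaces or semicolons.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns the integer value attached to `size` (e.g. "QCIF") in the
 * fmtp parameter string, or 0 if the parameter is absent or invalid.
 * The separator check prevents "QCIF" from matching inside "SQCIF". */
static int fmtp_get_mpi(const char *fmtp, const char *size)
{
    size_t n = strlen(size);
    const char *p = fmtp;

    while ((p = strstr(p, size)) != NULL) {
        bool at_start = (p == fmtp) || p[-1] == ' ' || p[-1] == ';';
        if (at_start && p[n] == '=') {
            int mpi = 0;
            if (sscanf(p + n + 1, "%d", &mpi) == 1 && mpi > 0)
                return mpi;
            return 0;
        }
        p += n; /* keep searching after this partial match */
    }
    return 0;
}

/* Maximum frame rate implied by the value: 29.97 / mpi. */
static double mpi_to_fps(int mpi)
{
    return mpi > 0 ? 29.97 / mpi : 0.0;
}
```

With "QCIF=2" this gives 29.97 / 2 = 14.985 frames per second, which an 
encoder could use together with the b=AS value as an upper bound.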

Finally, the format I've used in my implementation (see 
mediastreamer2/src/theora.c in linphone) can be summed up like this:
- use the marker bit to indicate the last packet of a video frame
- use a payload header like this:
| 24 bits of config ident | 5 unused bits | 3 bits of packet type |
| theora data.................................|

The 3 bits of packet type can be:
#define THEORA_RAW_DATA	0
#define THEORA_HEADER_DATA 1
#define THEORA_COMMENT_DATA 2
#define THEORA_TABLES_DATA 3

I don't use the comment data.
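
A minimal sketch of writing and reading this 4-byte payload header follows. 
It is not copied from theora.c: the function names are hypothetical, and the 
byte layout (config ident in network byte order in the first three bytes, 
packet type in the low 3 bits of the fourth byte, unused bits set to zero) is 
an assumption based on the diagram above.

```c
#include <stdint.h>

#define THEORA_RAW_DATA     0
#define THEORA_HEADER_DATA  1
#define THEORA_COMMENT_DATA 2
#define THEORA_TABLES_DATA  3

#define THEORA_PAYLOAD_HEADER_SIZE 4

/* Write the header: 24 bits of config ident, 5 unused bits (zero),
 * 3 bits of packet type. Assumed layout, see the note above. */
static void theora_payload_header_write(uint8_t out[4], uint32_t ident,
                                        unsigned type)
{
    out[0] = (uint8_t)((ident >> 16) & 0xff);
    out[1] = (uint8_t)((ident >> 8) & 0xff);
    out[2] = (uint8_t)(ident & 0xff);
    out[3] = (uint8_t)(type & 0x07); /* upper 5 bits stay unused */
}

/* Read the header back; the unused bits are ignored. */
static void theora_payload_header_read(const uint8_t in[4], uint32_t *ident,
                                       unsigned *type)
{
    *ident = ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
    *type = in[3] & 0x07u;
}
```

A receiver would read these four bytes first, then hand the rest of the 
payload to the decoder or to the configuration handling according to the 
packet type.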

At the start of the session, the theora header is sent, then the theora 
tables (which are fragmented since they are quite big). Those packets are 
sent 3 times to improve reliability in case of packet loss. Note: there are 
surely better approaches to improve reliability.
Then theora data is sent normally (THEORA_RAW_DATA).
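
The fragmentation itself can be sketched as below: a frame (or the tables) is 
cut into chunks of at most max_payload bytes and only the last chunk carries 
the marker bit, so an unfragmented frame is simply a single chunk with the 
marker set. The fragment struct and fragment_next are hypothetical names, and 
max_payload is assumed to already exclude the RTP and payload header sizes.

```c
#include <stdbool.h>
#include <stddef.h>

/* One fragment of a frame, described by its position in the frame. */
typedef struct {
    size_t offset; /* start of this fragment within the frame */
    size_t length; /* number of bytes in this fragment */
    bool marker;   /* true only for the last fragment of the frame */
} fragment;

/* Produce the next fragment of a frame of frame_len bytes.
 * *pos must be 0 on the first call and is advanced on each call.
 * Returns false once the whole frame has been consumed. */
static bool fragment_next(size_t frame_len, size_t max_payload, size_t *pos,
                          fragment *out)
{
    if (max_payload == 0 || *pos >= frame_len)
        return false;

    size_t remaining = frame_len - *pos;
    out->offset = *pos;
    out->length = remaining < max_payload ? remaining : max_payload;
    out->marker = (out->length == remaining);
    *pos += out->length;
    return true;
}
```

A sender would loop on fragment_next, prepend the payload header to each 
chunk, and copy the marker flag into the RTP header of that packet.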

Finally, I would expect this draft to explain how to split a big theora 
frame into several MTU-sized packets in a way that keeps a partially 
received frame usable by the decoder. In other words, how to be as safe as 
possible in case of packet loss. But I don't know whether this is possible; 
I don't know enough about the internals of theora.

That's all for my comments. I just want to try to keep the world as simple as 
possible and bring my experience as a developer as well as a user of 
video telephony.
Although I've referred to RFC2429-bis (H263-1998), I don't consider that 
document an example to follow; I'm sure we can do better.
I don't want linphone to be a non-standard video phone, so I would really 
like it to implement the draft you are working on. However, I would really 
like this future RFC to be as clear and simple as possible. I'm really 
fed up with the obscure RFCs that sometimes come out of the IETF (e.g. 
rfc2190, amr over rtp, mpeg4 over rtp...).
I think that with a good RFC, theora would be really superior to MPEG4 in the 
real-time streaming world.

Thanks a lot for reading this; I look forward to your feedback.
Also, I'd like to thank Mr. Barbato for all the work he has already done on 
this draft.

Simon