[xiph-rtp] Theora RTP payload format
stevek at stevek.com
Mon Apr 18 08:30:26 PDT 2005
I've been working on building an implementation of a
video-conferencing endpoint using Theora, and have been working with the
I've also read the archives of this list, about some of the proposed
changes. I'd like to describe here what I'm planning on doing, and see
how this might fit into your design.
Basically, what I'm working with is a project called "iaxclient".
iaxclient is a library for a VoIP softphone, which presently supports
only audio, but which I am extending to support video as well. It uses the
IAX2 protocol, which is a lightweight VoIP protocol that does _not_ use
RTP. However, the payload format for IAX2 is generally compatible with
the payload format for RTP. Asterisk (the open-source PBX) includes
support for RTP-based VoIP protocols (SIP, H.323, etc), as well as
non-RTP-based VoIP protocols (IAX2, others).
There are basically two use cases for users making videoconferencing
calls using the application:
1) Point-to-Point calls: This case seems to be pretty easy to
handle, and fits into most of the designs I've seen so far:
2) Multi-party conferences: This is where some of the designs I've
seen so far seem to work well, and some of them do not.
The basic idea for multi-party conferences is that each user
maintains a virtual connection to a "conference engine" (this is
already in place for audio conferences). The conference engine
intelligently receives audio from the clients and sends audio to the
clients, so each client can hear the audio of any other speaking
The idea for video is that the clients each send their video to the
conference engine, and the conference engine will send zero or one video
stream to each participant, in one of two "modes":
a) Automatic mode: The conference engine will use some
heuristics to decide whose video should be shown to the participant.
Generally, this will be the only participant who is presently speaking
(in the case of multiple active speakers, or zero active speakers, there
will be some secondary criteria).
b) Request mode: The client itself will notify the conference
engine (perhaps out-of-band) and request to see a particular speaker's
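
As a rough illustration of the two modes, here is a minimal Python
sketch of how a conference engine might pick the zero-or-one video
stream for a given viewer. Everything in it is hypothetical: the
function name, the participant identifiers, and in particular the
"keep showing whoever was shown last" fallback, which merely stands in
for the unspecified secondary criteria.

```python
from typing import Iterable, Optional


def pick_video_source(
    viewer: str,
    participants: Iterable[str],
    speaking: Iterable[str],
    requested: Optional[str] = None,
    last_shown: Optional[str] = None,
) -> Optional[str]:
    """Return the participant whose video `viewer` should get, or None."""
    others = [p for p in participants if p != viewer]
    speaking_set = set(speaking)

    # Request mode: honour an explicit request if that participant is present.
    if requested is not None:
        return requested if requested in others else None

    # Automatic mode: prefer the sole active speaker (never the viewer).
    active = [p for p in others if p in speaking_set]
    if len(active) == 1:
        return active[0]

    # Zero or several active speakers. Hypothetical secondary criterion:
    # keep whoever was shown last if still present, otherwise send nothing.
    return last_shown if last_shown in others else None
```

For example, `pick_video_source("alice", ["alice", "bob", "carol"],
["bob"])` selects "bob", while passing `requested="carol"` overrides
the automatic choice.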
What this means for the video stream (and this works just fine for
any other video format, e.g. H.26x) is that we would like to be
able to change the video source at any time (or, at any keyframe at
The whole setup headers business, of course, makes this design
particularly difficult. With the present draft-kerr-avt-theora-rtp-00
format, though, I think I could probably (with a great deal of
unnecessary overhead) send the setup headers occasionally, and then
switch at any time. The clients could then use "header caching", and,
if they've seen these headers before (matching CRC32), they could use
their cached copy, and if not, they'd just have to wait a few seconds to
get them before they could start decoding.
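
The "header caching" idea can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the class and
method names are made up, and the CRC32 is computed with zlib.crc32
over the raw header bytes, which may not match how the draft defines
the checksum.

```python
import zlib
from typing import Dict, Optional


class SetupHeaderCache:
    """Hypothetical client-side cache of setup headers, keyed by CRC32."""

    def __init__(self) -> None:
        self._headers: Dict[int, bytes] = {}

    def store(self, setup_header: bytes) -> int:
        # Assumption: CRC32 over the raw header bytes, as an unsigned value.
        crc = zlib.crc32(setup_header) & 0xFFFFFFFF
        self._headers[crc] = setup_header
        return crc

    def lookup(self, crc: int) -> Optional[bytes]:
        # A hit means decoding can begin at the next keyframe; a miss
        # means waiting for the headers to be retransmitted in-band.
        return self._headers.get(crc)
```

When the conference engine switches sources, the client would call
lookup() with the CRC32 carried in the stream: on a hit it can start
decoding at the next keyframe, and on a miss it waits for the next
in-band copy of the headers and then calls store().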
*Note: I also suspect, but haven't researched, that if all the
clients are using the same version of the Theora encoder and the same
settings, their setup headers would likely be the same. If this is
the case, then their CRC32s would be the same, and they could start
decoding at any keyframe.
The latest idea I've read, though, makes this process much more
inconvenient, because _each_ client would have its own 16-bit "chain
ID", and these chain IDs would be duplicated across the streams sent by
the different clients. The server would therefore need to deeply
understand and parse each of the streams in order to put them together, etc.
I think that my use case isn't all that unusual, though; it's somewhat
like the properties you might have in multicasting.
1) It would be ideal if the RTP payload format could be made independent
[the present Theora RTP format exhibits this property, if you use
setup header transmission]
2) It would be ideal if the RTP payload format continued to allow inline
It would be most convenient if there were a "fixed setup" mode for
Theora, where you could ask the Theora encoder to use a fixed setup
header set, and have it act like other codecs in this respect. I understand
the flexibility that the setup headers give you in encoder design, but
it would be nice if there were a way to configure it otherwise.