[xiph-rtp] Theora RTP payload format
acolwell at real.com
Mon Apr 18 10:02:26 PDT 2005
On Mon, Apr 18, 2005 at 12:19:52PM -0400, Steve Kann wrote:
> Aaron Colwell wrote:
> On Mon, Apr 18, 2005 at 11:30:26AM -0400, Steve Kann wrote:
> Hi, List,
> I've been working on building an implementation of a
> video-conferencing endpoint using Theora, and have been working with the
> draft-kerr-avt-theora-rtp-00 spec.
> I've also read the archives of this list, about some of the proposed
> changes. I'd like to describe here what I'm planning on doing, and see
> how this might fit into your design.
> Basically, what I'm working with is a project called "iaxclient".
> iaxclient is a library for a VoIP softphone, which presently supports
> only audio, but I am extending to support video as well. It uses the
> IAX2 protocol, which is a lightweight VoIP protocol that does _not_ use
> RTP. However, the payload format for IAX2 is generally compatible with
> the payload format for RTP. Asterisk (the open-source PBX) includes
> support for RTP-based VoIP protocols (SIP, H.323, etc), as well as
> non-RTP-based VoIP protocols (IAX2, others).
> There are basically two use cases for users making videoconferencing
> calls using the application:
> 1) Point-to-Point calls: This case seems to be pretty easy to
> handle, and fits into most of the designs I've seen so far:
> 2) Multi-party conferences: This is where some of the designs I've
> seen so far seem to work well, and some of them do not.
> The basic idea for multi-party conferences is that each user
> maintains a virtual connection to a "conference engine" (this is
> already in place for audio conferences). The conference engine
> intelligently receives audio from the clients and sends audio to the
> clients, so each client can hear the audio of any other speaking
> The idea for video is that the clients each send their video to the
> conference engine, and the conference engine will send zero or one video
> stream to each participant, in one of two "modes"
> a) Automatic mode: The conference engine will use some
> heuristics to decide whose video should be shown to the participant --
> Generally, this will be the only participant who is presently speaking
> (in the case of multiple active speakers, or zero active speakers, there
> will be some secondary criteria).
> b) Request mode: The client itself will notify the conference
> engine (perhaps out-of-band) and request to see a particular speaker's
> What this means for the video stream (and this works just fine for
> any other video format, e.g. H.26x) is that we would like to be
> able to change the video source at any time (or, at any keyframe at
> The whole setup headers business, of course, makes this design
> particularly difficult. With the present draft-kerr-avt-theora-rtp-00
> format, though, I think I could probably (with a great deal of
> unnecessary overhead), send the setup headers occasionally, and then
> switch at any time. The clients could then use "header caching", and,
> if they've seen these headers before (matching CRC32), they could use
> their cached copy, and if not, they'd just have to wait a few seconds to
> get them before they could start decoding.
> *Note: I also suspect, but I haven't researched, that if all the
> clients are using the same version of the theora encoder, and the same
> settings, that their setup headers would likely be the same; if this is
> the case, then their CRC32s would be the same, and they could start
> decoding at any keyframe.
> With the latest idea I've read, though, it makes this process much more
> inconvenient, because _each_ client would have its own 16-bit "chain
> ID", and these chain IDs would be duplicated in the streams sent by
> each client, and therefore the server would need to deeply understand
> and parse each of the streams in order to put them together, etc.
> What did you do in the case where the CRC32 was different from each of the
> clients? This is basically the same scenario isn't it?
> In that case, the "setup header ident" of the payloads would be different,
> and the receiver would know that it needs to wait until it receives setup
> headers matching the CRC of these frames before decoding. So, the only
> time we could accidentally try to decode frames using an incorrect set of
> setup headers would be in the case of a CRC collision (P ~= 0).
> If the payload format only includes a "chain ID", then the chances of two
> streams having the same "chain ID" when coming from different sources is
> pretty much P==1. So, the server that's doing the switching would need to
> actually muck with the payload in order to give each sender a different
> Chain ID, and then keep track of which was which, etc. It makes it
> impossible to just switch senders in the server without the server
> understanding the internals of the codec payload.
Ok now I see.
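
To check my understanding, here is a minimal Python sketch of the two
identification schemes. Nothing in it comes from the draft: HeaderCache,
setup_ident, and the header bytes are made up for illustration, and it
assumes each sender numbers its own chains starting from 0.

```python
import zlib


def setup_ident(setup_header: bytes) -> int:
    """CRC32 of the setup header bytes, as an unsigned 32-bit value."""
    return zlib.crc32(setup_header) & 0xFFFFFFFF


class HeaderCache:
    """Receiver-side cache of setup headers, keyed by their CRC32 ident."""

    def __init__(self):
        self._headers = {}

    def store(self, setup_header: bytes) -> int:
        ident = setup_ident(setup_header)
        self._headers[ident] = setup_header
        return ident

    def lookup(self, ident: int):
        """Return the cached header, or None if the receiver has to wait."""
        return self._headers.get(ident)


# Two senders with different (made-up) setup headers.
setup_a = b"hypothetical setup header from sender A"
setup_b = b"hypothetical setup header from sender B"

cache = HeaderCache()
ident_a = cache.store(setup_a)   # receiver has already seen A's headers
ident_b = setup_ident(setup_b)   # receiver has not seen B's headers yet

# Content-derived idents tell the two senders apart with no server help.
a_is_decodable = cache.lookup(ident_a) is not None
b_must_wait = cache.lookup(ident_b) is None

# Sender-assigned chain IDs (assumption: each sender starts at 0) collide,
# so a switching server would have to rewrite them inside the payload.
chain_id_a = 0
chain_id_b = 0
```

With the CRC32 ident the switch can forward payloads untouched. With
per-sender chain IDs it has to remap them, which is the part that forces
it to understand the payload.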
> I know nothing about IAX2, but I would assume that it has some sort of
> offer/answer model to negotiate codec parameters and such. You could easily
> put the chain ID in this negotiation so that all users in the conference use
> the same codebook.
> Presently, it's pretty simple, where it allows negotiation of the codec,
> but not codec parameters. In practice, it hasn't been necessary to do
> that. In the future, it might need to be extended to do so.
I see. I suppose you could append the codebook hash to the codec name. Instead
of just "Theora" you could have "Theora-2982394872479842". I don't know if
there are any limitations on codebook name.
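
A rough Python sketch of that naming idea. The hash function (SHA-1
truncated to 16 hex digits) and the names codec_name and same_codebook
are assumptions for illustration only; neither IAX2 nor the draft defines
them.

```python
import hashlib


def codec_name(setup_header: bytes, base: str = "Theora") -> str:
    """Build a codec name that carries a hash of the setup header.

    The hash choice and the 16-digit truncation are arbitrary assumptions.
    """
    digest = hashlib.sha1(setup_header).hexdigest()[:16]
    return f"{base}-{digest}"


def same_codebook(offered_name: str, local_name: str) -> bool:
    """Codec negotiation succeeds only if both sides name the same codebook."""
    return offered_name == local_name


# Made-up setup header bytes standing in for a real Theora setup header.
local = codec_name(b"hypothetical setup header")
offered = codec_name(b"hypothetical setup header")
other = codec_name(b"some other setup header")
```

Here same_codebook(offered, local) is true and same_codebook(other, local)
is false, so plain codec-name negotiation would reject a mismatched
codebook without any separate parameter exchange.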
> But, consider that users will join and leave the conference at arbitrary
> times, so the conference engine can't know in advance all the codebooks
> that might be used.
This isn't necessarily a problem. It would just need to know the codebooks that
it allows to be used in the conference. That's basically what happens with all
other codecs. It's just implicit in their case instead of explicit like it
would have to be for Theora.
> Also, as you allude to below, there's no way to seed an encoder with a
> particular codebook (AFAIK).
> I think that my use case isn't all that unusual though; it's somewhat
> like the properties you might have in multicasting, I think.
> 1) It would be ideal if the RTP payload format could be made independent
> of SDP.
> It is currently independent of SDP if you use inline codebook transmission. The
> info in the SDP just allows you to know ahead of time what the info and setup
> headers are going to be for each chain. It also provides a mechanism to grab
> the codebooks ahead of time. You also can save bits if you don't want
> to periodically transmit codebooks.
> I don't think so, I thought the latest proposal called for replacing the
> "setup ident" field (32 bits) with a "chain ID" field (16 bits or so),
> where the "chain ID" field would refer to a "chain-info" item in the SDP.
The chain ID really just represents a group of packets that use a particular
ident & codebook pair. You don't have to refer to the SDP. If the client
knows that there is going to be inline header and codebook transmission it
can just wait for the ident and codebook for that chain to arrive. In most
situations those would arrive before any data packets, but in your switching
situation, that might not happen.
> This would mean, that even for a RTP and SDP based conference application
> like mine, if a client joined the conference with a different codebook,
> then all the clients would need to re-fetch the SDP in order to identify
> the codebook that's needed.
No it wouldn't. It can just wait for the inline headers to arrive just like it
would in the CRC32 case.
> But, I guess, I haven't seen (maybe it hasn't been written yet), how the
> inline codebook transfer would work.
The way I envision it is that ident and codebook packets would be transmitted
just like data packets. They would have a chain ID associated with them as
well. This allows you to determine what headers go with which stream.
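
A minimal Python sketch of how a receiver might handle this. The packet
layout (a chain ID, a kind, and a payload) is hypothetical since the wire
format has not been written down yet, and queueing early data packets
rather than dropping them is just one possible choice.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Kind(Enum):
    IDENT = 1
    CODEBOOK = 2
    DATA = 3


@dataclass
class Packet:
    chain_id: int   # hypothetical field layout, not from the draft
    kind: Kind
    payload: bytes


@dataclass
class Chain:
    ident: Optional[bytes] = None
    codebook: Optional[bytes] = None
    pending: list = field(default_factory=list)

    def ready(self) -> bool:
        return self.ident is not None and self.codebook is not None


class Receiver:
    """Groups packets by chain ID and holds data until headers arrive."""

    def __init__(self):
        self.chains = {}
        self.decoded = []   # (chain_id, payload) pairs handed to the decoder

    def on_packet(self, pkt: Packet) -> None:
        chain = self.chains.setdefault(pkt.chain_id, Chain())
        if pkt.kind is Kind.IDENT:
            chain.ident = pkt.payload
        elif pkt.kind is Kind.CODEBOOK:
            chain.codebook = pkt.payload
        else:
            chain.pending.append(pkt.payload)
        # Once both headers for this chain are known, flush queued data.
        if chain.ready():
            for payload in chain.pending:
                self.decoded.append((pkt.chain_id, payload))
            chain.pending.clear()
```

In the switching case a data packet for an unknown chain simply sits in
pending until the ident and codebook packets for that chain show up.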
> In general, I don't think that my issues are unique to IAX2; nor do I
> think that they are things that can't be made to work with whatever format
> you have. But the questions are 1) how complex will these
> implementations need to be, and 2) how will they perform.
> Most video codecs have the property where a "switch" (which is basically
> what my conferencing application is), can "switch" between streams from
> different sources, at any keyframe, as long as the width, height, and
> framerates are the same (and in some cases even if they're not), without
> needing to negotiate with the receiver at all. This can make a switch
> fairly simple; it only needs to know, for each frame, whether it's a
> keyframe or not, and then treat the whole thing as opaque data.
In these cases the switch needs to know datatype specific info about the codec.
I'm assuming it cracks open the payload to determine the frame size and frame
rate. All Theora does is add codebook to the switch criteria. How are you
doing switching right now for Theora? The server would still need to keep state
for each client since the frame size & frame rate are not in the frame data.
Are you not enforcing those criteria for Theora? Is this something that is
determined at the time the client connects to the server?
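
To illustrate what I mean by adding the codebook to the switch criteria,
here is a small Python sketch. StreamParams and can_switch are
hypothetical names, and codebook_ident stands for whatever identifier
(CRC32 or otherwise) ends up naming a codebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamParams:
    """Per-client state a switch would keep (hypothetical structure)."""
    width: int
    height: int
    fps_num: int
    fps_den: int
    codebook_ident: int   # the extra criterion Theora adds


def can_switch(current: StreamParams, candidate: StreamParams,
               is_keyframe: bool) -> bool:
    """Switch only at a keyframe and only between matching streams."""
    return is_keyframe and current == candidate


a = StreamParams(320, 240, 15, 1, codebook_ident=0x1234ABCD)
b = StreamParams(320, 240, 15, 1, codebook_ident=0x1234ABCD)
c = StreamParams(320, 240, 15, 1, codebook_ident=0x0BADF00D)
```

Here can_switch(a, b, True) is true, while can_switch(a, c, True) is
false because only the codebook differs. For other codecs the last field
is simply absent from the comparison.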
> Theora has already moved away from this goal a bunch with the whole
> codebook thing, but it would be nice to at least minimize the
> inconvenience of dealing with the codebooks as much as possible.
> [the present theora rtp format exhibits this property; if you use
> periodic inline
> setup header transmission]
> 2) It would be ideal if the RTP payload format continued to allow inline
> setup header
> To my knowledge we weren't going to get rid of inline transmission. I had
> always intended to keep it.
> Would the format be the same as it is now (+- the setup header ident
> field)? Would there be some way outside of SDP to indicate which
> codebooks belonged to which "chain id?"
The ident and codebook packets would have a chain ID in them.
> It would be most convenient, if there were a "fixed setup" mode for
> theora, where you could ask the theora-encoder to use fixed setup header
> set, and have it act like other codecs in this respect. I understand
> the flexibility that the setup headers give you in encoder design, but
> it would be nice if there were a way to configure it otherwise..
> If the encoder allowed you to specify a codebook on initialization, you could
> effectively do this. Basically your app could just always specify the same
> codebook to the encoder and then send the hash to the other participants.
> They would then verify that your hash matches the hash of their codebook and
> then you're done. This is basically the codebook cache hit scenario. If you get
> a miss then you just make the connection to the conference fail.
> Right. Something like this would allow for the most bit-efficient method,
> because we could (rarely, if ever) retransmit codebooks if we can control
> all the clients, and force them to use the same codebook.
> One of the other things I'll need to do eventually is to "record" these
> conferences, into some container, and make that format
> forward-compatible; If all the clients use the same codebooks, that also
> makes things much simpler, because we could write this all out as one
Thanks for bringing this up. It's nice to have input from a different use case.
I'm fine with making changes to the current thinking, but I just want to make
sure that we have a good understanding of the problems.
The whole reason that we went from the CRC32 -> chain ID thinking was because
there was concern about collisions in this value. Unique chain IDs fix that
problem, but cause a problem for you because you want a system that doesn't
have to worry about the chainID -> codebook mapping.
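
For a sense of scale on the collision concern, here is a back-of-the-
envelope Python estimate. It assumes the CRC32 of distinct setup headers
behaves like a uniformly random 32-bit value, which is only an
approximation, and the function name is made up.

```python
import math


def crc32_collision_probability(n_codebooks: int, bits: int = 32) -> float:
    """Birthday-bound estimate that at least two of n distinct codebooks
    end up with the same ident, assuming idents are uniformly random."""
    if n_codebooks < 2:
        return 0.0
    pairs = n_codebooks * (n_codebooks - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed without losing precision.
    return -math.expm1(-pairs / 2.0 ** bits)
```

Under that assumption the chance of a collision among the handful of
codebooks in one conference is tiny, but it grows roughly with the square
of the number of distinct codebooks a receiver ever caches.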
What isn't clear to me is whether your problem space actually needs to allow
arbitrary codebooks. It seems to me that allowing this causes more headaches
than it's worth, since you could potentially waste a ton of bits on codebook
transmission if every client in the conference uses a slightly different