[xiph-rtp] Chaining

Aaron Colwell acolwell at real.com
Sun Aug 28 18:31:38 PDT 2005


Hi Luca,

Welcome to the discussion and thank you for taking on this work.  I would have
liked to do this myself, but I couldn't guarantee getting it done in a timely
manner. Thank you for stepping up to the challenge.

I've spent the weekend thinking over the need for chaining in the RTP spec. I
may be leaning towards dropping it as well. First, I'll outline the use cases 
that Real has typically used RTP for. Then I'll describe the Helix rate
adaptation since Ralph mentioned it in another email. Finally I'll outline
why we may not want to do chaining after all.

Use cases:
- On-demand playback of static files.

  * Files may be chained. The solution should provide a way to play back any
    valid .ogg file.

  * Since many clients can access the same file at different times, transcoding
    to a common set of parameters kills server scalability.

- Live broadcast from a camera and/or mic.
  * Usually the encoding parameters are static so chaining support is not
    needed. 

- Simulated Live broadcast. This is where you take a playlist of static files
  and broadcast them as a "live" stream.

  * On the output side you don't necessarily need to support chaining if
    you transcode on the input side. Transcoding might be an acceptable option
    since you only have to do it once independent of the number of clients
    connected to the stream.

- Forward channel only broadcast. This is basically a live or simulated live
  feed that has no backchannel. This means that HTTP requests for codebooks
  are not allowed. The main example of this would be satellite distribution.


Comments on rate adaptation:

I didn't read over all my historical comments about this topic before writing
this so please don't flame me if I contradict myself. This represents my
current thinking.

Here is how the Real/Helix system currently does rate adaptation. When encoding
content you select a set of bitrates for the audio and video. The encoder then
creates independent streams for each of the bitrates. A multi-rate A/V file
has 2 logical streams, one audio and one video. Each of those logical streams
has a set of physical streams, one for each bitrate. Each of the physical
streams is assigned a rule number. (Technically each physical stream has 2
rules, one keyframe and one non-keyframe, but that isn't overly important
for this discussion.) Each logical stream has a "rule book" that tells the
media engine
what rules to select when different connection bitrates are detected. The
rule book contains a set of expressions for each rule. These expressions are
evaluated periodically during playback and control which physical stream is 
sent. The Real/Helix adaptation mechanism was originally designed for our
proprietary RDT transport protocol, but we adapted it to RTP when we started
doing multicast transmission. Since we can switch physical streams at any point
during playback we needed a way to identify what physical stream the packet
data is associated with. In RDT we have a ruleID field for each packet. When
we stream over RTP we add an RTP header extension that contains this ruleID.
On the client side we use this ruleID to demultiplex the different bitstreams
and handle the stream switches. Historically most of our rate adaptation has
been client driven. The client would monitor the connection throughput and
send subscription change messages to change the physical streams. This allowed
the client to know the proper time to cross-fade between physical streams.
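
To make the structure a little more concrete, here's a rough C sketch of the
relationships described above. This is not actual Helix code; every name in
it is made up for illustration:

    #include <stddef.h>

    /* Hypothetical sketch only -- all names invented. */
    typedef struct {
        unsigned rule_id;   /* rule number assigned to this stream    */
        unsigned bitrate;   /* target bitrate, bits per second        */
        /* ... codec setup, equivalent to a Xiph ident header ...     */
    } PhysicalStream;

    typedef struct {
        unsigned    rule_id;
        const char *expression; /* e.g. "bandwidth >= 64000", evaluated
                                   periodically during playback       */
    } RuleBookEntry;

    typedef struct {
        PhysicalStream *streams;   /* one per encoded bitrate         */
        size_t          num_streams;
        RuleBookEntry  *rule_book; /* connection bitrate -> rule      */
        size_t          num_rules;
    } LogicalStream;

    /* A multi-rate A/V file: one logical audio stream and one logical
       video stream, each fanned out into several physical streams.  */
    typedef struct {
        LogicalStream audio;
        LogicalStream video;
    } MultiRateFile;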

Whew... did you get all that? Here are a few things to note about the physical
streams.

-  All codec configuration data for each physical stream is sent in the SDP.
   It only takes up a few hundred bytes at most, and usually contains the
   equivalent of the ident headers for the Xiph codecs. The codebooks
   are fixed.

- In the case of audio different codecs may be used. For low bitrates a
  voice codec may be used and a music codec could be used for higher bitrates.

- In the case of video the codec is the same across bitrates. Frame size is
  constant, but frame rate isn't necessarily.

- Codecs for all physical streams are initialized when the client receives
  the SDP, so they are always ready when data arrives.

Even though we have a system that works, we aren't doing a few things the
"RTP way". This was done way before I was involved with the Helix code so
please don't flame the messenger. :) I'm mentioning these because I think
they are examples of things we shouldn't do for the Xiph RTP specs.

- We interleave different codecs into the same RTP session. This is a fuzzy 
  area of the RTP spec. Supposedly you can have multiple payload types in a
  single session, but I've never gotten a clear answer out of the IETF about
  what that is supposed to mean.

- We don't actually use the sample rate of the media data for the RTP
  timestamps. All our RA/RV streams over RTP use a 1000Hz clock. The only
  magic about this rate is that the core keeps track of time in milliseconds.
  Technically we should be using the audio sample rate for audio. For video we
  should probably be using 90000Hz like all the other video payload formats.
  (There's a small sketch of the difference after this list.)

- We use an RTP header extension to transport our ruleID. Sure the extension
  is part of the spec, but I don't know of any other payload that uses it. We
  should have made it part of the payload.

- We use a non-standard SET_PARAMETER RTSP request to control physical stream
  selection.
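
To show what I mean about the timestamp clocks, here's a minimal sketch (my
own illustration, not Helix code) of generating an RTP timestamp from our
millisecond core clock versus from a proper media clock:

    #include <stdint.h>

    /* What we do today: a 1000Hz clock, so the RTP timestamp is just
       the core's millisecond time (wrapping mod 2^32).              */
    static uint32_t rtp_ts_ms(uint64_t core_time_ms)
    {
        return (uint32_t)core_time_ms;
    }

    /* What the RTP specs expect: timestamps in units of the media
       clock, e.g. the audio sample rate, or 90000Hz for video. The
       64-bit multiply avoids overflow before the final wrap.        */
    static uint32_t rtp_ts_media(uint64_t core_time_ms, uint32_t clock_rate)
    {
        return (uint32_t)((core_time_ms * clock_rate) / 1000);
    }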

Why we might want to revisit the chaining question:

I've been thinking about this quite a bit this weekend. I've also been thinking
about all the complexity we've talked about just to support chained files.
Is it really necessary? I'm not so sure anymore. 

One of my main arguments in the past is that I wanted an on-demand server to
be able to deliver any valid .ogg file over RTP. I focused mainly on trying to
cram the chained streams into RTP sessions. I think all the complexity that
I/we created was a sign that we were trying to fit a round peg into a square
hole. What if a request for a chained file returned a playlist of URLs that
represented the various chain segments? The player could then take this
playlist and request the URLs for each segment. I'm pretty sure most players
out there can handle playlists properly. It also fixes several problems
associated with chained files. It makes it MUCH easier for the player to deal
with files that have a different number of streams in each chain. My Helix
plugins handle this case, but it means having to figure out the max number of
audio and video streams across the whole file, create RTP sessions for that
worst case, and then dynamically map streams to RTP sessions during playback.
It's a pain. If you could just expose the chains as separate URLs, then you
could have a relatively simple implementation on the client and server and
leverage the player's existing playlist functionality.
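
For example, a request for a chained file might return something like this
(the URL scheme here is completely hypothetical, just to illustrate the
idea):

    rtsp://server.example.com/concert.ogg?chain=0
    rtsp://server.example.com/concert.ogg?chain=1
    rtsp://server.example.com/concert.ogg?chain=2

Each URL would map to one chain segment with a fixed set of streams and
codebooks, so every segment could be described completely by its own SDP.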

Another argument I had for chaining was for supporting the simulated live
case. Since you could have chained files or files with different codebooks,
I believed that this required chained file support. I'm starting to believe that
perhaps transcoding is a better solution here. You don't have to worry about
scalability as much since you only have to do 1 transcode for each simulated
live stream, not 1 per listener. If you were going to have a ton of simulated
live streams then perhaps it makes more sense to unify your content to use a 
single codebook. It doesn't seem fair that the transport layer should have to
shoulder the burden of content author laziness.

I think at some point I had a rate adaptation argument too. I'm a little more
familiar with Theora so I'll start with that. With the current codebooks that
ship with the encoder you could effectively do rate adaptation without the
need to change codebooks. You can basically do one of 2 things. You can drop
frames so that you have a lower frame rate. The client would be able to figure
out what is happening by seeing that there aren't any lost packets and the 
timestamps are farther apart. You can also just lower the Q being used. This
increases the quantization, which will lower the bitrate. I believe derf_
mentioned these facts before, but I don't remember for sure. I don't really
know how the Vorbis code works, so I don't know if a similar mechanism can
be exploited. I admit that these mechanisms may not provide optimal rate
control, but I think they provide something reasonable for now.
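
For the frame-dropping case, the client-side detection could be as simple as
the following sketch. It assumes one frame per packet, which won't hold in
general, but it shows the idea:

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if the sender appears to be dropping frames: the
       sequence numbers are contiguous (so nothing was lost in
       transit) but the timestamp gap exceeds one nominal frame
       interval. Assumes one frame per packet for simplicity.        */
    static bool sender_dropping_frames(uint16_t prev_seq, uint16_t seq,
                                       uint32_t prev_ts, uint32_t ts,
                                       uint32_t ticks_per_frame)
    {
        bool contiguous = (uint16_t)(seq - prev_seq) == 1; /* mod 2^16 */
        uint32_t ts_delta = ts - prev_ts;                  /* mod 2^32 */
        return contiguous && ts_delta > ticks_per_frame;
    }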

If we allow chaining we will also have to solve the problem where the
sample-rate / frame-rate of one of the chains is not an even multiple of the
RTP timestamp sample rate. This can lead to all sorts of rounding headaches.
We have tons of code in Helix dealing with this. It isn't fun.
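
A quick illustration of the kind of headache I mean, assuming a chain of
44100Hz audio carried on a 90000Hz RTP clock (90000/44100 is not an
integer):

    #include <stdint.h>

    /* Naive per-packet conversion: each packet's duration is rounded
       separately and accumulated, so the rounding error accumulates
       without bound (about 0.8 ticks per 1024-sample packet here).  */
    static uint32_t ts_naive(uint32_t acc, uint32_t samples_in_packet)
    {
        return acc + (uint32_t)(((uint64_t)samples_in_packet * 90000) / 44100);
    }

    /* Drift-free conversion: keep a running sample count and convert
       the total, so the error never exceeds one tick.               */
    static uint32_t ts_total(uint64_t total_samples)
    {
        return (uint32_t)((total_samples * 90000) / 44100);
    }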


I realize this is almost a complete 180 for me. I'm sorry if I was the sole
cause of this delay. I also apologize to anyone who pointed this stuff out
when I didn't see its truth at the time. I would be fine with pursuing a
non-chaining RTP spec. Here are the only things that I would suggest we
ensure are in the spec.

- Allow inline transmission of the info header. This is to allow TAC changes
  in a live/simulated live scenario.

- Allow inline transmission of the ident and codebook headers. This is mainly
  to support forward link only scenarios.

- Allow for a "chainID" field. Basically I'd like a bit that signals the
  presence of the field. If the bit is set a chainID field will be present.
  I'm fine with 16 or 24 bits for this field. The main idea here is to allow
  for chaining support to be added later. If you don't want to have the field
  in there then just make sure that there is at least 1 bit that is reserved
  for this purpose. If you do decide to have the field then there should be
  text that says: "If the chainID field is present then it must always be 0
  to comply with this spec."
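
To sketch what I mean, here's how a receiver might parse such a field. The
bit positions and header layout here are completely hypothetical; the real
ones would be up to the spec:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical payload header: if the C bit is set in the first
       byte, a 16-bit chainID follows. Per the suggested text, a
       present chainID must be 0; nonzero values would be reserved
       for a future chaining extension.                              */
    #define CHAIN_ID_PRESENT 0x80

    /* Returns the chainID (0 if absent) and sets *payload_off to the
       offset where the codec payload starts.                        */
    static uint16_t parse_chain_id(const uint8_t *hdr, size_t *payload_off)
    {
        if (hdr[0] & CHAIN_ID_PRESENT) {
            *payload_off = 3;
            return (uint16_t)((hdr[1] << 8) | hdr[2]);
        }
        *payload_off = 1;
        return 0;
    }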

Sorry for the marathon email. I just wanted to get all my current thinking out there.

Aaron


