[foms] Proposal: adaptive streaming using open codecs

Pierre-Yves KEREMBELLEC pierre-yves.kerembellec at dailymotion.com
Tue Nov 16 10:29:21 PST 2010


>> That might be what some advocate, but what I would advocate is having just one file for each bitrate of video and a separate one for each bitrate or language of audio etc. and then provide the clients with an index into each file so they can make byte range requests for the pieces they need from each.
>> 
>> There does exist, in several CDNs, a simple server extension which enables a byte range to be embedded in a URL, instead of in the Range header, and we do use this with Apple clients for our service to avoid the "millions of files" problem. But this is just a different way of communicating the byte range to the server which happened already to exist and be useful as a workaround and which is very much an application-independent capability: what I would suggest we avoid is any video-specific server extensions, where servers are expected to understand the format of the video and audio files, re-multiplex them etc.
> 
> So it seems there's a general consensus on splitting up audio and video into separate streams? Who's really against it, for which reasons? This has big implications for the dummy Stream.appendChunk() call we were brainstorming about.
> Just appending chunks wouldn't work anymore; we'd basically have to create tracks and append chunks to tracks...
> 
> I'm also a little lost on how the files on the server would be structured. Would there be audio-only and video-only "plain" WebM files, or do we need to go to a "chained" format (range requests) or a "chunked" format (separate files)? In both
> latter cases, we'd lose adaptive streaming support for current WebM files...

I think both Mark and Frank are totally right about separating the different tracks _before_ sending them to the client, in order to minimize
the number of different combinations sent on the wire and maximize Internet/browser cache efficiency. If you think about it, this pattern has
been around for years (if not decades) with RTSP, where audio and video are delivered separately as RTP streams over UDP (they may also be
delivered interleaved within the TCP RTSP connection itself to work around firewall problems, but that's another story).

Whether tracks are stored as separate files or extracted from an interleaved (muxed) file using a server-side extension is outside the scope
of this discussion IMHO, and only pertains to server-side performance.

That said, I think the best compromise would be for browsers to accept both interleaved and elementary streams: when there's only one
video track and one audio track (which probably covers 99% of the video content out there), it makes sense not to add complexity for
publishers by asking them to demux their videos (some probably don't know exactly what video internals are about anyway).

That's why I would propose the following:

- either interleaved or elementary streams sent from the server side

- multiple versions/qualities for the same content advertised using a "main" manifest file (not tied to the HTML markup because we
  probably want this to work outside browsers)

- multiple audio/video/<you-name-it> tracks for the same content also advertised in this main manifest file

- main manifest files may refer to streams directly, or to "playlist" manifest files (in case the publisher deliberately chooses to use fragments)

- playlist manifest files list all fragments with precise timestamps and durations (i.e. not "a-la-Apple-M3U8"); see the second
  sketch after this list

- JSON for all manifest files (as it's easy to parse on any platform and readily extensible)

- all interleaved and elementary streams - chunked or not - are independently playable (and, for video, start with a RAP, i.e. a random access point)

- the chosen container is easily "streamable", and requires minimal header overhead to carry codec initialization info

- RAPs may not be aligned between versions (or at least alignment shouldn't be a strong requirement, even if in practice it's often
   the case), so glitch-free stream switching for the end user would depend on the renderer and on double-decoding/buffering
   pipeline capabilities
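
To make this a bit more concrete, here is a rough sketch of what a main manifest could look like (field names and values are purely
illustrative, nothing here is meant as a final format): it advertises several video qualities and audio languages for the same content,
each entry pointing either directly to an elementary stream or to a playlist manifest:

{
    "content":  "some-content-id",
    "duration": 596.5,
    "video": [
        { "id": "video-480p", "codec": "vp8", "bitrate":  700000, "width":  854, "height": 480, "url": "video-480p.webm" },
        { "id": "video-720p", "codec": "vp8", "bitrate": 2000000, "width": 1280, "height": 720, "playlist": "video-720p.json" }
    ],
    "audio": [
        { "id": "audio-en", "codec": "vorbis", "bitrate": 128000, "language": "en", "url": "audio-en.webm" },
        { "id": "audio-fr", "codec": "vorbis", "bitrate": 128000, "language": "fr", "url": "audio-fr.webm" }
    ]
}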
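
A playlist manifest for one of those tracks could then simply list every fragment with an explicit start timestamp and duration (in
seconds here; again, only a sketch, not a definitive format), so that a client can seek and switch without probing the fragments
themselves:

{
    "track": "video-720p",
    "fragments": [
        { "start":  0.0, "duration": 10.0, "url": "video-720p-00000.webm" },
        { "start": 10.0, "duration": 10.0, "url": "video-720p-00001.webm" },
        { "start": 20.0, "duration":  9.6, "url": "video-720p-00002.webm" }
    ]
}

Fragments could just as well be expressed as byte ranges into a single file instead of separate URLs, which ties back to Mark's point
about avoiding the "millions of files" problem.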

Thoughts?
Pierre-Yves


