[foms] Proposal: adaptive streaming using open codecs

Tue Dec 14 03:44:49 PST 2010

Hello,

We've had various discussions (like the one below) on the data model to use for adaptive streaming. We also talked about adaptive streaming needing to work without a manifest in HTML5. Now I think everybody would agree the data format for an adaptive stream is quite straightforward:

[manifest]
    [track]
        [stream]
            [fragment]

The manifest is the entire presentation; the track represents a single layer (video, audio, image, text); the stream represents a single rendition of the layer (at 200 kbps or 1500 kbps); the fragment represents a single access unit (stream GOP, text Cue). Correct? 
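
To make the nesting concrete, here is a minimal sketch of that hierarchy as Python dataclasses. The field names, the 2-second fragment durations and the 128 kbps audio bitrate are illustrative assumptions, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fragment:
    """A single access unit, e.g. one GOP or one text cue."""
    start: float     # seconds from the start of the presentation (assumed unit)
    duration: float  # seconds (assumed unit)


@dataclass
class Stream:
    """One rendition of a track at a given bitrate."""
    src: str
    bitrate: int     # kbps
    fragments: List[Fragment] = field(default_factory=list)


@dataclass
class Track:
    """One layer of the presentation: video, audio, image or text."""
    kind: str
    streams: List[Stream] = field(default_factory=list)


@dataclass
class Manifest:
    """The entire presentation."""
    tracks: List[Track] = field(default_factory=list)


# Hypothetical example data: two video renditions with aligned fragments.
video = Track("video", [
    Stream("video-300.webm", 300, [Fragment(0.0, 2.0), Fragment(2.0, 2.0)]),
    Stream("video-900.webm", 900, [Fragment(0.0, 2.0), Fragment(2.0, 2.0)]),
])
manifest = Manifest([video, Track("audio", [Stream("audio.webm", 128)])])
```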

Representing this info in HTML5 is a bit difficult, since there's already the <source> tag (which is basically an audio+video [track]). On top of that, the <track> tag might get hijacked to only be used for text data (captions, cuepoints and such). Instead, we should look at having the <track> tag represent any type of track in a presentation (not just text). Example:

<video width="640" height="360" controls>
    <track kind="video" type="video/webm">
        <stream src="video-300.webm" width="240" bitrate="300">
        <stream src="video-900.webm" width="640" bitrate="900">
    </track>
    <track kind="audio" src="audio.webm" type="audio/webm" lang="en">
    <track kind="audio" src="french.webm" type="audio/webm" lang="fr">
    <track kind="video" type="video/mp4">
        <stream src="video-300.mp4" width="240" bitrate="300">
        <stream src="video-900.mp4" width="640" bitrate="900">
    </track>
    <track kind="audio" src="audio.m4a" type="audio/mp4" lang="en">
    <track kind="audio" src="french.m4a" type="audio/mp4" lang="fr">
    <track kind="captions" lang="en" src="captions-en.srt">
    <track kind="captions" lang="fr" src="captions-fr.srt">
    <!-- Fallbacks (no captions, no adaptive, no french audio) -->
    <source type="video/mp4" src="video.mp4">
    <source type="video/ogg" src="video.ogg">
</video>

Is this totally outrageous, or could this work (given the state of <track>)? One could imagine the current cuepoints API being extended to also work for media fragments (cuepoints would represent HTTP chunks / GOP boundaries / switching points).
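
As a rough illustration of how a client might read such markup, here is a sketch using Python's standard html.parser. It assumes that a <track> carrying a src attribute is a single-stream track with no children, and that a <track> without src collects <stream> children until its closing tag. The dictionary layout is hypothetical.

```python
from html.parser import HTMLParser


class TrackParser(HTMLParser):
    """Collects <track>/<stream> elements and <source> fallbacks."""

    def __init__(self):
        super().__init__()
        self.tracks = []
        self.fallbacks = []
        self._open = None  # track currently collecting <stream> children

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "track":
            track = {"kind": a.get("kind"), "type": a.get("type"),
                     "lang": a.get("lang"), "streams": []}
            self.tracks.append(track)
            if "src" in a:
                # Assumption: src on <track> means exactly one stream.
                track["streams"].append({"src": a["src"]})
                self._open = None
            else:
                self._open = track
        elif tag == "stream" and self._open is not None:
            self._open["streams"].append({
                "src": a["src"],
                "width": int(a["width"]),
                "bitrate": int(a["bitrate"]),
            })
        elif tag == "source":
            self.fallbacks.append(a.get("src"))

    def handle_endtag(self, tag):
        if tag == "track":
            self._open = None


MARKUP = """
<video width="640" height="360" controls>
    <track kind="video" type="video/webm">
        <stream src="video-300.webm" width="240" bitrate="300">
        <stream src="video-900.webm" width="640" bitrate="900">
    </track>
    <track kind="audio" src="audio.webm" type="audio/webm" lang="en">
    <track kind="captions" lang="en" src="captions-en.srt">
    <source type="video/mp4" src="video.mp4">
</video>
"""

parser = TrackParser()
parser.feed(MARKUP)
parser.close()
```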

With such a model (again, still ignoring the API entirely), current content could be deployed as adaptive streams (if we want to offer this; perhaps we should simply disallow interleaved streams in tracks):

<video width="640" height="360" controls>
    <!-- Adaptive stream -->
    <track kind="audio+video" type="video/webm">
        <stream src="video-300.webm" width="240" bitrate="300">
        <stream src="video-900.webm" width="640" bitrate="900">
    </track>
    <!-- Progressive stream -->
    <source src="video-300.webm" type="video/webm">
</video>
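
How a client would choose between the two <stream>s is not specified above. One common heuristic, shown here purely as an assumption, is to take the highest bitrate that fits the measured bandwidth and fall back to the lowest one otherwise. The function name and the bandwidth figures are hypothetical.

```python
def pick_stream(streams, bandwidth_kbps):
    """Return the highest-bitrate stream that fits, else the lowest one."""
    fitting = [s for s in streams if s["bitrate"] <= bandwidth_kbps]
    if fitting:
        return max(fitting, key=lambda s: s["bitrate"])
    return min(streams, key=lambda s: s["bitrate"])


streams = [
    {"src": "video-300.webm", "width": 240, "bitrate": 300},
    {"src": "video-900.webm", "width": 640, "bitrate": 900},
]

fast = pick_stream(streams, 1000)  # enough bandwidth for the 900 kbps stream
slow = pick_stream(streams, 500)   # only the 300 kbps stream fits
tiny = pick_stream(streams, 100)   # nothing fits, so use the lowest bitrate
```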

Kind regards,

Jeroen

On Nov 17, 2010, at 6:09 PM, Mark Watson wrote:

> What is more important than the specific syntax (JSON, XML, M3U8 etc.) is the data model or abstract syntax. Once you have that you can map pretty easily to any specific syntax. It would be good to discuss what is needed at that level first.
> 
> Roughly, something like the following:
> - some notion of <track> which is composed of one or more <stream>s (the specific terms can be changed).
> - A <track> is a single media type or an interleaved combination of multiple types. The <stream>s it contains are different encodings of the exact same source media (so different audio languages are different <track>s). If the <track> contains multiple media types, every <stream> has all those media types interleaved, i.e. within a <track> either all <stream>s contain interleaved media or none do. In the second case all the <stream>s in the track contain the same single media type.
> - a way to annotate both <track>s and <stream>s with their properties for selection purposes: file format, media type(s), codecs/profiles, language, video resolution, pixel aspect ratio, frame rate, bandwidth, accessibility type (e.g. audio description of video, sign language) or other track type info (e.g. director's commentary). Maybe this is too many, but annotations like this are cheap: clients just ignore tracks with annotations they do not understand. If all the <stream>s in a track have the same value for an annotation, then you can annotate the <track>; otherwise annotate the <stream> (that is just an optimization).
> - access information for each <stream>. EITHER
> 	(i) a URL for a single file containing the whole stream, including stream headers and an index, OR
> 	(ii) a URL for the stream headers and a list of URLs for the chunks and timing information for the chunks (could be constant chunk size)
> By stream headers I mean initialization information that applies to the whole stream.
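
The annotation rules above could be sketched as follows. This is one interpretation, not a definitive reading: track-level annotations are inherited by every stream, stream-level values override them, and a stream whose effective annotations include a name the client does not know is ignored. The KNOWN set and the "stereo3d" annotation are made up for illustration.

```python
# Annotation names this hypothetical client understands.
KNOWN = {"type", "lang", "width", "bitrate"}


def effective_annotations(track, stream):
    """Track annotations apply to every stream; stream values override."""
    merged = dict(track["annotations"])
    merged.update(stream["annotations"])
    return merged


def usable_streams(track, known=KNOWN):
    """Keep only streams whose effective annotations are all understood."""
    usable = []
    for stream in track["streams"]:
        annotations = effective_annotations(track, stream)
        if set(annotations) <= known:
            usable.append((stream["src"], annotations))
    return usable


track = {
    "annotations": {"type": "video/webm"},
    "streams": [
        {"src": "video-300.webm", "annotations": {"width": 240, "bitrate": 300}},
        {"src": "video-900.webm", "annotations": {"width": 640, "bitrate": 900}},
        # Carries an annotation the client does not know, so it is skipped.
        {"src": "video-3d.webm", "annotations": {"bitrate": 900, "stereo3d": "side-by-side"}},
    ],
}

result = usable_streams(track)
```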
> 
> Some additional details:
> - we've discussed various ways that chunks (however obtained) are aligned/can be concatenated or not. Additional <track> annotations are needed (IMO) to tell the player what properties the streams have in terms of chunk alignment, RAP positions etc. (Compatibility in terms of codecs etc. should be clear from the annotations)
> - you might want to use templates instead of long URL lists (as per another mail thread). If you do use long URL lists, you might want to be able to store them in separate files ("submanifests").
> - wherever a URL appears, it's useful (and for our service essential) to be able to provide a list of alternative URLs (in the same way DNS provides a list of alternative IP addresses). We use this for redundancy across CDNs.
> - how you find the headers and index in case (i) and their format may be file format dependent.
> - if the group wants to choose a single option between (i) and (ii) then I would obviously recommend (i). But you might want to include both options.
> - content protection is a whole big topic which may not be of so much concern to this group. But the manifest aspects can be handled easily with additional annotations.
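
Two of the points above, URL templates and alternative URLs, can be illustrated with a small sketch. The "{index}" template syntax, the cdn-a/cdn-b hosts and the fake_fetch helper are all hypothetical; the template format from the other mail thread is not reproduced here.

```python
def expand(template, count):
    """Expand a chunk URL template into a list of chunk paths."""
    return [template.format(index=i) for i in range(count)]


def fetch_with_failover(base_urls, path, fetch):
    """Try each alternative base URL in order until one fetch succeeds."""
    last_error = None
    for base in base_urls:
        try:
            return fetch(base + path)
        except OSError as err:
            last_error = err
    if last_error is None:
        raise ValueError("no base URLs given")
    raise last_error


def fake_fetch(url):
    """Stand-in for a real HTTP request; pretends that cdn-a is down."""
    if url.startswith("http://cdn-a.example/"):
        raise OSError("cdn-a unreachable")
    return "bytes of " + url


chunks = expand("video-900-{index}.webm", 3)
body = fetch_with_failover(
    ["http://cdn-a.example/", "http://cdn-b.example/"],
    chunks[0],
    fake_fetch,
)
```

Passing the fetch function in as a parameter keeps the failover logic independent of any particular HTTP library.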
> 
> If there is support for this kind of "data model" approach, then I'd be happy to write up a more detailed proposal based on whatever discussion is triggered by the above.
> 
> ...Mark