[foms] Proposal: adaptive streaming using open codecs

Jeroen Wijering jeroen at longtailvideo.com
Wed Dec 15 05:04:23 PST 2010


Hey Mark,

There's indeed both the headers and the index. I'd say the headers (mostly of interest to the decoder) are not exposed to JavaScript, and that the fragment representation in JS is already index-mapped (i.e. in JavaScript you get time offsets, not byte ranges). There's no way to grab an individual fragment à la what a WebSocket could do.

Exposing the manifest / track / stream info to JavaScript is a good idea, because: 

*) It will allow web designers to set up an adaptive stream both inline (simple, no HTTP delay) and dynamically (e.g. using JS).
*) It will allow web developers to influence / override the switching heuristics. 


Perhaps a first version of the browser's switching heuristics is simply: "grab the first stream I can play". This is how the logic around <source> tags works. As with subtitle and caption tracks, the browser's built-in video control bar can then display a dropdown for switching to another stream (much like YouTube does). 
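
As a rough sketch of that heuristic: the canPlayType() call is the existing HTML5 API, but the tracks array (carrying each <track>'s kind and type attributes) is hypothetical:

  // Pick the first video track the browser reports it can play,
  // mirroring today's <source> selection logic.
  function pickFirstPlayable(videoElement, tracks) {
    for (var i = 0; i < tracks.length; i++) {
      // canPlayType() returns "", "maybe" or "probably".
      if (tracks[i].kind === "video" &&
          videoElement.canPlayType(tracks[i].type) !== "") {
        return i;
      }
    }
    return -1; // nothing playable
  }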

Now, if streams / cues are exposed to javascript, the javascript layer can actually tell the player to switch to another stream, thereby implementing the "adaptive" part of adaptive streaming:

  videoElement.tracks[0].setStream(3) // Switch to stream 3 for the first track (video) at the browser's convenience.
  videoElement.tracks[0].setStream(3, true) // Switch to stream 3 for the first track (video) right now.
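
Building on that (all names hypothetical: setStream() as proposed above, plus a per-stream bitrate annotation and a bandwidth measurement supplied by the caller), the "adaptive" logic in JS could be as simple as:

  // Pick the highest-bitrate stream that fits the measured bandwidth,
  // assuming streams are sorted by ascending bitrate (in kbps).
  function adapt(videoElement, measuredKbps) {
    var track = videoElement.tracks[0]; // the video track
    var best = 0;
    for (var i = 0; i < track.streams.length; i++) {
      if (track.streams[i].bitrate <= measuredKbps * 0.8) {
        best = i; // keep 20% headroom
      }
    }
    track.setStream(best); // switch at the browser's convenience
  }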


Kind regards,

Jeroen




On Dec 14, 2010, at 10:07 PM, Mark Watson wrote:

> I would just add one thing to this which is that a [stream] can consist of [header], [index], [fragments]. By [index] I mean some kind of time range -> byte range mapping. For simplicity we could restrict to two cases:
> 
> (i) [header][index][fragments] are all packed into one file (on-demand case)
> (ii) [header] and each [fragment] are in separate files and there is no index, just a list of URLs (live case)
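> 
> To make the [index] idea concrete, here is a minimal sketch of the time range -> byte range mapping for case (i); the in-memory index structure is made up, and the real on-disk layout would be file format dependent:
> 
>   // Hypothetical index: an array of { time, offset } entries,
>   // sorted by ascending time (seconds -> byte position).
>   // Assumes a non-empty index starting at time 0.
>   function timeToByteRange(index, seconds) {
>     var entry = index[0], next = null;
>     for (var i = 1; i < index.length; i++) {
>       if (index[i].time <= seconds) { entry = index[i]; }
>       else { next = index[i]; break; }
>     }
>     // Byte range of the fragment covering the requested time.
>     return { start: entry.offset,
>              end: next ? next.offset - 1 : null }; // null = to end of file
>   }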
> 
> One question: do we really need to expose all of this in HTML5? I had been assuming there would be a manifest file, and that the complexity of tracks and streams would be hidden within it. So the adaptive streaming part would not be HTML5-specific. All we need to expose are the choices that can't be made automatically (languages etc.).
> 
> ...Mark
> 
> 
> On Dec 14, 2010, at 3:44 AM, Jeroen Wijering wrote:
> 
>> Hello,
>> 
>> We've had various discussions (like the one below) on the data model to use for adaptive streaming. We also talked about adaptive streaming needing to work without a manifest in HTML5. Now, I think everybody would agree the data format for an adaptive stream is quite straightforward: 
>> 
>> [manifest]
>>   [track]
>>       [stream]
>>           [fragment]
>> 
>> The manifest is the entire presentation; the track represents a single layer (video, audio, image, text); the stream represents a single rendition of a layer (at 200 kbps or 1500 kbps); the fragment represents a single access unit (a GOP for audio/video, a cue for text). Correct? 
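>> 
>> Purely as illustration, the same model written out as a JavaScript object (all names and values are placeholders):
>> 
>>   var manifest = {
>>     tracks: [{
>>       kind: "video",
>>       streams: [{
>>         bitrate: 300,        // kbps, this rendition's bitrate
>>         fragments: [         // one access unit each (GOP / cue)
>>           { start: 0,  duration: 10, url: "video-300-0.webm" },
>>           { start: 10, duration: 10, url: "video-300-1.webm" }
>>         ]
>>       }]
>>     }]
>>   };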
>> 
>> 
>> Representing this info in HTML5 is a bit difficult, since there's already the <source> tag (which is basically an audio+video [track]). On top of that, the <track> tag might get hijacked to be used only for text data (captions, cuepoints and such). Instead, it seems we should look at having the <track> tag represent any type of track in a presentation (not just text). Example:
>> 
>> <video width="640" height="360" controls>
>>   <track kind="video" type="video/webm">
>>       <stream src="video-300.webm" width="240" bitrate="300">
>>       <stream src="video-900.webm" width="640" bitrate="900">
>>   </track>
>>   <track kind="audio" src="audio.webm" type="audio/webm" lang="en">
>>   <track kind="audio" src="french.webm" type="audio/webm" lang="fr">
>>   <track kind="video" type="video/mp4">
>>       <stream src="video-300.mp4" width="240" bitrate="300">
>>       <stream src="video-900.mp4" width="640" bitrate="900">
>>   </track>
>>   <track kind="audio" src="audio.m4a" type="audio/webm" lang="en">
>>   <track kind="audio" src="french.m4a" type="audio/webm" lang="fr">
>>   <track kind="captions" lang="en" src="captions-en.srt">
>>   <track kind="captions" lang="fr" src="captions-fr.srt">
>>   <!-- Fallbacks (no captions, no adaptive, no french audio) -->
>>   <source type="video/mp4" src="video.mp4">
>>   <source type="video/ogg" src="video.ogg">
>> </video>
>> 
>> Is this totally outrageous, or could this work (given the state of <track>)? You can see how the current cuepoints API could be extended to also work for media fragments (cuepoints would represent HTTP chunks / GOP boundaries / switching points).
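>> 
>> For example, with cuepoints marking fragment boundaries, a stream switch could be scheduled at the next boundary. A minimal sketch; the cues array (with a startTime per fragment) is hypothetical:
>> 
>>   // Find the next switching point (fragment boundary) after
>>   // the current playback position.
>>   function nextSwitchPoint(cues, currentTime) {
>>     for (var i = 0; i < cues.length; i++) {
>>       if (cues[i].startTime > currentTime) {
>>         return cues[i].startTime;
>>       }
>>     }
>>     return null; // no boundary left
>>   }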
>> 
>> 
>> With such a model (again, still ignoring the entire API), current content could be deployed as adaptive streams (if we want to offer this - perhaps we should just disallow interleaved streams in tracks):
>> 
>> <video width="640" height="360" controls>
>>   <!-- Adaptive stream -->
>>   <track kind="audio+video" type="video/webm">
>>       <stream src="video-300.webm" width="240" bitrate="300">
>>       <stream src="video-900.webm" width="640" bitrate="900">
>>   </track>
>>   <!-- Progressive stream -->
>>   <source src="video-300.webm" type="video/webm">
>> </video>
>> 
>> 
>> Kind regards,
>> 
>> Jeroen
>> 
>> 
>> 
>> 
>> On Nov 17, 2010, at 6:09 PM, Mark Watson wrote:
>> 
>>> What is more important than the specific syntax (JSON, XML, M3U8 etc.) is the data model or abstract syntax. Once you have that you can map pretty easily to any specific syntax. It would be good to discuss what is needed at that level first.
>>> 
>>> Roughly, something like the following:
>>> - some notion of <track> which is composed of one or more <stream>s (the specific terms can be changed).
>>> - A <track> is a single media type or an interleaved combination of multiple types. The <stream>s it contains are different encodings of the exact same source media (so different audio languages are different <track>s). Within a <track>, either all <stream>s contain interleaved media or none do: if the <track> contains multiple media types, every <stream> has all those media types interleaved; otherwise, all the <stream>s in the track contain the same single media type.
>>> - a way to annotate both <track>s and <stream>s with their properties for selection purposes: file format, media type(s), codecs/profiles, language, video resolution, pixel aspect ratio, frame rate, bandwidth, accessibility type (e.g. audio description of video, sign language) or other track type info (e.g. director's commentary). Maybe this is too many, but annotations like this are cheap - clients just ignore tracks with annotations they do not understand. If all the <stream>s in a track have the same value for an annotation, then you can annotate the <track>; otherwise annotate the <stream> (that is just an optimization).
>>> - access information for each <stream>. EITHER
>>> 	(i) a URL for a single file containing the whole stream, including stream headers and an index, OR
>>> 	(ii) a URL for the stream headers and a list of URLs for the chunks and timing information for the chunks (could be constant chunk size)
>>> By stream headers I mean initialization information that applies to the whole stream.
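>>> 
>>> As a sketch (all names are placeholders), the access information for the two cases could look like:
>>> 
>>>   // Case (i): one file containing headers, index and fragments.
>>>   var onDemand = { url: "video-900.webm" };
>>> 
>>>   // Case (ii): a separate header file plus an explicit chunk list
>>>   // with timing information (here a constant chunk size).
>>>   var live = {
>>>     headerUrl: "video-900-header.webm",
>>>     chunkDuration: 10, // seconds per chunk
>>>     chunks: ["video-900-0.webm", "video-900-1.webm", "video-900-2.webm"]
>>>   };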
>>> 
>>> Some additional details:
>>> - we've discussed various ways in which chunks (however obtained) may or may not be aligned / concatenatable. Additional <track> annotations are needed (IMO) to tell the player what properties the streams have in terms of chunk alignment, RAP positions etc. (Compatibility in terms of codecs etc. should be clear from the annotations.)
>>> - you might want to use templates instead of long URL lists (as per another mail thread; see the sketch after this list). If you do use long URL lists, you might want to store them in separate files ("submanifests").
>>> - wherever a URL appears, it's useful (and for our service essential) to be able to provide a list of alternative URLs (in the same way DNS provides a list of alternative IP addresses). We use this for redundancy across CDNs.
>>> - how you find the headers and index in case (i), and their format, may be file format dependent.
>>> - if the group wants to choose a single option between (i) and (ii) then I would obviously recommend (i). But you might want to include both options.
>>> - content protection is a whole big topic which may not be of so much concern to this group. But the manifest aspects can be handled easily with additional annotations.
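>>> 
>>> A minimal sketch of the template idea; the $Number$ placeholder syntax is made up:
>>> 
>>>   // Expand a chunk URL template instead of storing a long URL list.
>>>   function chunkUrl(template, n) {
>>>     return template.replace("$Number$", n);
>>>   }
>>> 
>>>   chunkUrl("http://cdn.example.com/video-900-$Number$.webm", 42);
>>>   // -> "http://cdn.example.com/video-900-42.webm"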
>>> 
>>> If there is support for this kind of "data model" approach, then I'd be happy to write up a more detailed proposal based on whatever discussion is triggered by the above.
>>> 
>>> ...Mark


