[foms] Proposal: adaptive streaming using open codecs
jeroen at longtailvideo.com
Wed Oct 20 05:24:28 PDT 2010
On Oct 19, 2010, at 5:53 PM, Philip Jägenstedt wrote:
>> Every chunk should be a valid video file (header, videotrack, audiotrack). Every chunk should also contain at least 1 keyframe (at the start). This implies every single chunk can be played back by itself.
>> Beyond validity, the amount of metadata should be kept as small as possible (single-digit kbps overhead).
>> Codec parameters that can vary between the different quality levels of an adaptive stream are:
>> * The datarate, dimensions (pixel+display) and framerate of the video track.
>> * The datarate, number of channels and sample frequency of the audio track.
>> In order for quality level switches to occur without artifacts, the start positions of all chunks should align between the various quality levels. If this isn't the case, user-agents will display artifacts (ticks, skips, black) when a quality level switch occurs. Syncing should not be a requirement though. This will allow legacy content to be used for dynamic streaming with little effort (e.g. remuxing or using a smart server) and few issues (in practice, most keyframes are aligned between different transcodes of a video).
>> In its most low-tech form, chunks can be stored as separate files-on-disc on a webserver. This poses issues around transcoding (no ecosystem yet) and file management (not everybody loves 100s of files). A small serverside module can easily fix these issues:
>> * User-agent requests from chunks are translated to byte-range requests inside a full video.
>> * The data is pulled from the video.
>> * The data is wrapped into a valid WebM file.
>> * (The resulting chunk is cached locally.)
> If the decoder is unaware that there are chunks to begin with, the requirements don't need to be this strict. It would just as well be possible to have bare slices of data and making sure that a full header is provided only when switching streams. Basically, it would be up to the web author, but the above would work if we treat it as chained WebM.
Yes, good idea. I replied on this in a separate email. Basically, if somebody else would be interested in writing use cases / draft proposal for this part, that'd be great. I know too little in this area...
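To make the serverside module from the proposal above a bit more concrete, here is a rough sketch of its first two steps (chunk request » byte range » data pull). The chunk index structure and all names are assumptions for illustration, not part of the proposal:

```typescript
// Rough sketch of the byte-range lookup inside such a serverside
// module. The index structure is an assumption: one keyframe-aligned
// byte offset per chunk, plus a trailing end-of-file offset.
interface ChunkIndex {
  offsets: number[];
}

// Returns the [start, end) byte range for chunk n inside the full
// video; the module would read this range and wrap it into a valid
// WebM file (header, video track, audio track) before responding.
function chunkRange(index: ChunkIndex, n: number): [number, number] {
  if (n < 0 || n >= index.offsets.length - 1) {
    throw new RangeError(`no such chunk: ${n}`);
  }
  return [index.offsets[n], index.offsets[n + 1]];
}
```

Building the index once per video keeps each chunk request cheap, and the resulting chunk can be cached locally as the proposal suggests.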
>> The M3U8 manifest format that Apple specified (http://tools.ietf.org/html/draft-pantos-http-live-streaming-04) is adopted. Generally, both an overall manifest (linking to the various quality levels) and a quality level manifest (linking to the various stream levels) are used. (Though, especially for live streaming, a single quality level may be used).
>> Here's an example of such an overall manifest. It specifies three quality levels, each with its own datarate, codecs and dimensions:
>> Here's an example manifest for one such quality level. It contains a full URL listing of all chunks for this quality level:
>> The video framerate, audio sample frequency and number of audio channels cannot be listed here according to the specs. Both in WebM and in MPEG-TS (the container Apple specifies), this can be retrieved during demuxing.
>> The M3U8 playlist format can be used to provide *sliding windows* (for livestreaming). Additionally, regular ID3 tags can be used to enrich the manifest with metadata.
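For readers unfamiliar with the format: manifests of the shape described above could look roughly like this under draft-pantos-04. All values (bitrates, resolutions, URLs, durations) are illustrative, not taken from the proposal:

```
# Overall manifest (one entry per quality level):
#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=100000,CODECS="vp8,vorbis",RESOLUTION=240x180
http://example.com/manifest_100.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=500000,CODECS="vp8,vorbis",RESOLUTION=640x360
http://example.com/manifest_500.m3u8

# Quality level manifest (full URL listing of its chunks):
#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10,
http://example.com/video_100_00001.webm
#EXTINF:10,
http://example.com/video_100_00002.webm
#EXT-X-ENDLIST
```

Note the #EXT-X-ENDLIST tag at the end of the quality level manifest: as mentioned below, its absence is what signals a live (sliding window) playlist that must be re-fetched.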
> The constant re-fetching of a manifest is quite unappealing and not something I'd be very happy to build in as a part of <video>. This is quite an easy problem to solve, and I'd be happy to let developers roll their own, perhaps one of:
> * ever-increasing number
> * JSON manifest
> * URL of next chunk being delivered via WebSockets (in the future, the data itself could be as well, but that's certainly not for dumb servers)
I think having a small "Manifest API", like Christopher proposed, would solve this issue indeed. I'll try to draft up a proposal for that as well. You have any ideas?
Again do note that on-demand manifests will never be re-fetched. Re-fetching only occurs if there's no #EXT-X-ENDLIST tag present. This typically is only (and very) useful for live.
>> The root manifest serves as the single, unique reference point for an adaptive stream. Therefore, user agents need solely its URL to play back the stream.
>> Here's an example of loading a root manifest through the *src* attribute of the <video> tag in an HTML page:
>> <video width="480" height="270" src="http://example.com/video.m3u8">
>>   <a href="http://example.com/video_low.webm">Download the video</a>
>> </video>
>> In this variation, the manifest is loaded through the <source> tag, to provide fallback logic:
>> <video width="480" height="270">
>>   <source src="http://example.com/video.m3u8" type="video/m3u8">
>>   <source src="http://example.com/video_low.webm" type="video/webm">
>>   <a href="http://example.com/video_low.webm">Download the video</a>
>> </video>
>> Here's another example of loading the manifest, through the *enclosure* element in an RSS feed:
>> <rss version="2.0">
>>   <channel>
>>     <title>Example feed</title>
>>     <description>Example feed with a single adaptive stream.</description>
>>     <item>
>>       <title>Example stream</title>
>>       <enclosure length="1487" type="video/m3u8"
>>                  url="http://example.com/video.m3u8" />
>>     </item>
>>   </channel>
>> </rss>
>> Like the manifest parsing, the switching heuristics are up to the user-agent. They can be somewhat of a *secret sauce*. As a basic example, a user-agent can select a quality level if:
>> * The *bitrate* of the level is < 90% of the server » client *downloadRate*.
>> * The *videoWidth* of the level is < 120% of the video element *width*.
>> * The delta in *droppedFrames* is < 25% of the delta in *decodedFrames* for this level.
>> Since droppedFrames are only known after a level has started playing, it is generally only a reason for switching down. Based upon the growth rate of droppedFrames, a user-agent might choose to blacklist the quality level for a certain amount of time, or discard it altogether for this playback session.
>> The quality level selection occurs at the start of every chunk URL fetch. Given an array of levels, the user-agent starts with the highest quality level first and then walks down the list. If the lowest-quality level does not match the criteria, the user-agent still uses it (else there would be no video).
>> A user-agent typically tries to maintain X (3, 10, 20) seconds of video ready for decoding (buffered). If less than X seconds is available, the user-agent runs its quality level selection and requests another chunk.
>> There is a tie-in between the length of a chunk, the bufferLength and the speed with which a user-agent adapts to changing conditions. For example, should the bandwidth drop dramatically, 1 or 2 high-quality chunks will still be played from buffer before the first lower-quality chunk is shown. The other way around is also true: should a user go fullscreen, it will take some time until the stream switches to high quality. Lower bufferLengths increase responsiveness but also increase the possibility of buffer underruns.
Yes, that's a good idea for first steps.
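As a sketch of the walk-down selection quoted above: the Level shape and parameter names here are assumptions for illustration, not a settled API.

```typescript
// Hypothetical walk-down level selection using the three example
// criteria: bitrate vs. bandwidth, encoded width vs. display width,
// and the droppedFrames/decodedFrames delta.
interface Level {
  bitrate: number; // bits per second, from the manifest
  width: number;   // encoded video width in pixels
}

function selectLevel(
  levels: Level[],       // sorted from highest to lowest quality
  downloadRate: number,  // measured server->client bandwidth (bits/s)
  elementWidth: number,  // current width of the <video> element
  droppedDelta: number,  // droppedFrames delta for the candidate level
  decodedDelta: number   // decodedFrames delta for the candidate level
): Level {
  for (const level of levels) {
    const fitsBandwidth = level.bitrate < 0.9 * downloadRate;
    const fitsDisplay = level.width < 1.2 * elementWidth;
    const fitsDecoder =
      decodedDelta === 0 || droppedDelta < 0.25 * decodedDelta;
    if (fitsBandwidth && fitsDisplay && fitsDecoder) return level;
  }
  // If even the lowest level fails the criteria, use it anyway
  // (else there would be no video).
  return levels[levels.length - 1];
}
```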
>> The video element provides accessors for retrieving quality of service metrics:
>> * *downloadRate*: The current server-client bandwidth (read-only).
> This is already available in an obscure form in some browsers via the buffered attribute. If a lot of people need it, we could expose it of course, but then preferably as a seconds/second metric, to match the rest of the API.
The pro of "downloadRate" (or "bandwidth") over a seconds/second mechanism is that you can compare it to the bitrates of streams that are NOT playing. In other words: my bandwidth dropped, so I cannot play level 1 anymore. But should I switch to level 2, 3 or 4?
With a "seconds/second" mechanism, you'd always be calculating it back to the bandwidth and then comparing it to the bitrates.
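The calculation in question is trivial, which is rather the point; a sketch (names are illustrative):

```typescript
// Converting a seconds/second download metric back into raw
// bandwidth: downloading N seconds of media per wall-clock second
// of a stream encoded at `currentBitrate` implies roughly
// N * currentBitrate of bandwidth. Only that raw number can be
// compared against the manifest bitrates of levels NOT playing.
function bandwidthFromRate(
  secondsPerSecond: number,
  currentBitrate: number
): number {
  return secondsPerSecond * currentBitrate;
}
```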
>> * *decodeRate*: The current level's bitrate (read-only).
> What's this? The number of frames already decoded but not yet rendered?
It is the (audio+video) bitrate of the currently playing chunk. It duplicates the info in the manifest, but I added it for completeness. Chris Double implemented it like this for Mozilla.
It could be renamed to something like "bitrate" for consistency, or removed.
>> * *droppedFrames*: The total number of frames dropped for this playback session (read-only).
>> * *decodedFrames*: The total number of frames decoded for this playback session (read-only).
> Yep, what Firefox has. Is this the metric you prefer? My guess is that you'd be more interested in the performance around now (a window of X seconds) than globally, especially when the video stream has switched from low to high quality or vice versa.
This metric is more "raw", so there's more to do with it. For example, I could setup my own sliding window and calculate the droppedFPS with it, e.g. for blacklisting heuristics with longer half-life times. I could also ignore sudden spikes in the total number of droppedFrames (probably CPU was shortly working on something else), etc.
In most cases, both droppedFPS and currentFPS would be enough (compare; then blacklist if more than X% is dropped), so we could use this instead. Chris Double would then have to update his patch though...
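The sliding-window calculation described above could be sketched like this; the Sample shape and names are assumptions for illustration:

```typescript
// Computing a dropped-frames ratio over a sliding window from the
// raw droppedFrames / decodedFrames counters, e.g. for blacklisting
// heuristics with longer half-life times.
interface Sample {
  time: number;    // seconds, when the counters were read
  dropped: number; // total droppedFrames at that moment
  decoded: number; // total decodedFrames at that moment
}

// Fraction of frames dropped over the last `windowSec` seconds.
function droppedRatio(
  samples: Sample[], // ordered oldest to newest
  now: number,
  windowSec: number
): number {
  const recent = samples.filter(s => now - s.time <= windowSec);
  if (recent.length < 2) return 0;
  const first = recent[0];
  const last = recent[recent.length - 1];
  const decodedDelta = last.decoded - first.decoded;
  if (decodedDelta <= 0) return 0;
  return (last.dropped - first.dropped) / decodedDelta;
}
```

A script could sample the raw counters every second, feed them into a window like this, and blacklist a level once the ratio stays above some threshold.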
>> * *height*: The current height of the video element (already exists).
>> * *videoHeight*: The current height of the video file (already exists).
>> * *width*: The current width of the video element (already exists).
>> * *videoWidth*: The current width of the video file (already exists).
>> In addition to this, the video element provides access to the stream levels:
>> * *currentLevel*: The currently playing stream level.
>> * *levels*: An array of all stream levels (as parsed from the manifests). Example:
>> [ {
>>   bitrate: 100000,
>>   codecs: 'vp8,vorbis',
>>   duration: 132,
>>   height: 180,
>>   url: 'manifest_100.m3u8',
>>   width: 240
>> }, {
>>   bitrate: 500000,
>>   codecs: 'vp8,vorbis',
>>   duration: 132,
>>   height: 360,
>>   url: 'manifest_500.m3u8',
>>   width: 640
>> } ]
>> In addition to this, the video element provides an event to notify scripts of changes in the current stream level:
>> * *levelChange*: the currentLevel attribute has just been updated.
>> Last, the video element provides functionality to override the user agent's built-in heuristics:
>> * *setLevel(level)*: This method forces the user agent to switch to another stream level. Invoking this method disables a user-agent's adaptive streaming heuristics. Use *setLevel(-1)* to enable heuristics again.
> So, this is where I'm not entirely supportive. Keeping track of several different streams in the same <video> element becomes a bit messy, as the state of HTMLMediaElement then becomes a bit weird. How would one interpret the buffered ranges, videoWidth, videoHeight, etc, when these will be different for the different streams? Letting the video element pretend that there's just a single infinite stream would be simpler, in this regard.
It is the core of easy-to-use adaptive streaming though: the browser parses the manifest, tracks the different streams and applies switching heuristics. This is indeed a huge amount of work.
Regardless, items like videoWidth / videoHeight / currentTime / duration would indeed change mid-stream as e.g. the user manipulates the manifest or the manifest contains chunks of different dimensions. A single infinite stream makes total sense, since it discards the need to update e.g. currentTime/duration.
bufferedRanges would not be needed for a dynamic streaming proposal, since a seek always implies a re-evaluation of heuristics and re-fetching of chunks (perhaps at different bitrates if conditions changed). Having a single entry in bufferedRanges (the current buffer) would be nice. That way, scripts can compare the desired buffer size (see next paragraph) with the current one.
>> * *bufferLength*: This attribute controls how much video data (in seconds) a user-agent should strive to keep buffered.
>> An important example for *bufferLength*: a website owner might set this to a very high value to enable viewers on a low bandwidth to wait for buffering and still see a high-quality video.
> Right, this would be useful in general as well, I think, and it's a magic constant which exists somewhere inside the browser anyway if it tries to conserve bandwidth at all.
Yes indeed. The use case of "I have a low connection but want to wait for the HD to load" is huge. Allowing the bufferlength to be set solves these issues.
>> Finally, some rationale for the choices made in this proposal. Why chunks and a manifest? Why not, for example, range-requests and <source> tags?
>> First and foremost, we need a format that works not only in HTML5 browsers, but also in, for example, mobile apps (Android/Blackberry/iOS), desktop players (Miro/Quicktime/VLC) and big screen devices (Roku, Boxee, PS3). Especially for the very small screens (3G network) and large screens (full HD), adaptive streaming is incredibly valuable. Tailoring a solution too much towards the HTML5 syntax and browser environment will hinder broad adoption of an open video standard. Adaptive streaming and HTML5 should work nicely together, but adaptive streaming should not rely on HTML5.
>> That said:
>> * Providing the low-tech scenario of storing chunks as separate files on the webserver enables adaptive streaming in cases where either the server, the user-agent (apps / players / settops) or the network (firewalls, cellulars) does not support something like range-requests. As an example, implementing adaptive streaming using range-requests in Adobe Flash (e.g. as temporary fallback) would not be possible, since the range-request header is blocked.
> Have you seen this problem a lot? As you know, all browsers implementing <video> use range requests for seeking. So far, I haven't seen any problems reported with it. That's not to say that there are no problems, it's just that there's not a lot of <video> content out there yet.
With player support, I heard of only one or two cases where range requests were blocked, or the server was misconfigured. Not a big deal.
I think the bigger issue is devices (user-agents). Flash is one example, but we also just worked a little on an Android video app. Getting range requests out of it was tricky depending upon the HTTPRequest lib you used, and it required some investigation. I'm afraid not all mobile / settop systems will implement range requests.
>> * Ecosystem partners (CDNs, encoding providers, landmark publishers, etc) are already getting used to and building tools around the concept of *chunked* video streams. Examples are log aggregators that roll up chunk servings into a single logline, or encoders that simultaneously build multiple stream levels, chunk them up and render their manifests.
> Having said "no" to so much, I should contribute something positive as well... Apart from disagreeing on how much should go into the browser, I think we all agree that the lower-level building blocks *should* go into the browser.
> What I'm proposing is that the lower-level API be one that allows multiple URLs to be treated as a single stream of bytes from the demuxer's perspective. The Stream API certainly has a suitable name for it, so perhaps it could be hijacked for this purpose.
>  http://www.whatwg.org/specs/web-apps/current-work/multipage/commands.html#stream-api
Yes, I totally get your point and I think this should be the first step as well. It will dramatically decrease the work that needs to be done for browser vendors (though still not insignificant ;). Again, a spec for the "Stream API" (better than "Manifest API"?) is a much needed next step.