[foms] Proposal: adaptive streaming using open codecs
jeroen at longtailvideo.com
Mon Oct 18 04:58:19 PDT 2010
Here is a (rough and incomplete) proposal for doing adaptive streaming using open video formats. WebM is used as an example, but all points should apply to Ogg as well. Key components are:
* Videos are served as separate, small chunks.
* Accompanying manifest files provide metadata.
* The user-agent parses manifests and switches between stream levels.
* An API provides QOS metrics and enables custom switching logic.
What do you think of this approach - and its rationale? Any technical issues (especially on the container side) or non-technical objections?
Every chunk should be a valid video file (header, videotrack, audiotrack). Every chunk should also contain at least 1 keyframe (at the start). This implies every single chunk can be played back by itself.
Beyond validity, the amount of metadata should be kept as small as possible (single-digit kbps overhead).
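To put a number on that overhead budget, here is a back-of-the-envelope sketch. It assumes the only per-chunk overhead is a container header repeated in every chunk. The 4 KB header size and the chunk durations are made-up figures, not measurements of real WebM files.

```python
def header_overhead_kbps(header_bytes: int, chunk_seconds: float) -> float:
    """Bitrate cost, in kbps, of repeating a header once per chunk."""
    return header_bytes * 8 / chunk_seconds / 1000.0


# Hypothetical figures: a 4 KB header repeated in every chunk.
for seconds in (2, 5, 10):
    overhead = header_overhead_kbps(4096, seconds)
    print(f"{seconds:>2}s chunks: {overhead:.1f} kbps overhead")
```

Under these assumptions the overhead shrinks linearly with chunk duration, so longer chunks make the single-digit target easier to reach.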
Codec parameters that can vary between the different quality levels of an adaptive stream are:
* The datarate, dimensions (pixel+display) and framerate of the video track.
* The datarate, number of channels and sample frequency of the audio track.
In order for quality level switches to occur without artifacts, the start positions of all chunks should align between the various quality levels. If this isn't the case, user-agents will display artifacts (ticks, skips, black) when a quality level switch occurs. Syncing should not be a requirement though. This will allow legacy content to be used for dynamic streaming with little effort (e.g. remuxing or using a smart server) and few issues (in practice, most keyframes are aligned between different transcodes of a video).
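A tool that prepares content could check this alignment before publishing. The sketch below is one possible way to do that, assuming the chunk start times (in seconds) of each quality level are already known. The function name, the tolerance and the sample data are all hypothetical.

```python
from typing import Dict, List


def misaligned_starts(starts_by_level: Dict[str, List[float]],
                      tolerance: float = 0.001) -> List[float]:
    """Return the chunk start times that do not occur in every level."""
    def has_start(starts: List[float], t: float) -> bool:
        return any(abs(s - t) <= tolerance for s in starts)

    all_starts = sorted({t for starts in starts_by_level.values() for t in starts})
    return [t for t in all_starts
            if not all(has_start(starts, t) for starts in starts_by_level.values())]


# Hypothetical start times: the third chunk of "high" starts half a second late.
levels = {
    "low": [0.0, 4.0, 8.0, 12.0],
    "high": [0.0, 4.0, 8.5, 12.0],
}
print(misaligned_starts(levels))
```

An empty result means every switch point lines up. A non-empty result lists the positions where a switch could produce a visible or audible glitch.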
In its most low-tech form, chunks can be stored as separate files-on-disc on a webserver. This poses issues around transcoding (no ecosystem yet) and file management (not everybody loves 100s of files). A small serverside module can easily fix these issues:
* User-agent requests for chunks are translated to byte-range requests inside a full video.
* The data is pulled from the video.
* The data is wrapped into a valid WebM file.
* (The resulting chunk is cached locally.)
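The first step of such a module could look roughly like the sketch below. It assumes the server already has a list of byte offsets at which each chunk starts inside the full video (for instance taken from the container's seek index). That list, the function names and the sample numbers are hypothetical. Wrapping the bytes into a valid WebM file and caching the result are not shown.

```python
from typing import List, Tuple


def chunk_byte_range(chunk_offsets: List[int], file_size: int,
                     chunk_index: int) -> Tuple[int, int]:
    """Inclusive (first_byte, last_byte) of one chunk inside the full video."""
    if not 0 <= chunk_index < len(chunk_offsets):
        raise IndexError("no such chunk")
    first = chunk_offsets[chunk_index]
    is_last = chunk_index == len(chunk_offsets) - 1
    # A chunk ends one byte before the next chunk starts (or at end of file).
    end = file_size if is_last else chunk_offsets[chunk_index + 1]
    return first, end - 1


def range_header(first: int, last: int) -> str:
    """Value of an HTTP Range header covering the given inclusive byte span."""
    return f"bytes={first}-{last}"


# Hypothetical index: three chunks in an 800000-byte file.
offsets = [4096, 250000, 510000]
print(range_header(*chunk_byte_range(offsets, 800000, 1)))
```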
The M3U8 manifest format that Apple specified (http://tools.ietf.org/html/draft-pantos-http-live-streaming-04) is adopted. Generally, both an overall manifest (linking to the various quality levels) and a quality level manifest (linking to the various chunks) are used. (Though, especially for live streaming, a single quality level may be used.)
Here's an example of such an overall manifest. It specifies three quality levels, each with its own datarate, codecs and dimensions:
Here's an example manifest for one such quality level. It contains a full URL listing of all chunks for this quality level:
The video framerate, audio sample frequency and number of audio channels cannot be listed here according to the specs. Both in WebM and in MPEG-TS (the container Apple specifies), this can be retrieved during demuxing.
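To illustrate what a user-agent would extract from an overall manifest, here is a minimal parsing sketch. The manifest text is a made-up illustration (the URLs, bitrates and the "vp8,vorbis" codec strings are assumptions), and the parser handles only the EXT-X-STREAM-INF tag with its BANDWIDTH, CODECS and RESOLUTION attributes. It is not a complete M3U8 implementation.

```python
import re
from typing import Dict, List

# Hypothetical overall manifest with three quality levels.
MANIFEST = """#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=300000,CODECS="vp8,vorbis",RESOLUTION=320x180
http://example.com/low/index.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=800000,CODECS="vp8,vorbis",RESOLUTION=640x360
http://example.com/medium/index.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2000000,CODECS="vp8,vorbis",RESOLUTION=1280x720
http://example.com/high/index.m3u8
"""

# KEY=value pairs; quoted values may themselves contain commas.
ATTRIBUTE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_overall_manifest(text: str) -> List[Dict[str, object]]:
    """Return one dict per quality level, in manifest order."""
    levels: List[Dict[str, object]] = []
    pending = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-STREAM-INF:"):
            pairs = ATTRIBUTE.findall(line.split(":", 1)[1])
            pending = {key: value.strip('"') for key, value in pairs}
        elif line and not line.startswith("#") and pending is not None:
            # The first URI line after the tag belongs to that level.
            level: Dict[str, object] = {
                "url": line,
                "bitrate": int(pending["BANDWIDTH"]),
                "codecs": pending.get("CODECS", ""),
            }
            if "RESOLUTION" in pending:
                width, height = pending["RESOLUTION"].split("x")
                level["width"], level["height"] = int(width), int(height)
            levels.append(level)
            pending = None
    return levels


for level in parse_overall_manifest(MANIFEST):
    print(level)
```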
The M3U8 playlist format can be used to provide *sliding windows* (for livestreaming). Additionally, regular ID3 tags can be used to enrich the manifest with metadata.
The root manifest serves as the single, unique reference point for an adaptive stream. Therefore, user-agents need only its URL to play back the stream.
Here's an example of loading a root manifest through the *src* attribute of the <video> tag in an HTML page:
<video width="480" height="270" src="http://example.com/video.m3u8">
<a href="http://example.com/video_low.webm">Download the video</a>
</video>
In this variation, the manifest is loaded through the <source> tag, to provide fallback logic:
<video width="480" height="270">
<source src="http://example.com/video.m3u8" type="video/m3u8">
<source src="http://example.com/video_low.webm" type="video/webm">
<a href="http://example.com/video_low.webm">Download the video</a>
</video>
Here's another example of loading the manifest, this time through the *enclosure* element in an RSS feed:
<description>Example feed with a single adaptive stream.</description>
<enclosure length="1487" type="video/m3u8"
Like the manifest parsing, the switching heuristics are up to the user-agent. They can be somewhat of a *secret sauce*. As a basic example, a user-agent can select a quality level if:
* The *bitrate* of the level is < 90% of the server » client *downloadRate*.
* The *videoWidth* of the level is < 120% of the video element *width*.
* The delta in *droppedFrames* is < 25% of the delta in *decodedFrames* for this level.
Since droppedFrames are only known after a level has started playing, it is generally only a reason for switching down. Based upon the growth rate of droppedFrames, a user-agent might choose to blacklist the quality level for a certain amount of time, or discard it altogether for this playback session.
The quality level selection occurs at the start of every chunk URL fetch. Given an array of levels, the user-agent starts with the highest quality level first and then walks down the list. If the lowest-quality level does not match the criteria, the user-agent still uses it (else there would be no video).
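The basic heuristic described above can be sketched as follows. This is one possible reading of the three criteria, not a reference implementation. The class, the function names, the per-level frame statistics and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Level:
    bitrate: int      # bits per second
    video_width: int  # pixels


def level_fits(level: Level, download_rate: float, element_width: int,
               dropped_delta: int = 0, decoded_delta: int = 0) -> bool:
    """Apply the three example criteria to a single level."""
    if level.bitrate >= 0.9 * download_rate:
        return False
    if level.video_width >= 1.2 * element_width:
        return False
    # Frame statistics only exist for levels that have already played.
    if decoded_delta > 0 and dropped_delta >= 0.25 * decoded_delta:
        return False
    return True


def select_level(levels: List[Level], download_rate: float, element_width: int,
                 frame_stats: Optional[Dict[int, Tuple[int, int]]] = None) -> int:
    """Walk down from the highest quality; fall back to the lowest level.

    `levels` is ordered from highest to lowest quality. `frame_stats` maps a
    level index to (dropped_delta, decoded_delta) observed for that level.
    """
    frame_stats = frame_stats or {}
    for index, level in enumerate(levels):
        dropped, decoded = frame_stats.get(index, (0, 0))
        if level_fits(level, download_rate, element_width, dropped, decoded):
            return index
    return len(levels) - 1  # nothing fits: still play the lowest level


levels = [Level(2_000_000, 1280), Level(800_000, 640), Level(300_000, 320)]
print(select_level(levels, download_rate=1_000_000, element_width=640))
```

A user-agent would run something like `select_level` before every chunk fetch. Blacklisting a level after heavy frame dropping could be layered on top by filtering the list first.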
A user-agent typically tries to maintain X (3, 10, 20) seconds of video ready for decoding (buffered). If less than X seconds is available, the user-agent runs its quality level selection and requests another chunk.
There is a tie-in between the length of a chunk, the bufferLength and the speed with which a user-agent adapts to changing conditions. For example, should the bandwidth drop dramatically, 1 or 2 high-quality chunks will still be played from buffer before the first lower-quality chunk is shown. The other way around is also true: should a user go fullscreen, it will take some time until the stream switches to high quality. Lower bufferLengths increase responsiveness but also increase the possibility of buffer underruns.
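The trade-off can be estimated with simple arithmetic. The sketch below assumes the buffer is topped up one chunk at a time whenever it falls below bufferLength, and it ignores download time. Both function names are hypothetical.

```python
import math


def buffered_chunks(buffer_length: float, chunk_duration: float) -> int:
    """Chunks needed to hold buffer_length seconds of video."""
    return math.ceil(buffer_length / chunk_duration)


def switch_delay_upper_bound(buffer_length: float, chunk_duration: float) -> float:
    """Rough upper bound (seconds) before a newly selected level is shown.

    Up to one chunk duration can pass before the next selection runs, and the
    chunk fetched then is queued behind about buffer_length seconds of video.
    """
    return buffer_length + chunk_duration


# Example: 10 seconds of target buffer, 5-second chunks.
print(buffered_chunks(10, 5), "chunks still play at the old level")
print(switch_delay_upper_bound(10, 5), "seconds at most before the switch is visible")
```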
The video element provides accessors for retrieving quality of service metrics:
* *downloadRate*: The current server-client bandwidth (read-only).
* *decodeRate*: The current level's bitrate (read-only).
* *droppedFrames*: The total number of frames dropped for this playback session (read-only).
* *decodedFrames*: The total number of frames decoded for this playback session (read-only).
* *height*: The current height of the video element (already exists).
* *videoHeight*: The current height of the videofile (already exists).
* *width*: The current width of the video element (already exists).
* *videoWidth*: The current width of the videofile (already exists).
In addition to this, the video element provides access to the stream levels:
* *currentLevel*: The currently playing stream level.
* *levels*: An array of all stream levels (as parsed from the manifests). Example:
In addition to this, the video element provides an event to notify scripts of changes in the current stream level:
* *levelChange*: the currentLevel attribute has just been updated.
Last, the video element provides functionality to override the user agent's built-in heuristics:
* *setLevel(level)*: This method forces the user-agent to switch to another stream level. Invoking this method disables a user-agent's adaptive streaming heuristics. Use *setLevel(-1)* to enable heuristics again.
* *bufferLength*: This attribute controls how much videodata (in seconds) a user-agent should strive to keep buffered.
An important example for *bufferLength*: a website owner might set this to a very high value to enable viewers on a low bandwidth to wait for buffering and still see a high-quality video.
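The intended interplay between *setLevel*, the built-in heuristics and the *levelChange* event can be modelled in a few lines. The sketch below is a plain Python model of those semantics as I read them from the descriptions above, not browser code. The class and method names are hypothetical.

```python
from typing import Callable, List


class LevelController:
    """Toy model of currentLevel, setLevel() and the levelChange event."""

    def __init__(self, level_count: int) -> None:
        self.level_count = level_count
        self.current_level = 0
        self.heuristics_enabled = True
        self._listeners: List[Callable[[int], None]] = []

    def on_level_change(self, listener: Callable[[int], None]) -> None:
        self._listeners.append(listener)

    def set_level(self, level: int) -> None:
        """Script override: a level index forces it, -1 restores heuristics."""
        if level == -1:
            self.heuristics_enabled = True
            return
        if not 0 <= level < self.level_count:
            raise ValueError("unknown stream level")
        self.heuristics_enabled = False
        self._switch(level)

    def heuristic_pick(self, level: int) -> None:
        """Choice made by built-in heuristics; ignored while overridden."""
        if self.heuristics_enabled:
            self._switch(level)

    def _switch(self, level: int) -> None:
        if level != self.current_level:
            self.current_level = level
            for listener in self._listeners:
                listener(level)  # models the levelChange event


events: List[int] = []
controller = LevelController(3)
controller.on_level_change(events.append)
controller.set_level(2)       # script forces level 2
controller.heuristic_pick(0)  # ignored: heuristics are disabled
controller.set_level(-1)      # hand control back to the user-agent
controller.heuristic_pick(0)  # now takes effect
print(controller.current_level, events)
```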
Finally, some rationale for the choices made in this proposal. Why chunks and a manifest? Why not, for example, range-requests and <source> tags?
First and foremost, we need a format that works not only in HTML5 browsers, but also in, for example, mobile apps (Android/Blackberry/iOS), desktop players (Miro/Quicktime/VLC) and big screen devices (Roku, Boxee, PS3). Especially for the very small screens (3G network) and large screens (full HD), adaptive streaming is incredibly valuable. Tailoring a solution too much towards the HTML5 syntax and browser environment will hinder broad adoption of an open video standard. Adaptive streaming and HTML5 should work nicely together, but adaptive streaming should not rely on HTML5.
* Providing the low-tech scenario of storing chunks as separate files on the webserver enables adaptive streaming in cases where either the server, the user-agent (apps / players / settops) or the network (firewalls, cellulars) does not support something like range-requests. As an example, implementing adaptive streaming using range-requests in Adobe Flash (e.g. as temporary fallback) would not be possible, since the range-request header is blocked.
* Ecosystem partners (CDNs, encoding providers, landmark publishers, etc) are already getting used to and building tools around the concept of *chunked* video streams. Examples are log aggregators that roll up chunk servings into a single logline, or encoders that simultaneously build multiple stream levels, chunk them up and render their manifests.