[foms] Proposal: adaptive streaming using open codecs
philipj at opera.com
Tue Oct 19 08:53:49 PDT 2010
On Mon, 18 Oct 2010 13:58:19 +0200, Jeroen Wijering
<jeroen at longtailvideo.com> wrote:
> Hello all,
> Here is a (rough and incomplete) proposal for doing adaptive streaming
> using open video formats. WebM is used as an example, but all points
> should apply to Ogg as well. Key components are:
> * Videos are served as separate, small chunks.
> * Accompanying manifest files provide metadata.
> * The user-agent parses manifests and switches between stream levels.
> * An API provides QOS metrics and enables custom switching logic.
> What do you think of this approach - and its rationale? Any technical
> issues (especially on the container side) or non-technical objections?
> Kind regards,
Thanks for writing this up, Jeroen. Before going into inline replies, I
want to state the problem with chunking on a lower level. We have two
blobs of audio/video data which we want to play back-to-back gapless. From
the point of view of a decoding pipeline, there are basically two options:
1. Treat everything as an infinite stream in a single decoding pipeline,
and have the demuxer handle chained Ogg or chained WebM.
2. Have each chunk be its own finite resource and set up a decoding
pipeline for each one, having a super-pipeline coordinating those and
handling audio mixing.
I believe that option 1 is a lot easier to integrate with existing media
frameworks, while option 2 adds a lot of complexity. Opera doesn't only
have to worry about working with GStreamer, but also about hardware
devices with their own media stacks, where we can't easily fix things.
Going with option 1, we basically add the constraint that all chunks must
use the same container format and that container format must be streamable
and chainable. This is true of Ogg and can be made true for WebM. It's
slightly less general, but probably a tradeoff worth making.
Before building everything into the browser, I'd really prefer to provide
More on that below...
> Every chunk should be a valid video file (header, videotrack,
> audiotrack). Every chunk should also contain at least 1 keyframe (at the
> start). This implies every single chunk can be played back by itself.
> Beyond validity, the amount of metadata should be kept as small as
> possible (single-digit kbps overhead).
> Codec parameters that can vary between the different quality levels of
> an adaptive stream are:
> * The datarate, dimensions (pixel+display) and framerate of the video
> * The datarate, number of channels and sample frequency of the audio
> In order for quality level switches to occur without artifacts, the
> start positions of all chunks should align between the various quality
> levels. If this isn't the case, user-agents will display artifacts
> (ticks, skips, black) when a quality level switch occurs. Syncing should
> not be a requirement though. This will allow legacy content to be used
> for dynamic streaming with little effort (e.g. remuxing or using a smart
> server) and few issues (in practice, most keyframes are aligned
> between different transcodes of a video).
> In its most low-tech form, chunks can be stored as separate
> files-on-disc on a webserver. This poses issues around transcoding (no
> ecosystem yet) and file management (not everybody loves 100s of files).
> A small serverside module can easily fix these issues:
> * User-agent requests for chunks are translated to byte-range requests
> inside a full video.
> * The data is pulled from the video.
> * The data is wrapped into a valid WebM file.
> * (The resulting chunk is cached locally.)
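To make the first two steps of that server-side module concrete, here is a minimal sketch of mapping a chunk number to a byte range inside the full video. It assumes the server already has an index of the byte offsets where each chunk's leading keyframe starts; the names chunk_byte_range, range_header and keyframe_offsets are made up for illustration, and wrapping the bytes into a valid WebM file is not shown.

```python
def chunk_byte_range(keyframe_offsets, file_size, chunk_index):
    """Return the inclusive (first, last) byte range of one chunk.

    keyframe_offsets is an assumed, pre-built index: the sorted byte
    offsets at which each chunk (starting with a keyframe) begins.
    """
    if not 0 <= chunk_index < len(keyframe_offsets):
        raise IndexError("no such chunk")
    first = keyframe_offsets[chunk_index]
    if chunk_index + 1 < len(keyframe_offsets):
        # The chunk ends one byte before the next chunk starts.
        last = keyframe_offsets[chunk_index + 1] - 1
    else:
        # The last chunk runs to the end of the file.
        last = file_size - 1
    return first, last


def range_header(first, last):
    """Format the range as the value of an HTTP Range request header."""
    return f"bytes={first}-{last}"
```

For example, with chunks starting at offsets 4096, 100000 and 250000 in a 400000-byte file, chunk 0 maps to bytes 4096 through 99999.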
If the decoder is unaware that there are chunks to begin with, the
requirements don't need to be this strict. It would just as well be
possible to have bare slices of data and make sure that a full header is
provided only when switching streams. Basically, it would be up to the web
author, but the above would work if we treat it as chained WebM.
> The M3U8 manifest format that Apple specified
> (http://tools.ietf.org/html/draft-pantos-http-live-streaming-04) is
> adopted. Generally, both an overall manifest (linking to the various
> quality levels) and a quality level manifest (linking to the various
> stream levels) are used. (Though, especially for live streaming, a
> single quality level may be used).
> Here's an example of such an overall manifest. It specifies three
> quality levels, each with its own datarate, codecs and dimensions:
> Here's an example manifest for one such quality level. It contains a
> full URL listing of all chunks for this quality level:
> The video framerate, audio sample frequency and number of audio channels
> cannot be listed here according to the specs. Both in WebM and in
> MPEG-TS (the container Apple specifies), this can be retrieved during
> The M3U8 playlist format can be used to provide *sliding windows* (for
> livestreaming). Additionally, regular ID3 tags can be used to enrich the
> manifest with metadata.
The constant re-fetching of a manifest is quite unappealing and not
something I'd be very happy to build in as a part of <video>. This is
quite an easy problem to solve, and I'd be happy to let developers roll
their own, perhaps one of:
* ever-increasing number
* JSON manifest
* URL of next chunk being delivered via WebSockets (in the future, the
data itself could be as well, but that's certainly not for dumb servers)
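As a rough illustration of the "ever-increasing number" option, a script could derive chunk URLs from a template without fetching any manifest. The URL template and the chunk_urls name are assumptions for the sake of the example, not part of any proposal.

```python
from itertools import count, islice


def chunk_urls(template, start=0):
    """Yield chunk URLs by substituting an ever-increasing number.

    template is a hypothetical URL pattern containing an {n} placeholder.
    """
    for n in count(start):
        yield template.format(n=n)


# Take the next three chunk URLs of a (made-up) live stream.
next_three = list(islice(chunk_urls("http://example.com/live/chunk_{n}.webm", 41), 3))
```

The player would simply keep requesting the next URL; how it learns that the stream has ended is left open here.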
> The root manifest serves as the single, unique reference point for an
> adaptive stream. Therefore, user agents need solely its URL to play back
> the stream.
> Here's an example for loading a root manifest: through the *src*
> attribute of the <video> tag in an HTML page:
> <video width="480" height="270" src="http://example.com/video.m3u8">
> <a href="http://example.com/video_low.webm">Download the video</a>
> </video>
> In this variation, the manifest is loaded through the <source> tag, to
> provide fallback logic:
> <video width="480" height="270" >
> <source src="http://example.com/video.m3u8" type="video/m3u8">
> <source src="http://example.com/video_low.webm" type="video/webm">
> <a href="http://example.com/video_low.webm">Download the video</a>
> </video>
> Here's another example for loading the manifest; through the *enclosure*
> element in an RSS feed:
> <rss version="2.0">
> <title>Example feed</title>
> <description>Example feed with a single adaptive
> <title>Example stream</title>
> <enclosure length="1487" type="video/m3u8"
> url="http://example.com/video.m3u8" />
> Like the manifest parsing, the switching heuristics are up to the
> user-agent. They can be somewhat of a *secret sauce*. As a basic
> example, a user-agent can select a quality level if:
> * The *bitrate* of the level is < 90% of the server » client
> * The *videoWidth* of the level is < 120% of the video element *width*.
> * The delta in *droppedFrames* is < 25% of the delta in *decodedFrames*
> for this level.
> Since droppedFrames are only known after a level has started playing, it
> is generally only a reason for switching down. Based upon the growth
> rate of droppedFrames, a user-agent might choose to blacklist the
> quality level for a certain amount of time, or discard it altogether for
> this playback session.
> The quality level selection occurs at the start of every chunk URL
> fetch. Given an array of levels, the user-agent starts with the highest
> quality level first and then walks down the list. If the lowest-quality
> level does not match the criteria, the user-agent still uses it (else
> there would be no video).
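The selection walk described above could be sketched roughly as follows. This is only one reading of the criteria: it assumes the truncated first criterion compares the level's bitrate against the measured download rate, and the Level class, its field names and select_level are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Level:
    bitrate: int                # bits per second
    width: int                  # videoWidth of this level, in pixels
    dropped_ratio: float = 0.0  # delta droppedFrames / delta decodedFrames
                                # observed when this level last played


def select_level(levels, download_rate, element_width):
    """Walk from highest to lowest quality and return the first match.

    levels must be non-empty. Falls back to the lowest-quality level if
    nothing matches, since otherwise there would be no video at all.
    """
    ordered = sorted(levels, key=lambda level: level.bitrate, reverse=True)
    for level in ordered:
        if (level.bitrate < 0.9 * download_rate        # assumed: vs. download rate
                and level.width < 1.2 * element_width
                and level.dropped_ratio < 0.25):
            return level
    return ordered[-1]
```

With levels at 100, 500 and 1500 kbps and a measured rate of 800 kbps in a 640-pixel-wide element, this picks the 500 kbps level.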
> A user-agent typically tries to maintain X (3, 10, 20) seconds of video
> ready for decoding (buffered). If less than X seconds is available, the
> user-agent runs its quality level selection and requests another chunk.
> There is a tie-in between the length of a chunk, the bufferLength and
> the speed with which a user-agent adapts to changing conditions. For
> example, should the bandwidth drop dramatically, 1 or 2 high-quality
> chunks will still be played from buffer before the first lower-quality
> chunk is shown. The other way around is also true: should a user go
> fullscreen, it will take some time until the stream switches to high
> quality. Lower bufferLengths increase responsiveness but also increase
> the possibility of buffer underruns.
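The tie-in between chunk length and buffer length can be put into numbers with a back-of-the-envelope sketch. It assumes chunks of equal duration and a buffer that is full when conditions change; both function names are made up for illustration.

```python
import math


def should_request_chunk(buffered_seconds, buffer_length):
    """Request another chunk once less than the target is buffered."""
    return buffered_seconds < buffer_length


def chunks_played_before_switch(buffer_length, chunk_length):
    """Rough estimate of how many already-buffered chunks still play
    before a chunk from a newly selected level becomes visible."""
    return math.ceil(buffer_length / chunk_length)
```

With 10-second chunks, a 20-second buffer means about two old-quality chunks play out before the switch shows, while a 3-second buffer reduces that to one at the cost of a higher underrun risk.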
This does of course increase the likelihood of everything breaking due to
something like this might make its way into the browser.
> Certain user-agents might not offer access to adaptive streaming
> heuristics. Other user-agents might, or should even do so. The obvious
> The video element provides accessors for retrieving quality of service
> * *downloadRate*: The current server-client bandwidth (read-only).
This is already available in an obscure form in some browsers via the
buffered attribute. If a lot of people need it, we could expose it of
course, but then preferably as a seconds/second metric, to match the rest
of the API.
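To illustrate what a seconds/second metric could look like, here is a minimal sketch that derives it from periodic samples of the end of the buffered range. The sampling scheme and the download_rate_seconds name are assumptions; this is not how any particular browser computes it.

```python
def download_rate_seconds(samples):
    """Seconds of media downloaded per second of wall-clock time.

    samples is a list of (wall_clock_time, buffered_end) pairs, both in
    seconds, as a script might collect by polling the buffered ranges.
    """
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need samples spanning a positive time interval")
    return (b1 - b0) / (t1 - t0)
```

A value above 1.0 means media is arriving faster than real time; a value below 1.0 means playback will eventually stall at this quality level.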
> * *decodeRate*: The current level's bitrate (read-only).
What's this? The number of frames already decoded but not yet rendered?
> * *droppedFrames*: The total number of frames dropped for this playback
> session (read-only).
> * *decodedFrames*: The total number of frames decoded for this playback
> session (read-only).
Yep, what Firefox has. Is this the metric you prefer? My guess is that
you'd be more interested in the performance around now (a window of X
seconds) than globally, especially when the video stream has switched from
low to high quality or vice versa.
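A windowed figure can be derived in script from the cumulative counters, roughly like this. The sketch assumes the page samples the two totals periodically; DroppedFrameWindow and its methods are hypothetical names.

```python
from collections import deque


class DroppedFrameWindow:
    """Dropped-frame ratio over roughly the last window_seconds seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (time, decoded_total, dropped_total)

    def add(self, time, decoded_total, dropped_total):
        self.samples.append((time, decoded_total, dropped_total))
        # Drop old samples, but keep the newest one at or before the
        # window start so it can serve as the baseline.
        while (len(self.samples) > 1
               and self.samples[1][0] <= time - self.window_seconds):
            self.samples.popleft()

    def ratio(self):
        if len(self.samples) < 2:
            return 0.0
        _, decoded_start, dropped_start = self.samples[0]
        _, decoded_end, dropped_end = self.samples[-1]
        decoded = decoded_end - decoded_start
        return (dropped_end - dropped_start) / decoded if decoded > 0 else 0.0
```

If 50 of the last 125 decoded frames were dropped after a clean first 125, the window reports 40% where the global counters would report only 20%.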
> * *height*: The current height of the video element (already exists).
> * *videoHeight*: The current height of the videofile (already exists).
> * *width*: The current width of the video element (already exists).
> * *videoWidth*: The current width of the videofile (already exists).
> In addition to this, the video element provides access to the stream
> * *currentLevel*: The currently playing stream level.
> * *levels*: An array of all stream levels (as parsed from the
> manifests). Example:
> [{
> bitrate: 100000,
> codecs: 'vp8,vorbis',
> duration: 132,
> height: 180,
> url: 'manifest_100.m3u8',
> width: 240
> }, {
> bitrate: 500000,
> codecs: 'vp8,vorbis',
> duration: 132,
> height: 360,
> url: 'manifest_500.m3u8',
> width: 640
> }]
> In addition to this, the video element provides an event to notify
> scripts of changes in the current stream level:
> * *levelChange*: the currentLevel attribute has just been updated.
> Last, the video element provides functionality to override the user
> agent's built-in heuristics:
> * *setLevel(level)*: This method forces the user-agent to switch to another
> stream level. Invoking this method disables a user-agent's adaptive
> streaming heuristics. Use *setLevel(-1)* to enable heuristics again.
So, this is where I'm not entirely supportive. Keeping track of several
different streams in the same <video> element becomes a bit messy, as the
state of HTMLMediaElement then becomes a bit weird. How would one
interpret the buffered ranges, videoWidth, videoHeight, etc, when these
will be different for the different streams? Letting the video element
pretend that there's just a single infinite stream would be simpler, in
> * *bufferLength*: This attribute controls how much videodata (in
> seconds) a user-agent should strive to keep buffered.
> An important example for *bufferLength*: a website owner might set this
> to a very high value to enable viewers on a low bandwidth to wait for
> buffering and still see a high-quality video.
Right, this would be useful in general as well, I think, and it's a magic
constant which exists somewhere inside the browser anyway if it tries to
conserve bandwidth at all.
> Finally, some rationale for the choices made in this proposal. Why
> chunks and a manifest? Why not, for example, range-requests and <source>
> First and foremost, we need a format that works not only in HTML5
> browsers, but also in, for example, mobile apps
> (Android/Blackberry/iOS), desktop players (Miro/Quicktime/VLC) and big
> screen devices (Roku, Boxee, PS3). Especially for the very small screens
> (3G network) and large screens (full HD), adaptive streaming is
> incredibly valuable. Tailoring a solution too much towards the HTML5
> syntax and browser environment will hinder broad adoption of an open
> video standard. Adaptive streaming and HTML5 should work nicely together,
> but adaptive streaming should not rely on HTML5.
> That said:
> * Providing the low-tech scenario of storing chunks as separate files on
> the webserver enables adaptive streaming in cases where either the
> server, the user-agent (apps / players / settops) or the network
> (firewalls, cellular networks) does not support something like range-requests.
> As an example, implementing adaptive streaming using range-requests in
> Adobe Flash (e.g. as temporary fallback) would not be possible, since
> the range-request header is blocked.
Have you seen this problem a lot? As you know, all browsers implementing
<video> use range requests for seeking. So far, I haven't seen any
problems reported with it. That's not to say that there are no problems,
it's just that there's not a lot of <video> content out there yet.
> * Ecosystem partners (CDNs, encoding providers, landmark publishers,
> etc) are already getting used to and building tools around the concept
> of *chunked* video streams. Examples are log aggregators that roll up
> chunk servings into a single logline, or encoders that simultaneously
> build multiple stream levels, chunk them up and render their manifests.
> * With just the QOS metrics (*downloadRate* and *decodedFrames*) in
> place, it will be possible to build adaptive-streaming-like solutions
> is supported (and very popular) within both Flowplayer and JW Player.
> True adaptive streaming (continuous switching without buffering) won't be
> possible, but the experience is good enough to suit people that don't
> have the encoder or browser (yet) to build or playback adaptive streams.
Having said "no" to so much, I should contribute something positive as
well... Apart from disagreeing on how much should go into the browser, I
think we all agree that the lower-level building blocks *should* go into
What I'm proposing is that the lower-level API be one that allows multiple
URLs to be treated as a single stream of bytes from the demuxer's
perspective. The Stream API certainly has a suitable name for it, so
perhaps it could be hijacked for this purpose.
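To show the shape of the idea, not an actual API, here is a minimal sketch in which several chunk URLs are presented to a consumer as one continuous byte stream. The fetch callable is a stand-in for whatever does the HTTP request, and chained_stream is a made-up name; this is not the Stream API itself.

```python
from typing import Callable, Iterable, Iterator


def chained_stream(urls: Iterable[str],
                   fetch: Callable[[str], bytes],
                   block_size: int = 65536) -> Iterator[bytes]:
    """Yield the bytes of each URL back-to-back, in fixed-size blocks.

    The consumer (standing in for the demuxer) never sees chunk
    boundaries, only a single sequence of bytes.
    """
    for url in urls:
        data = fetch(url)  # hypothetical: returns the full chunk body
        for start in range(0, len(data), block_size):
            yield data[start:start + block_size]


# Usage with an in-memory stand-in for the network.
fake_server = {"chunk_a": b"AAAA", "chunk_b": b"BB"}
blocks = list(chained_stream(["chunk_a", "chunk_b"],
                             fake_server.__getitem__, block_size=3))
```

Because urls can be any iterable, including a generator, the script stays free to decide the next URL (and thus the quality level) as late as it likes.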