[foms] Proposal: adaptive streaming using open codecs

Pierre-Yves KEREMBELLEC pierre-yves.kerembellec at dailymotion.com
Fri Nov 19 03:56:17 PST 2010

That said, I think the best compromise would be for browsers to accept both interleaved and elementary streams: when there's only one
video track and one audio track (which probably covers 99% of the video content out there), it makes sense not to add complexity for
publishers by asking them to demux their videos (some probably don't know exactly what video internals are about anyway).
Having to support 2 ways of doing the same thing (and from the same standard) is not a good thing. And even though you have 1 audio track
and 1 video track, we're talking about adaptive streaming. That means multiple versions of these same tracks cut into many small chunks.
So whether the chunks are for 1 track or many doesn't make much difference in the end. Also, we're talking about something to support
the web in general, in an open way. That means people should be able to create their streams, put them on *any* website, and have them work. So
those chunks, in general, have to be in different files (unlike the specialized server you talked about earlier, which is a local (and good) optimisation).

My understanding was that the demuxing job was already performed in most media frameworks anyway, i.e. the container was stripped out and
A/V samples were extracted and passed separately to the relevant decoder. What I'm proposing is just this: accepting a stream with audio and
video muxed, or accepting 2 separate streams (one audio and one video). For instance:

- WebM + VP8 + Vorbis
- WebM + VP8 only
- WebM + Vorbis only

In any case, each stream is playable independently as indicated earlier. On top of that, we may split a stream in several chunks (physically or not),
but that's an additional layer.

Any website that supports range requests can handle non-chunked files.  A requirement that a new adaptive streaming model work on *any*
website is excessive.

But without chunks, you probably miss some caching opportunities, and you still get those nasty misconfigured proxy-cache problems.

- either interleaved or elementary streams sent from the server side
This is fine for me: I'm not arguing that interleaved streams should not be supported, just that separate elementary streams should.

My point above exactly. Supporting both muxed and elementary streams, with the addition of being able to "inject" them at the JS API level
would be a tremendous leap forward for adaptive streaming support in browsers.

- multiple versions/qualities for the same content advertised using a "main" manifest file (not tied to the HTML markup because we
 probably want this to work outside browsers)
I like what you proposed, but I think we should also support an HTML-markup-based manifest. This would make the DOM structure for stream
selection parallel a DOM-based tag representation, so the various streams could be queried in a way that's consistent with existing HTML5
video tag API conventions.

I'm not sure about this one. I'd compare it to media URLs inside external CSS files.

Also, a lot of players are moving towards what could be called an "iframe-based manifest" that encapsulates everything to do with the player
interface and its associated assets. Having an HTML markup representation would mean one less round trip to the server for quick
start playback.

Agreed. But we are talking about big media files anyway, so it shouldn't be a huge problem (compared to the "probing dance" that most
implementations are doing today before playback actually starts).

- multiple audio/video/<you-name-it> tracks for the same content also advertised in this main manifest file

- main manifest files may refer to streams directly, or to "playlist" manifest files (in case the publisher willingly chooses to use fragments)

- playlist manifest files list all fragments with precise timestamps and duration (i.e. not "a-la-Apple-M3U8")
Even if you use separate chunks, you don't necessarily have to list them explicitly. There's usually a consistent
naming scheme, so a simple template approach can save you having big playlists.
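Such a template approach could be as simple as the following sketch (the `$index$` placeholder syntax, the URL and the durations are invented for illustration; it assumes a constant nominal chunk duration):

```python
# Hypothetical chunk-URL template expansion: given a template, a nominal
# chunk duration and the total duration, derive the URL and nominal start
# time of every chunk without listing them in the manifest.

def expand_template(template: str, chunk_duration: float, total_duration: float):
    """Yield (url, nominal_start) pairs for every chunk."""
    index = 0
    start = 0.0
    while start < total_duration:
        yield template.replace("$index$", str(index)), start
        index += 1
        start = index * chunk_duration

chunks = list(expand_template(
    "http://cdn.example.com/video_720p_$index$.webm", 10.0, 35.0))
# 4 chunks, indices 0..3, nominal starts 0, 10, 20, 30 seconds
```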

Agreed, but I still don't understand how you determine the starting time and duration for each chunk in this case.

In this case you can still put everything into one manifest (no need for a "main" one pointing to "playlist" ones).
And you don't really need precise timing in the manifest.

Mmmm, not so sure about that last point.

Approximate timing is sufficient as long as you have precise timing in the files.

Yes, but I don't want to have to read the files to know the precise media mapping: the player would ideally know exactly which chunk
maps to which timerange right from the beginning. Anyway, I think we should have both modes: naming template and extensive list
(I think this is what you proposed in a post later on).

For example, perhaps all files are 10s long according to the manifest, but the real boundaries are put at the nearest RAP to a multiple of 10s.

The error margin may be quite high, especially when using VBR encoding and keyframes that are not equally spaced in time, don't you think?
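To make the drift being discussed concrete, here is a small sketch: nominal 10 s boundaries are snapped to the nearest random access point (RAP), and with irregular keyframe spacing the error can indeed be significant. The RAP timestamps are made up for illustration:

```python
# Snap each nominal chunk boundary to the closest actual RAP and
# measure the worst-case drift between nominal and real boundaries.

def snap_to_raps(nominal_boundaries, rap_times):
    """Map each nominal boundary to the closest actual RAP timestamp."""
    return [min(rap_times, key=lambda t: abs(t - b)) for b in nominal_boundaries]

raps = [0.0, 4.2, 8.1, 13.7, 18.9, 24.5, 31.0]   # irregular keyframe spacing
nominal = [10.0, 20.0, 30.0]                      # "every 10 s" per the manifest
actual = snap_to_raps(nominal, raps)              # [8.1, 18.9, 31.0]
max_drift = max(abs(a - n) for a, n in zip(actual, nominal))  # ~1.9 s
```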

- JSON for all manifest files (as it's easy to parse on any platform and heavily extensible)
I disagree. Most software and hardware media players don't have any JSON-related code already. How is JSON so much better
than XML? Many playlists are already XML-based; I don't see the need to impose yet another format on them.

I didn't say media players had JSON-related code already, just that it's easy (and cheap as far as CPU is concerned) to parse JSON,
and that simple libraries exist on virtually any platform for any language (http://json.org/). Also, JSON is a native format in JS, so this
would make a lot of sense in browsers, if for instance the manifest files were to be downloaded through WebSockets and parsed at
the application level.
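As an illustration of the "cheap to parse anywhere" argument, a JSON manifest can be handled with nothing but a standard library (the manifest fields, URLs and selection logic below are entirely hypothetical):

```python
# Parse a hypothetical JSON manifest and pick the best video stream
# that fits a measured bandwidth budget, using only the stdlib.
import json

manifest_text = """
{
  "duration": 120.0,
  "streams": [
    {"type": "video", "codec": "vp8", "bandwidth": 1500000,
     "url": "http://example.com/video_hi.webm"},
    {"type": "video", "codec": "vp8", "bandwidth": 500000,
     "url": "http://example.com/video_lo.webm"},
    {"type": "audio", "codec": "vorbis", "bandwidth": 128000,
     "url": "http://example.com/audio.webm"}
  ]
}
"""

manifest = json.loads(manifest_text)
budget = 1_000_000  # bits per second, e.g. from a throughput estimate
candidates = [s for s in manifest["streams"]
              if s["type"] == "video" and s["bandwidth"] <= budget]
best = max(candidates, key=lambda s: s["bandwidth"])  # the 500 kbit/s stream
```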

XML and JSON are both good candidates for hierarchical (multi-level) representation of manifests; M3U8 really isn't. This is also why
I'm not really comfortable with the "The M3U8 manifest format that Apple specified is adopted" statement at http://wiki.whatwg.org/wiki/Adaptive_Streaming.

What is more important than the specific syntax (JSON, XML, M3U8 etc.) is the data model or abstract syntax. Once you have that you can map pretty
easily to any specific syntax. It would be good to discuss what is needed at that level first.


Roughly, something like the following:
- some notion of <track> which is composed of one or more <stream>s (the specific terms can be changed).
- A <track> is a single media type or an interleaved combination of multiple types. The <stream>s it contains are different encodings of the exact same
source media (so different audio languages are different <track>s). If the <track> contains multiple media types, every <stream> has all those media
types interleaved, i.e. within a <track> either all <stream>s contain interleaved media or none do. In the second case, all
the <stream>s in the track contain the same single media type.
- a way to annotate both <track>s and <stream>s with their properties for selection purposes: file format, media type(s), codecs/profiles, language,
video resolution, pixel aspect ratio, frame rate, bandwidth, accessibility type (e.g. audio description of video, sign language) or other track type info
(e.g. director's commentary). (Maybe this is too many, but annotations like this are cheap: clients just ignore annotations they do not understand.)
If all the <stream>s in a track have the same value for an annotation, then you can annotate the <track>; otherwise annotate the <stream>s (that is just an optimization).
- access information for each <stream>. EITHER
(i) a URL for a single file containing the whole stream, including stream headers and an index, OR
(ii) a URL for the stream headers and a list of URLs for the chunks and timing information for the chunks (could be constant chunk size)

Yes, supporting both modes would be just great.

By stream headers I mean initialization information that applies to the whole stream.
Some additional details:
- we've discussed various ways that chunks (however obtained) are aligned/can be concatenated or not. Additional <track> annotations are needed (IMO)
to tell the player what properties the streams have in terms of chunk alignment, RAP positions etc. (Compatibility in terms of codecs etc. should be clear
from the annotations)
- you might want to use templates instead of long URL lists (as per another mail thread). If you do use long URL lists, you might want to store them in
separate files ("submanifests").
- wherever a URL appears, it's useful (and for our service essential) to be able to provide a list of alternative URLs (in the same way DNS provides a list of
alternative IP addresses). We use this for redundancy across CDNs.
- how you find the headers and index in case (i) and their format may be file format dependent.
- if the group wants to choose a single option between (i) and (ii) then I would obviously recommend (i). But you might want to include both options.
- content protection is a whole big topic which may not be of so much concern to this group. But the manifest aspects can be handled easily with additional annotations.
If there is support for this kind of "data model" approach, then I'd be happy to write up a more detailed proposal based on whatever discussion is triggered by the above.
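One possible JSON rendering of this data model, purely for illustration (all field names, values and URLs are invented; note the list of alternative URLs per stream, and a <track>-level annotation indicating RAP alignment, as suggested above):

```json
{
  "tracks": [
    {
      "type": "video", "format": "webm", "codec": "vp8",
      "aligned_raps": true,
      "streams": [
        {"bandwidth": 2000000, "resolution": "1280x720",
         "urls": ["http://cdn-a.example.com/720p.webm",
                  "http://cdn-b.example.com/720p.webm"]},
        {"bandwidth": 600000, "resolution": "640x360",
         "urls": ["http://cdn-a.example.com/360p.webm"]}
      ]
    },
    {
      "type": "audio", "format": "webm", "codec": "vorbis", "language": "en",
      "streams": [
        {"bandwidth": 128000,
         "urls": ["http://cdn-a.example.com/audio_en.webm"]}
      ]
    }
  ]
}
```

The same structure maps just as easily to XML, which is the point being made about the abstract data model mattering more than the concrete syntax.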

Very interesting indeed, but aren't we about to re-invent SMIL all the way down here? ;-)

- all interleaved and elementary streams - chunked or not - are independently playable (and start with a RAP for video)
Did you mean that individual chunks are independently playable, or that the concatenation of the chunks should be
independently playable?

The former.

In most formats there are file/codec headers that you don't want to repeat in every chunk (seems to be true for mp4 and WebM).

Agreed for MP4, but for "streamable" containers (like MPEG-TS or FLV), the overhead is really minimal for most codecs.
And it would probably ease implementation (and debugging) to do it this way.

- the chosen container is easily "streamable", and requires minimal header overhead to pass codec initialization info
I think formally what you want is that the headers contain only information which is about the whole stream (in time).
Information which is about some part (in time) of the stream should be distributed at appropriate points in the file.

Correct, unless you have the playlist manifest that "maps" all the chunks, even if the PCR is reset in every chunk (PTS/DTS
should be 0-based in every chunk anyway).

I agree, even though that may make Vorbis a little less competitive (8 KB codec initialization).

8 KB?! What exactly is being passed in 8 KB for an audio codec initialization sequence?
(By comparison, the initialization sequence for AAC is only 2 bytes.)
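For context on that 2-byte figure: the common AAC initialization sequence is the AudioSpecificConfig, which in its simplest form packs the object type, sampling-frequency index and channel configuration into 16 bits. A minimal decoding sketch (the byte values are just the classic AAC-LC/44.1 kHz/stereo example):

```python
# Decode the 2-byte AAC AudioSpecificConfig: 5 bits audio object type,
# 4 bits sampling-frequency index, 4 bits channel configuration
# (ignoring the extended/escape cases for this simple sketch).

SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                24000, 22050, 16000, 12000, 11025, 8000, 7350]

def parse_audio_specific_config(data: bytes):
    bits = int.from_bytes(data[:2], "big")
    object_type = bits >> 11          # top 5 bits (2 = AAC-LC)
    freq_index = (bits >> 7) & 0x0F   # next 4 bits, indexes SAMPLE_RATES
    channels = (bits >> 3) & 0x0F     # next 4 bits
    return object_type, SAMPLE_RATES[freq_index], channels

parse_audio_specific_config(b"\x12\x10")   # (2, 44100, 2)
```

Vorbis, by contrast, ships its full codebook setup in the initialization headers, which is where the several-kilobyte figure comes from.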

- RAP may not be aligned between versions (or at least it shouldn't be a strong requirement even if in practice it's often
 the case), thus end-user experience with no-glitch stream-switching would depend on renderer and double-decoding/buffering
 pipeline capabilities
I believe it is very unlikely that many kinds of devices will support the kind of double decoding needed to splice at arbitrary points
without any alignment between versions. So, I think it's important that the manifest at least indicate whether the fragmentation
has the nice alignment and RAP position properties needed to enable seamless switching without such media pipeline enhancements.
Devices without that capability can then choose whether to switch with glitches, or not switch at all. Providers can decide whether
they want to prepare content which switches seamlessly on all devices or just on some. We defined the necessary alignment properties
quite precisely in MPEG and included such an indicator.

Agreed. I'm just saying that aligning RAP shouldn't be a pre-requisite.

A crossfader will be necessary for audio to avoid any glitch when switching streams. That's (a little) more buffering needed.
By the way, we should also keep in mind that at some point this audio/video transmission may end up going both ways (a la
Chatroulette), so we should keep the client/server as simple as possible.

This is what Apple is doing in their implementation, but it may prove quite complex to do the same for most browser vendors.

Requiring clients to maintain multiple decode pipelines, splice video, and crossfade audio is excessive and unnecessary complexity
that few devices will be able to support.

Yes. But at the same time, this is not necessarily a reason for pushing an "aligned-RAP-among-all-versions-of-the-same-content" pre-requisite
in the upcoming proposal.

