[foms] Proposal: adaptive streaming using open codecs

Mark Watson watsonm at netflix.com
Fri Nov 19 09:37:26 PST 2010

On Nov 19, 2010, at 3:56 AM, Pierre-Yves KEREMBELLEC wrote:

Approximate timing is sufficient as long as you have precise timing in the files.

Yes, but I don't want to have to read the files to know the precise media mapping: the player would ideally know exactly which chunk
maps to which time range right from the beginning. Anyway, I think we should have both modes: naming template and exhaustive list
(I think this is what you proposed in a later post).
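To make the two manifest modes concrete, here is a minimal sketch (all file names and durations are hypothetical): a naming template the player expands arithmetically, versus an explicit list that pins each chunk to an exact time range.

```python
# Sketch of the two manifest modes: template expansion gives only
# approximate timing; an exhaustive list states the precise mapping.

def chunks_from_template(template, chunk_duration, total_duration):
    """Expand a naming template into (url, approx_start, approx_end) tuples.
    Timing is approximate: real boundaries sit at the nearest RAP."""
    chunks = []
    index, start = 0, 0.0
    while start < total_duration:
        end = min(start + chunk_duration, total_duration)
        chunks.append((template.format(index=index), start, end))
        index += 1
        start = end
    return chunks

# Template mode: timing inferred, may be off by up to a keyframe interval.
approx = chunks_from_template("video_500k_{index:05d}.ts", 10.0, 35.0)

# Explicit-list mode: the manifest states the precise mapping up front.
exact = [
    ("video_500k_00000.ts", 0.0, 10.24),
    ("video_500k_00001.ts", 10.24, 19.87),
    ("video_500k_00002.ts", 19.87, 30.01),
    ("video_500k_00003.ts", 30.01, 35.0),
]
```

In practice a manifest could carry the template for compactness and fall back to the exhaustive list when precise seeking matters.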

For example, perhaps all files are 10s long according to the manifest, but the real boundaries are put at the nearest RAP to a multiple of 10s.

The error margin may be quite high, especially when using VBR encoding with keyframes that are not equally spaced in time, don't you think?

Yes, the error could be as big as the maximum keyframe spacing.

I guess I am assuming the approach where the chunks are aligned across versions. When seeking, the error is not such a big deal. When switching, you know you have chunk alignment. Or if you follow the Apple approach instead, then you always have some overlap in downloading and searching for a switch point anyway.
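The boundary error being discussed can be sketched as follows (RAP timestamps here are hypothetical): chunks are nominally 10 s, but each real boundary is the RAP nearest a multiple of 10 s, so the deviation from nominal is bounded by the keyframe spacing.

```python
# Place each chunk boundary at the RAP closest to a multiple of 10 s
# and measure how far the real boundaries drift from the nominal ones.

def nearest_rap_boundaries(rap_times, nominal_spacing, duration):
    """For each nominal boundary (10 s, 20 s, ...) pick the closest RAP."""
    boundaries = []
    t = nominal_spacing
    while t < duration:
        boundaries.append(min(rap_times, key=lambda r: abs(r - t)))
        t += nominal_spacing
    return boundaries

# A VBR encode with irregular keyframes (hypothetical RAP timestamps).
raps = [0.0, 3.1, 7.9, 10.6, 14.2, 19.1, 21.4, 26.8, 30.2]
bounds = nearest_rap_boundaries(raps, 10.0, 35.0)   # [10.6, 19.1, 30.2]
errors = [abs(b - (i + 1) * 10.0) for i, b in enumerate(bounds)]
max_error = max(errors)                              # largest drift, ~0.9 s here
```

With sparser keyframes the drift grows accordingly, which is why a manifest that only states "10 s chunks" can mislead a seeking player.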

Very interesting indeed, but aren't we about to re-invent SMIL all the way down here? ;-)

Not exactly, but it's worth discussing why we don't just use SMIL for all of this and call it a day. This was discussed in some detail in 3GPP last year. The feature set of SMIL and the feature set needed for adaptive streaming do intersect, but there are a few things we need for adaptive streaming which are not in SMIL and a LOT of things in SMIL which are not needed for adaptive streaming.

The intersection is also rather awkward. You can use <par> to define alternatives and <seq> to define chunks, but you end up duplicating the set of <par> elements in each element of the <seq>. And there is no semantic linkage between the different versions in each time period (nothing in SMIL implies any relationship between the first alternative in one time period and the first alternative in the next). It's also not clear what to do in SMIL about the fact that the audio and video in an interleaved chunk end at slightly different times, and begin at slightly different times in the next chunk to compensate. An out-of-the-box SMIL player might not play that seamlessly.
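A hypothetical SMIL fragment makes the duplication visible (this sketch uses SMIL's <switch> for bitrate alternatives; element and attribute choices are from memory and may not match a given SMIL profile):

```xml
<!-- Every time period in the <seq> repeats the full set of
     alternatives, and nothing ties the 500k entry in one period
     to the 500k entry in the next. -->
<smil>
  <body>
    <seq>
      <switch>
        <video src="chunk0_500k.ts"  systemBitrate="500000"/>
        <video src="chunk0_1500k.ts" systemBitrate="1500000"/>
      </switch>
      <switch>
        <video src="chunk1_500k.ts"  systemBitrate="500000"/>
        <video src="chunk1_1500k.ts" systemBitrate="1500000"/>
      </switch>
    </seq>
  </body>
</smil>
```

For an hour of content in 10 s chunks this repeats the alternative set 360 times, with no machine-readable statement that the streams are continuations of one another.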

Yes, you can shoehorn it in, but it's awkward and verbose, and it's not at all clear that existing SMIL players would "do the right thing". Modifying an existing player is not likely to be any easier than a from-scratch adaptive streaming implementation, given the complexity of SMIL (and the comparative simplicity of adaptive streaming).

I think we should be thinking of adaptive streaming as just another stream type which appears to any presentation layer (HTML5, SMIL, whatever) as a simple audio/video stream with some switchable properties (like audio language, subtitles, etc.) and hides the complexities of adaptivity and switching from the presentation layer. Switching, splicing, chunking, bitrates, etc. are all things that should worry us video geeks; we shouldn't concern the presentation design people with them.


- all interleaved and elementary streams - chunked or not - are independently playable (and start with a RAP for video)
Did you mean that individual chunks are independently playable, or that the concatenation of the chunks should be
independently playable ?

The former.

In most formats there are file/codec headers that you don't want to repeat in every chunk (seems to be true for mp4 and WebM).

Agreed for MP4, but for "streamable" containers (like MPEG-TS or FLV), the overhead is really minimal for most codecs,
and repeating the headers in every chunk would probably ease implementation (and debugging).

- the chosen container is easily "streamable", and requires minimal header overhead to pass codec initialization info
I think formally what you want is that the headers contain only information which is about the whole stream (in time).
Information which is about some part (in time) of the stream should be distributed at appropriate points in the file.

Correct, unless you have the playlist manifest that "maps" all the chunks, even if the PCR is reset in every chunk (PTS/DTS
should be 0-based in every chunk anyway).
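The 0-based-per-chunk model can be sketched as follows (the manifest fields and chunk names are hypothetical): each chunk carries local timestamps starting at zero, and the manifest alone maps chunks onto the global presentation timeline.

```python
# Chunk-local PTS are 0-based; the manifest supplies each chunk's
# global start time, so the player reconstructs presentation time
# without reading any other chunk.

def to_global_time(manifest, chunk_index, local_pts):
    """Map a chunk-local (0-based) PTS to global presentation time."""
    chunk_start = manifest[chunk_index]["start"]   # taken from the manifest
    return chunk_start + local_pts

manifest = [
    {"url": "seg00000.ts", "start": 0.0},
    {"url": "seg00001.ts", "start": 10.24},
    {"url": "seg00002.ts", "start": 19.87},
]

# A frame 1.5 s into the second chunk plays at 11.74 s globally.
t = to_global_time(manifest, 1, 1.5)
```

This is also why resetting the PCR per chunk is harmless here: continuity comes from the manifest, not from the container's clock.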

I agree, even though that may make Vorbis a little less competitive (8 KB codec initialization).

8 KB?! What exactly is being passed in 8 KB for an audio codec initialization sequence?
(By comparison, the initialization sequence for AAC is only 2 bytes.)
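For context on the asymmetry: Vorbis ships its codebooks in the setup header, which is where the ~8 KB goes, while the AAC "2 bytes" is the AudioSpecificConfig: 5 bits of audio object type, 4 bits of sampling-frequency index, and 4 bits of channel configuration (plus trailing flag bits left 0) fit in 16 bits. A small sketch:

```python
# Pack a minimal 2-byte AAC AudioSpecificConfig:
# 5 bits object type | 4 bits frequency index | 4 bits channel config,
# with the remaining 3 flag bits left as 0.

def aac_audio_specific_config(object_type, freq_index, channel_config):
    word = (object_type << 11) | (freq_index << 7) | (channel_config << 3)
    return word.to_bytes(2, "big")

# AAC-LC (object type 2), 44.1 kHz (index 4), stereo (2) -> b'\x12\x10'
asc = aac_audio_specific_config(2, 4, 2)
```

So the AAC decoder derives everything else from tables baked into the spec, whereas a Vorbis decoder must receive its tables in-band.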

- RAP may not be aligned between versions (or at least it shouldn't be a strong requirement even if in practice it's often
 the case), thus end-user experience with no-glitch stream-switching would depend on renderer and double-decoding/buffering
 pipeline capabilities
I believe it is very unlikely that many kinds of device will support the kind of double decoding needed to splice at arbitrary points
without any alignment between versions. So, I think it's important that the manifest at least indicate whether the fragmentation
has the nice alignment and RAP-position properties needed to enable seamless switching without such media-pipeline enhancements.
Devices without that capability can then choose whether to switch with glitches, or not switch at all. Providers can decide whether
they want to prepare content which switches seamlessly on all devices or just on some. We defined the necessary alignment properties
quite precisely in MPEG and included such an indicator.
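A hypothetical sketch of how a client might consume such an indicator (the flag name and policy strings are invented for illustration): the manifest states whether chunks/RAPs are aligned across versions, and each device picks a switching strategy it can actually execute.

```python
# Client-side policy driven by a manifest alignment flag: devices
# without splice-capable pipelines still know what they can safely do.

def switch_policy(manifest_aligned, device_can_splice):
    """Choose a stream-switching strategy from manifest and device capability."""
    if manifest_aligned:
        return "seamless switch at chunk boundaries"
    if device_can_splice:
        return "seamless switch via double-decode splice"
    return "switch with glitches, or stay on one version"
```

Providers then control the trade-off at encode time: aligned content switches seamlessly everywhere, unaligned content only on splice-capable devices.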

Agreed. I'm just saying that aligning RAPs shouldn't be a prerequisite.

A crossfader will be necessary for audio to avoid any glitch when switching streams. That's (a little) more buffering needed.
By the way, we should also keep in mind that at some point this audio/video transmission may end up going both ways (a la
Chatroulette), so we should keep the client and server as simple as possible.
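The audio crossfade amounts to overlap-decoding both streams briefly and ramping one down while the other ramps up. A minimal sketch (equal-gain linear fade; a real player might prefer an equal-power curve):

```python
# Linearly crossfade two equal-length overlapping runs of PCM samples:
# the old stream's weight falls from 1 to 0 while the new one's rises.

def crossfade(old_samples, new_samples):
    """Crossfade two equal-length sample runs from the old to the new stream."""
    n = len(old_samples)
    assert n == len(new_samples)
    out = []
    for i in range(n):
        w = i / (n - 1) if n > 1 else 1.0   # 0.0 -> 1.0 across the overlap
        out.append((1.0 - w) * old_samples[i] + w * new_samples[i])
    return out

faded = crossfade([1.0] * 5, [0.0] * 5)     # ramps 1.0 down to 0.0
```

The extra buffering mentioned above comes from needing both decoded streams in memory for the length of the overlap window.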

This is what Apple are doing in their implementation, but it may prove quite complex to do the same for most browser vendors.

Requiring clients to maintain multiple decode pipelines, splice video, and crossfade audio is excessive and unnecessary complexity
that few devices will be able to support.

Yes. But at the same time, this is not necessarily a reason for pushing an "aligned-RAP-among-all-versions-of-the-same-content" prerequisite
in the upcoming proposal.



