[foms] WebM Manifest

Mark Watson watsonm at netflix.com
Fri Mar 18 13:00:12 PDT 2011


On Mar 18, 2011, at 9:20 AM, Steve Lhomme wrote:

> On Fri, Mar 18, 2011 at 4:05 PM, Mark Watson <watsonm at netflix.com> wrote:
>> 
>> On Mar 17, 2011, at 11:52 PM, Steve Lhomme wrote:
>> 
>>> On Fri, Mar 18, 2011 at 12:10 AM, Timothy B. Terriberry
>>> <tterribe at xiph.org> wrote:
>>>>> In the case you describe the only drawback is that playback is not as
>>>>> perfect as it can theoretically be. But that's expected when using
>>>>> adaptive streaming anyway.
>>>> 
>>>> The comments I gave before were not meant to be an exhaustive list of
>>>> shortcomings. You also need to either a) know enough about the streams
>>>> in advance to know whether or not such a switch will be successful
>>>> (i.e., if you can't find that information in the manifest, then you'll
>>>> need a full keyframe index, exposed in Javascript, which you would
>>>> otherwise not need), meaning higher startup costs, etc., or b) you can
>>>> try to make such a switch without knowing that it will succeed, and
>>>> frequently download a lot of extra data which must be thrown away when
>>>> you fail. Either way you add a lot of implementation complexity to do
>>>> it. I guess maybe that all still falls under "playback is not as perfect
>>>> as it can theoretically be", but that continues all the way down to, "It
>>>> doesn't play at all."
>>> 
>>> The manifest usually doesn't contain all the possible switch points
>>> (range information) for each variant. That information is deduced from
>>> the index that is loaded at startup (which in binary format will take
>>> less space than XML/JSON anyway). I think that's how DASH works and
>>> IMO it makes more sense that way.
>> 
>> Yes, the keyframe positions are in the index.
>> 
>> It is certainly possible to provide seamless switching without there being any keyframe alignment, it is just more difficult, involving changes deeper into the media pipeline.
> 
> Why would it be more difficult ?
> Non-aligned case: You play one stream and then decide you can use more
> bandwidth, you look for the next keyframe in the stream you want to
> switch to and do the switch at that time
> Aligned case: You play one stream and then decide you can use more
> bandwidth, you look for the next keyframe in the stream you want to
> switch to and do the switch at that time
> 
> In short, the fact that they are aligned or not has no effect.

"Do the switch at that time" means different things in the two cases.

First, there are two sub-cases for the "non-aligned" case. In case A, the keyframes may not be aligned between streams, but the fragment boundaries advertised in the index are aligned with keyframes. In case B, the keyframes are not aligned between streams and the fragment boundaries advertised in the index are not aligned with keyframes (the fragment boundaries may or may not be aligned with each other between streams, but this is not important).

Now, there are two differences between case A and the aligned case.

(1) when there is alignment, the downloaded data is disjoint (in time). The last downloaded fragment of the old stream ends at frame n-1, say, and the first downloaded fragment of the new stream starts at frame n. In the non-aligned case there will be overlap between these two fragments (in time). I need to discard some samples from the last fragment of the old stream. I could receive and discard them, or stop reception before the end of the fragment - both operations require parsing of the stream and so must be performed at a point in the media pipeline where the stream has been de-encapsulated. Without this requirement the adaptive streaming part can be implemented without ever parsing the stream except for the index.
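
As a rough sketch of (1) - the demux() helper and the sample fields below are made up for illustration, not any real WebM or DASH API:

    # Hypothetical sketch only: demux() yields samples with .timestamp,
    # .is_keyframe and .payload attributes.

    def splice_case_a(old_fragment, new_fragment, demux):
        # The new fragment starts at a keyframe of the new stream, but the
        # last old fragment runs past that point in time, so the old
        # fragment has to be de-encapsulated and its trailing samples
        # dropped.
        new_samples = list(demux(new_fragment))
        switch_time = new_samples[0].timestamp      # first new keyframe

        kept_old = [s for s in demux(old_fragment)
                    if s.timestamp < switch_time]
        return kept_old + new_samples

    # In the aligned case there is nothing to trim: fragments can be handed
    # to the pipeline back to back using only the index, without parsing
    # the media data at all.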

(2) without alignment, suppose I want to start playback at frame n, which is a keyframe in the new stream. It's possible that frame n-1 in the *old* stream depends on frame n of the old stream (for example, because of bi-directional prediction). I need to decode frame n from the old stream in order to decode frame n-1 from the old stream, and then discard the decoded frame n. Then I decode frame n from the new stream (I need it, it's a keyframe). You can't use frames from one stream as references for frames in another stream without artifacts.
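
A sketch of (2), with made-up decoder objects, frames fed in decode order, and n standing for the presentation time of the first keyframe taken from the new stream (again purely illustrative):

    # Hypothetical sketch only: old_frames is the tail of the old stream in
    # decode order, new_frames starts at keyframe n of the new stream.

    def switch_at_keyframe(old_decoder, new_decoder,
                           old_frames, new_frames, n):
        output = []
        for frame in old_frames:
            picture = old_decoder.decode(frame)
            if picture.pts < n:
                output.append(picture)
            # frame n (and anything later) from the old stream is decoded
            # only so that frame n-1 can be reconstructed, then thrown away

        for frame in new_frames:
            # decoded from scratch starting at the new keyframe; old-stream
            # pictures must not be used as references for the new stream
            output.append(new_decoder.decode(frame))
        return output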

Then, in case B, things are worse because I do not know where the keyframes are in the new stream until I have downloaded a fragment. I need to download a fragment that completely overlaps (in time) with data I have already downloaded from the old stream. Then I look for a keyframe, and then I am in the same situation as (1) or (2) above. This is what you need to do with Apple HTTP Live Streaming.
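
For case B the extra step looks roughly like this - fetch() and demux() are again illustrative helpers, not a proposal:

    # Hypothetical sketch only: download a fragment of the new stream that
    # overlaps data already buffered from the old stream, then scan it for
    # a usable keyframe.

    def find_switch_point_case_b(fetch, demux, url, byte_range,
                                 buffered_until):
        data = fetch(url, byte_range)      # redundant download overlapping
                                           # the old stream
        for sample in demux(data):
            if sample.is_keyframe and sample.timestamp <= buffered_until:
                return sample.timestamp    # then proceed as in (1)/(2)
        return None                        # no usable keyframe: the whole
                                           # fragment was downloaded for
                                           # nothing; try the next one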

None of this is so hard if you have a software decoder on an unconstrained platform and you're in a position to modify it.

But if you think of delivering content to hundreds of different devices, each with its own implementation, many in hardware and resource-constrained, it's a big deal to expect support for these functions on all those devices.

...Mark

> 
> -- 
> Steve Lhomme
> Matroska association Chairman
> 


