[foms] WebM Manifest

Sat Mar 19 02:34:17 PDT 2011

On Fri, Mar 18, 2011 at 9:00 PM, Mark Watson <watsonm at netflix.com> wrote:
>
> On Mar 18, 2011, at 9:20 AM, Steve Lhomme wrote:
>
>> On Fri, Mar 18, 2011 at 4:05 PM, Mark Watson <watsonm at netflix.com> wrote:
>>>
>>> On Mar 17, 2011, at 11:52 PM, Steve Lhomme wrote:
>>>
>>>> On Fri, Mar 18, 2011 at 12:10 AM, Timothy B. Terriberry
>>>> <tterribe at xiph.org> wrote:
>>>>>> In the case you describe the only drawback is that playback is not as
>>>>>> perfect as it can theoretically be. But that's expected when using
>>>>>> adaptive streaming anyway.
>>>>>
>>>>> The comments I gave before were not meant to be an exhaustive list of
>>>>> shortcomings. You also need to either a) know enough about the streams
>>>>> in advance to know whether or not such a switch will be successful
>>>>> (i.e., if you can't find that information in the manifest, then you'll
>>>>> need a full keyframe index, exposed in Javascript, which you would
>>>>> otherwise not need), meaning higher startup costs, etc., or b) you can
>>>>> try to make such a switch without knowing that it will succeed, and
>>>>> frequently download a lot of extra data which must be thrown away when
>>>>> you fail. Either way you add a lot of implementation complexity to do
>>>>> it. I guess maybe that all still falls under "playback is not as perfect
>>>>> as it can theoretically be", but that continues all the way down to, "It
>>>>> doesn't play at all."
>>>>
>>>> The manifest usually doesn't contain all the possible switch points
>>>> (range information) for each variant. That information is deduced from
>>>> the index that is loaded at startup (which in binary format will take
>>>> less space than XML/JSON anyway). I think that's how DASH works and
>>>> IMO it makes more sense that way.
>>>
>>> Yes, the keyframe positions are in the index.
>>>
>>> It is certainly possible to provide seamless switching without there being any keyframe alignment, it is just more difficult, involving changes deeper into the media pipeline.
>>
>> Why would it be more difficult?
>> Non-aligned case: You play one stream and then decide you can use more
>> bandwidth, you look for the next keyframe in the stream you want to
>> switch to and do the switch at that time
>> Aligned case: You play one stream and then decide you can use more
>> bandwidth, you look for the next keyframe in the stream you want to
>> switch to and do the switch at that time
>>
>> In short, the fact that they are aligned or not has no effect.
>
> "Do the switch at that time" means different things in the two cases.
>
> First, there are two sub-cases for the "non-aligned" case. In case A, the keyframes may not be aligned between streams, but the fragment boundaries advertised in the index are aligned with keyframes. In case B, the keyframes are not aligned between streams and the fragment boundaries advertised in the index are not aligned with keyframes (The fragment boundaries may or may not be aligned with each other between streams, but this is not important).

I don't know how it works with fragmented MP4 or MPEG TS (which
doesn't have an index at all?) but speaking for Matroska and WebM in
particular, the general muxing rule is to start a Cluster (fragment)
with a keyframe and only reference that first keyframe in the index.
So case B cannot happen (hence my initial assumption). Given
fragmented MP4 is new, I suppose it could be one of the imposed rules
as well.
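
To make the lookup concrete, here is a minimal Python sketch, assuming
the index of each variant has already been parsed into a sorted list of
keyframe timestamps in milliseconds. The function name and the sample
Cue times are hypothetical, not taken from any WebM library or real
stream.

```python
from bisect import bisect_left

def next_switch_point(cue_times_ms, now_ms):
    """Return the first indexed keyframe time at or after now_ms, or None."""
    i = bisect_left(cue_times_ms, now_ms)
    return cue_times_ms[i] if i < len(cue_times_ms) else None

# Hypothetical Cue times (ms) for two variants whose keyframes are not aligned.
low_cues = [0, 2000, 4000, 6000]
high_cues = [0, 2500, 5000, 7500]

# Playing the low variant at 3100 ms and deciding to switch up:
# the switch happens at the next keyframe of the *target* stream.
switch_at = next_switch_point(high_cues, 3100)
```

With these made-up numbers the switch point falls in the middle of a
low-variant Cluster, which is exactly the non-aligned case A.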

> Now, there are two differences between Case A and aligned keyframes.
>
> (1) when there is alignment, the downloaded data is disjoint (in time). The last downloaded fragment of the old stream ends at frame n-1, say, and the first downloaded fragment of the new stream starts at frame n. In the non-aligned case there will be overlap between these two fragments (in time). I need to discard some samples from the last fragment of the old stream. I could receive and discard them, or stop reception before the end of the fragment - both operations require parsing of the stream and so must be performed at a point in the media pipeline where the stream has been de-encapsulated. Without this requirement the adaptive streaming part can be implemented without ever parsing the stream except for the index.

Yes, this system adds the extra requirement that the switch decision
is based on knowing which "de-encapsulated" frames have been
downloaded, for both the audio and video streams (when they are
interleaved).
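
A rough Python sketch of that discard step, assuming the fragment has
already been de-encapsulated into (timestamp_ms, payload) tuples and
that there are no B frames, so presentation order equals decode order.
The names and the 25 fps fragment are hypothetical.

```python
def trim_old_fragment(frames, switch_ms):
    """Split off the frames of the old stream that overlap the new stream.

    frames is a list of (timestamp_ms, payload) tuples.
    Returns (kept_frames, number_of_dropped_frames).
    """
    kept = [f for f in frames if f[0] < switch_ms]
    return kept, len(frames) - len(kept)

# Hypothetical old-stream fragment: 50 frames at 25 fps starting at 4000 ms.
old_fragment = [(4000 + 40 * i, b"") for i in range(50)]

# The new stream starts on its keyframe at 5000 ms, so everything from
# 5000 ms onward in the old fragment is overlap and gets discarded.
kept, dropped = trim_old_fragment(old_fragment, 5000)
```

The point is only that the cut needs per-frame timestamps, which the
index alone does not provide.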

> (2) without alignment, suppose I want to start playback at frame n which is a keyframe in the new stream. It's possible that frame n-1 in the *old* stream depends on frame n. I need to decode frame n from the old stream in order to decode frame n-1 from the old stream and then discard the decoded frame n. Then I decode frame n from the new stream (I need it, it's a keyframe). You can't use frames from one stream as references for frames in another stream without artifacts.

Yes, there is a problem when B frames are used (again, not a problem
in WebM). But B frames are a problem for adaptive streaming anyway.
They require downloading data far ahead (like n+60) before you can
play a certain frame (n). That poses a big threat of starving the
decoder while waiting for data to load from the network, which in the
worst case can only load at the average bitrate/speed of one frame in
that fragment. Are there any plans to enforce rules on the good use of
B frames (say not further than 2 frames) in DASH or even to simply
ban them?
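
For reference, the dependency problem in (2) can be sketched in Python
as follows. This is only an illustration under assumed data: the
reference graph is made up, and frame numbers stand for presentation
order.

```python
def frames_to_decode(refs, last_shown):
    """Work out what must be decoded to display frames 0..last_shown.

    refs maps a frame number to the list of frame numbers it references.
    Returns (frames_to_decode, frames_decoded_but_never_displayed).
    """
    needed = set()
    stack = [f for f in refs if f <= last_shown]
    while stack:
        f = stack.pop()
        if f not in needed:
            needed.add(f)
            stack.extend(refs[f])  # follow references, including forward ones
    discard = sorted(f for f in needed if f > last_shown)
    return sorted(needed), discard

# Hypothetical old stream: frame 4 is a B frame referencing frames 3 and 5.
old_refs = {0: [], 1: [0], 2: [1], 3: [2], 4: [3, 5], 5: [3]}

# Switching to the new stream at frame 5 means frame 4 is the last old
# frame shown, yet old frame 5 still has to be fetched and decoded.
decode, discard = frames_to_decode(old_refs, last_shown=4)
```

The further ahead a B frame may reference, the more of the old stream
lands in the discard list.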

-- 
Steve Lhomme
Matroska association Chairman