[foms] WebM Manifest

Sat Mar 19 10:19:39 PDT 2011

Steve,

I read the whole thread. Unfortunately I don't have time to respond to every point, but the two original problems I raised are still valid. Please take another look at the description.

It's easy if you have control of the whole media pipeline from reception to decoding to rendering and the device is not resource constrained. You can implement the adaptive streaming part such that it feeds potentially overlapping fragments into the de-encapsulator and all that needs to be done is to discard the overlapping data. But that API is often exactly the level at which responsibility for the code changes (from app developer to platform vendor). An API that could accept such overlapping data is not going to be widely available to us for a long time.

I'm not sure why you concluded that #2 was not an issue because the frames arrive in decode order. I did not mention anything about the order of frame arrival. The issue is duplicate decoding of a frame, which is a problem from both a decoder-capability and a computational-load point of view.

Some comments on TCP inline...

...Mark  

On Mar 19, 2011, at 7:01 AM, "Steve Lhomme" <slhomme at matroska.org> wrote:

> On Sat, Mar 19, 2011 at 12:02 PM, Timothy B. Terriberry
> <tterribe at xiph.org> wrote:
>>> handled in hardware. But a container with an index, metadata,
>>> chapters, etc. I doubt it would be done in hardware.
>> 
>> I am not a hardware expert, so take anything I say with a grain of salt,
>> but the codec itself is a heck of a lot more complex than the container,
>> and they seem to manage those just fine. But they at the very least
>> wanted to be able, in hardware, to skip ahead in the stream when they
>> encountered a corrupt frame for error-concealment purposes, and they
>> weren't talking about MP3. That may not require index and chapter marker
>> parsing, but it does imply that you've surrendered more control over
>> decode & presentation than you may be used to in a software system.
> 
> In the case of adaptive streaming, the hardware would still need to
> provide the necessary info to make the switch decisions. I guess to
> estimate the bandwidth it would need to know if the data are arriving
> too slow for the reader (available bandwidth too small for playback).
> That can be handled at the network level, by inspecting the level of
> buffering. If the buffer is empty too often the system (not hardware
> coded for now) would take the decision to switch to a lower bandwidth
> version and look for the next timecode it can switch to. So the
> information missing here is whether the buffer has reached that
> timecode yet or not. As said above, that information will be available
> internally at some point, so is there a reason why it should not be
> available in the hardware "API" ?
> 
> If that really is impossible in hardware, then that hardware would
> likely miss the only fragment start where it doesn't align with the
> stream it's currently reading. I'd really like to know the estimation
> of how much (percentage) of missed fragments we are talking about.
> 
>>> Yes. But so does the TCP error repeat packets. Obviously TCP with a
>>> large window and a very small bandwidth is never going to work
>>> properly.
>> 
>> I think you've just described all residential internet in the US. Ask
>> Netflix how well it works.
> 
> That still doesn't make any difference if you switch after the end of
> frame n-1 or the end of a fragment.
> 
> Now that's a good point in favor of using the TCP window more. It
> could be reduced while the decision is being made or start loading
> from another stream while the main one is still loading. With a window
> of (almost) 0 the TCP connection would then be established/ready to be
> used as soon as the bandwidth is available. When you know you are
> going to switch to a new stream, you can reduce the window gradually,
> with 0 happening at the exact byte end position of the fragment (or
> n-1 frame). That would minimize the bandwidth waste and latency time
> between reading 2 fragments.

Closing the receive window just pauses the transmission. The data you originally requested will still come later unless you close the connection.
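
To illustrate the point, here is a minimal loopback sketch in Python. It is only an assumption-laden demonstration, not anything from the proposals in this thread: the names send_payload, receive_with_pause and PAYLOAD_SIZE are made up, and "not reading from the socket" stands in for closing the receive window. The receiver stops reading for a while, the sender blocks, and once reading resumes every byte that was originally requested still arrives.

```python
import socket
import threading
import time

# Hypothetical payload size, chosen to be larger than typical loopback buffers.
PAYLOAD_SIZE = 16 * 1024 * 1024


def send_payload(listener):
    """Accept one connection and send the whole payload, then close."""
    conn, _ = listener.accept()
    with conn:
        # sendall blocks whenever the peer stops draining its receive buffer.
        conn.sendall(b"x" * PAYLOAD_SIZE)


def receive_with_pause(pause_seconds=0.2):
    """Read a little, stop reading for a while, then drain the connection.

    Returns the total number of bytes received.
    """
    listener = socket.create_server(("127.0.0.1", 0))
    sender = threading.Thread(target=send_payload, args=(listener,))
    sender.start()
    received = 0
    with socket.create_connection(listener.getsockname()) as sock:
        received += len(sock.recv(1024))
        # While we do not read, the receive buffer fills and the advertised
        # window shrinks (possibly to zero). Transmission is paused, not cancelled.
        time.sleep(pause_seconds)
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # sender closed after sending everything
                break
            received += len(chunk)
    sender.join()
    listener.close()
    return received
```

Calling receive_with_pause() returns PAYLOAD_SIZE however long the pause is. The only way to avoid receiving the remaining bytes is to close the connection.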

Receiving the overlap data is not really the issue (though it would be nice to avoid). The point is that you cannot detect where to stop in the old stream without parsing down to the frame level, which ties together the media player and the adaptive streamer in a way that is both unnecessary and not aligned with existing architectures.
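
For contrast, here is a rough sketch of what the streamer needs when switches happen only at fragment boundaries that are listed in an index. Everything here is an assumption for illustration: the Fragment tuple, its fields and range_for_next_fragment are hypothetical and are not taken from any WebM manifest proposal. The point is that the function works from byte offsets and times alone and never looks inside the media data.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Fragment(NamedTuple):
    """One entry of a hypothetical fragment index for a single stream."""
    start_time: float  # presentation time of the fragment start, in seconds
    first_byte: int    # offset of the first byte of the fragment in the file
    last_byte: int     # offset of the last byte of the fragment (inclusive)


def range_for_next_fragment(index: Sequence[Fragment],
                            playback_time: float) -> Optional[str]:
    """Return the HTTP Range value for the first fragment that starts
    strictly after playback_time, or None if there is no such fragment.

    Only the index is consulted; no container or frame parsing is involved.
    """
    starts = [f.start_time for f in index]
    i = bisect_right(starts, playback_time)
    if i == len(index):
        return None
    fragment = index[i]
    return f"bytes={fragment.first_byte}-{fragment.last_byte}"
```

With an index for the new stream, the streamer can request exactly one fragment and the server stops sending at its end. Stopping the old stream at "frame n-1" offers no such shortcut, because that position is not in any index.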

> 
>>> This is the same whether you stop loading a stream at the end of a
>>> fragment or at the end of a frame (when a fragment is not a file but
>>> just a range request in a bigger file).
>> 
>> No, my point is that this is emphatically _not_ the same. If you make a
>> range request for a single fragment, then the sender knows, in advance,
>> to stop sending you more data when it reaches the end of the fragment.
>> If you don't know where the frame boundaries are, in advance (and in
>> your proposal the fragment you're switching from doesn't end at a
>> keyframe, so an index won't help you), then the sender will keep sending
>> until you tell it to stop.
> 
> That's true and using a range request is surely nicer than playing
> with the TCP window. But as shown above, playing with the TCP window
> can still be useful when switching streams. (server + DNS + TCP
> latency). Also using a "range" request has some drawbacks. It forces
> to open a new connection for each fragment, even if the new fragment
> was exactly the following of the previous fragment.

No, you re-use the same connection for the next request.
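
A small sketch of this, using only the Python standard library. The toy server, the MEDIA bytes, the "/media" path and the function names are all assumptions made up for the demonstration. The server records the client's source port for each request, so identical ports mean both range requests travelled over the same TCP connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

MEDIA = bytes(range(256)) * 4  # stand-in for a media file (1024 bytes)
seen_client_ports = []         # source port of every request the server handled


class RangeHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows persistent (keep-alive) connections

    def do_GET(self):
        seen_client_ports.append(self.client_address[1])
        # Parse only the simple "bytes=start-end" form used by the client below.
        spec = self.headers["Range"].split("=", 1)[1]
        start, end = (int(x) for x in spec.split("-"))
        body = MEDIA[start:end + 1]
        self.send_response(206)
        self.send_header("Content-Range", f"bytes {start}-{end}/{len(MEDIA)}")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_fragments(host, port, fragments):
    """Fetch each (start, end) byte range over a single HTTP connection."""
    conn = http.client.HTTPConnection(host, port)
    parts = []
    try:
        for start, end in fragments:
            conn.request("GET", "/media",
                         headers={"Range": f"bytes={start}-{end}"})
            response = conn.getresponse()
            parts.append(response.read())  # drain fully before the next request
    finally:
        conn.close()
    return parts


def demo():
    """Start the toy server on a free port and fetch two adjacent fragments."""
    server = HTTPServer(("127.0.0.1", 0), RangeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_fragments("127.0.0.1", server.server_address[1],
                               [(0, 99), (100, 199)])
    finally:
        server.shutdown()
        server.server_close()
```

After demo() runs, seen_client_ports holds two entries with the same port number: the second range request reused the connection opened for the first.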

> That results in
> time and resource wasted to establish the TCP connection and the
> server side "session". If you use an "offset" request (a start offset
> but no end) you avoid that issue. And you just need to adjust your TCP
> window if you really don't want to waste a byte in the transmission.
> 
> It has already been established that having fragments of one stream in
> many files is not practical and will likely not be used (for on demand
> at least). So maybe the next step should be to NOT use range requests
> at all.

What would you use then?

> 
>>> The corresponding audio needs to be in front of the n-1 video frame.
>>> This is not exactly a requirement of WebM (only needs to be in front
>>> of a keyframe), let alone Matroska. mkclean does this automatically,
>> 
>> I tried very hard not to take this opportunity to say, "Should have used
>> Ogg," but I think I just failed.
> 
> Given Ogg is an interleaved format (and not well suited for video in
> general or streaming because of the bandwidth just wasted for the
> container), I don't see what guarantees that audio is always ahead of
> the video.
> 
> -- 
> Steve Lhomme
> Matroska association Chairman