[foms] WebM Manifest

Sat Mar 19 07:01:16 PDT 2011

On Sat, Mar 19, 2011 at 12:02 PM, Timothy B. Terriberry
<tterribe at xiph.org> wrote:
>> handled in hardware. But a container with an index, metadata,
>> chapters, etc. I doubt it would be done in hardware.
>
> I am not a hardware expert, so take anything I say with a grain of salt,
> but the codec itself is a heck of a lot more complex than the container,
> and they seem to manage those just fine. But they at the very least
> wanted to be able, in hardware, to skip ahead in the stream when they
> encountered a corrupt frame for error-concealment purposes, and they
> weren't talking about MP3. That may not require index and chapter marker
> parsing, but it does imply that you've surrendered more control over
> decode & presentation than you may be used to in a software system.

In the case of adaptive streaming, the hardware would still need to
provide the information needed to make the switch decisions. To
estimate the bandwidth, it would need to know whether the data is
arriving too slowly for the reader (available bandwidth too small for
playback). That can be handled at the network level, by inspecting the
level of buffering. If the buffer is empty too often, the system (not
hardware coded for now) would decide to switch to a lower-bandwidth
version and look for the next timecode it can switch to. So the
information missing here is whether the buffer has reached that
timecode yet or not. As said above, that information will be available
internally at some point, so is there a reason why it should not be
available in the hardware "API"?
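
As a rough illustration of the decision described above, here is a
minimal Python sketch. Every name in it (decide_switch, underruns,
max_underruns, switch_points, buffered_until) is hypothetical and the
threshold is arbitrary; it only shows that the one value needed from
the decoding side is the timecode the buffer has already reached.

```python
from bisect import bisect_right


def next_switch_point(switch_points, buffered_until):
    """Return the first allowed switch timecode strictly after the buffer.

    switch_points: sorted list of timecodes (seconds) where a switch is
        allowed (hypothetical; e.g. taken from a manifest or index).
    buffered_until: timecode (seconds) the buffer has already reached.
    Returns None when no later switch point exists.
    """
    i = bisect_right(switch_points, buffered_until)
    return switch_points[i] if i < len(switch_points) else None


def decide_switch(underruns, max_underruns, bitrates, current,
                  switch_points, buffered_until):
    """Return (new_bitrate, switch_timecode), or None to keep the stream."""
    # The buffer has not run empty "too often": stay where we are.
    if underruns <= max_underruns:
        return None
    # Pick the closest lower-bandwidth version, if there is one.
    lower = [b for b in bitrates if b < current]
    if not lower:
        return None
    # The switch can only happen at a timecode not yet buffered.
    target = next_switch_point(switch_points, buffered_until)
    if target is None:
        return None
    return max(lower), target
```

Without buffered_until, the caller cannot tell which switch points
have already been passed, which is the missing piece of information
mentioned above.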

If that really is impossible in hardware, then that hardware would
likely miss only the fragment starts that don't align with the stream
it's currently reading. I'd really like to see an estimate of what
percentage of fragments would be missed.

>> Yes. But so does the TCP error repeat packets. Obviously TCP with a
>> large window and a very small bandwidth is never going to work
>> properly.
>
> I think you've just described all residential internet in the US. Ask
> Netflix how well it works.

That still doesn't make any difference whether you switch after the
end of frame n-1 or at the end of a fragment.

Now that's a good point in favor of using the TCP window more. The
window could be reduced while the decision is being made, or you
could start loading from another stream while the main one is still
loading. With a window of (almost) 0, the TCP connection would then be
established and ready to be used as soon as the bandwidth is
available. When you know you are going to switch to a new stream, you
can reduce the window gradually, reaching 0 at the exact byte end
position of the fragment (or of frame n-1). That would minimize the
wasted bandwidth and the latency between reading two fragments.
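
The arithmetic behind "reaching 0 at the exact byte end position" can
be sketched as below. This is only a model in Python: the function
name and parameters are made up for illustration, and whether a given
TCP stack lets an application control the advertised window this
precisely is an assumption, not something established in this thread.

```python
def advertised_window(bytes_received, end_offset, max_window):
    """Window to advertise so the sender cannot go past end_offset.

    bytes_received: bytes of the current stream received so far.
    end_offset: byte position where the fragment (or frame n-1) ends.
    max_window: the normal, unrestricted window size in bytes.
    """
    remaining = end_offset - bytes_received
    # Never advertise more than what is left before the switch point,
    # and never a negative value once the end has been reached.
    return max(0, min(max_window, remaining))
```

Far from the end the window stays at max_window; it then shrinks with
the remaining byte count and hits 0 exactly at end_offset.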

>> This is the same whether you stop loading a stream at the end of a
>> fragment or at the end of a frame (when a fragment is not a file but
>> just a range request in a bigger file).
>
> No, my point is that this is emphatically _not_ the same. If you make a
> range request for a single fragment, then the sender knows, in advance,
> to stop sending you more data when it reaches the end of the fragment.
> If you don't know where the frame boundaries are, in advance (and in
> your proposal the fragment you're switching from doesn't end at a
> keyframe, so an index won't help you), then the sender will keep sending
> until you tell it to stop.

That's true, and using a range request is surely nicer than playing
with the TCP window. But as shown above, playing with the TCP window
can still be useful when switching streams (server + DNS + TCP
latency). Also, using a "range" request has some drawbacks. It forces
you to open a new connection for each fragment, even if the new
fragment immediately follows the previous one. That results in time
and resources wasted establishing the TCP connection and the
server-side "session". If you use an "offset" request (a start offset
but no end) you avoid that issue, and you just need to adjust your TCP
window if you really don't want to waste a byte in the transmission.
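
For reference, the two kinds of request differ only in the value of
the HTTP Range header. A small Python sketch follows; the helper name
range_header is hypothetical, while the "bytes=start-end" and
"bytes=start-" forms are the standard HTTP byte-range syntax.

```python
def range_header(start, end=None):
    """Build the value of an HTTP Range header.

    start: first byte wanted (0-based).
    end: last byte wanted, inclusive; None gives an open-ended
        "offset" request that runs to the end of the file.
    """
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    if end is None:
        return f"bytes={start}-"      # "offset" request: no end position
    return f"bytes={start}-{end}"     # "range" request: one fragment
```

For example, range_header(1000, 1999) asks for exactly one 1000-byte
fragment, whereas range_header(1000) lets the server keep sending
until the client stops reading.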

It has already been established that having the fragments of one
stream in many files is not practical and will likely not be used (for
on-demand at least). So maybe the next step should be to NOT use range
requests at all.

>> The corresponding audio needs to be in front of the n-1 video frame.
>> This is not exactly a requirement of WebM (only needs to be in front
>> of a keyframe), let alone Matroska. mkclean does this automatically,
>
> I tried very hard not to take this opportunity to say, "Should have used
> Ogg," but I think I just failed.

Given that Ogg is an interleaved format (and not well suited for video
in general or for streaming, because of the bandwidth wasted on the
container), I don't see what guarantees that the audio is always ahead
of the video.

-- 
Steve Lhomme
Matroska association Chairman