[foms] Proposal: adaptive streaming using open codecs
watsonm at netflix.com
Mon Nov 15 18:58:13 PST 2010
On Nov 15, 2010, at 2:22 PM, Silvia Pfeiffer wrote:
On Tue, Nov 16, 2010 at 4:49 AM, Steve Lhomme <slhomme at matroska.org> wrote:
On Mon, Nov 15, 2010 at 6:48 PM, Steve Lhomme <slhomme at matroska.org> wrote:
Doesn't it lead to more sync issues when the files you receive are
not interleaved? The two streams may not load at the same speed (one
better cached than the other, for example). It also makes it harder to
estimate the current download speed... That's an edge case, but
precisely the kind of odd network behaviour that "adaptive"
streaming is meant to handle.
One big pro for non-interleaved is that switching between languages
(or regular/commentary tracks) is a lot easier, and it is the only
reasonable way to handle it server side.
PS: It also allows something not possible now: listening to music from
video sites without having to load the video part. It's possible with
RTP, but the quality (on YouTube, for example) is just not there.
I believe we are optimizing for the wrong use cases by trying to
provide data to the Web browser in a non-interleaved manner. I would
not put that functionality into the adaptive HTTP streaming layer, but
into other technologies.
Firstly, providing different language audio tracks to the Web browser
for a video can be handled at the markup level. There is work in
progress on this anyway because we will see video descriptions and
sign language video that will need to be delivered on demand in
parallel to the main video. I would prefer we do not try to solve this
problem through adaptive HTTP streaming - it seems the wrong layer at
which to get this sorted.
For me the characterizing feature of "adaptive HTTP streaming" is that you have multiple streams available that are *precisely synchronized*, in a manner that allows for switching mid-stream without stopping the A/V clock or losing synchronization.
Whether the streams are different bitrate video, audio with different numbers of channels or in different languages or even video with/without open captions doesn't change this basic feature.
I'm not sure that separate HTML video/audio tags carry (or should carry) these semantics: the mechanisms required to switch seamlessly are quite low level and it makes sense to me to wrap the streams with this synchronization property at some lower level than HTML (i.e. in a manifest). Then they can be presented to HTML as a package with various switchable options. It's also a problem which is common across a number of environments, not all of which use HTML (but which might share the same source streams).
If you do try to represent all this at the HTML layer, you don't solve the media pipeline synchronization problem. It is still there.
Secondly, the use case of picking up only an audio track from a video
is also one that can be solved differently. It requires a process on
the server anyway to extract the audio data from the video and then it
would be a user request. So, it would probably come through a media
fragment URI such as http://example.com/video.ogv?track=audio which
would be processed by the server, and an audio resource would be
delivered, if the service provider decides to offer such a service.
You can do a lot with special server-side features, but it makes scaling harder/more expensive. It's a really valuable feature, IMO, if a system can work with standard, commodity web servers and caches.
As I have understood adaptive HTTP streaming, it is supposed to be a
simple implementation where the player's only additional functionality
is in interpreting a manifest file and switching between resource
chunks rather than byte ranges of a single resource.
That describes one approach, but it's an approach with a lot of disadvantages, as we've discussed. And you're making a lot of assumptions: supposed by whom? Additional functionality to what? What's not simple about byte range requests?
All of the decoding
pipeline continues to stay intact and work as previously.
Compared to what, though? Our service is deployed on hundreds of devices that all had pre-existing media pipelines, and we have no problems using the approach of separate audio and video transport on the network side. Many of these devices do use HTML for the presentation layer, so I think these kinds of cases really should be in scope for discussions of how to handle adaptive streaming in an HTML5 context.
I think we
should not touch the interleaved delivery functionality at this level.
It would cause the player to do too much synchronisation and network
delivery handling overhead work that should really not be created from
a single resource.
Really I don't think it's nearly as bad as you suggest. Media samples are not delivered to decoders in perfect synchronization anyway: final synchronization always has to be done at the renderer. It's valid to have an mp4 file where the audio samples are some distance from their video counterparts. Close interleaving is good for reducing buffering requirements but isn't required by player implementations (consistent interleaving probably is). The adaptive streaming engine on the network side can easily deliver samples with synchronization as close as in a perfectly valid mp4 file.
The 'network delivery handling' is no more complex than issuing HTTP requests in sequence, each time for the media type whose buffer is lowest.
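A minimal sketch of that scheduling rule (the function name and buffer model here are illustrative assumptions, not from any particular player implementation):

```python
# Sketch: always request the next chunk for whichever media type
# currently has the least buffered playback time.
def next_request(buffers, cursors):
    """buffers: seconds buffered per media type; cursors: next chunk index."""
    media = min(buffers, key=buffers.get)  # type with the lowest buffer
    return media, cursors[media]

buffers = {"audio": 12.0, "video": 4.5}
cursors = {"audio": 6, "video": 3}
print(next_request(buffers, cursors))  # -> ('video', 3)
```

Each completed download refills that type's buffer, and the loop repeats; nothing in this requires interleaved delivery on the wire.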
By contrast, the problem of creating correctly interleaved and aligned chunks is far from trivial. Audio and Video cannot start at 'exactly' the same time in a chunk due to the different sample sizes. If you want simple appending across bitrates you need to take great care that the skew in start times is consistent across versions. Not easy if you have, for example, different sized audio samples or different video bitrates.
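To make the skew concrete, here is the arithmetic for one hypothetical combination (48 kHz AAC with 1024-sample frames against 30 fps video; the specific numbers are only an example):

```python
# Audio and video frame durations rarely divide a chunk boundary
# exactly, so chunk start times for the two media types skew apart.
AAC_FRAME = 1024 / 48000      # one AAC frame at 48 kHz, ~21.33 ms
VIDEO_FRAME = 1 / 30          # one video frame at 30 fps, ~33.33 ms
target = 2.0                  # desired chunk boundary, in seconds

video_frames = round(target / VIDEO_FRAME)   # 60 frames -> exactly 2.0 s
audio_frames = round(target / AAC_FRAME)     # 94 frames -> ~2.0053 s
skew = audio_frames * AAC_FRAME - video_frames * VIDEO_FRAME
print(f"audio/video chunk-start skew: {skew * 1000:.2f} ms")
# -> audio/video chunk-start skew: 5.33 ms
```

Keeping that skew consistent across every encoded version of the content is the coordination burden the paragraph above describes.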
Or you can drop the requirement for simple appending, have the client download overlapping data when it switches and scan each media type for a suitable switch point (which may not be aligned between audio and video). You have to bear in mind that a RAP in the new stream might not be a RAP in the old one and indeed there may be dependencies in the old stream on frames after the new-frame-RAP position (implying some frames are decoded twice, in both old and new stream). At least this is the requirement in Apple's HTTP Live Streaming. I think this is a much bigger change in media pipeline design.
"SImple appending", handled separately for each stream, gives you the best of both worlds. Preparation is simple (no need for coordination on interleaving and chunking) and playback is simple too, because you can easily re-construct at the client something well-enough interleaved at the input to an existing media pipeline.