[theora] HTTP adaptive streaming for HTML5

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Tue Dec 29 04:43:24 PST 2009

Hi Richard,

It would be great to get further input into the discussion from you -
any suggestions for solutions (or additions to the list of issues)
would be very welcome!

On Tue, Dec 29, 2009 at 4:50 AM, Richard Watts <rrw at kynesim.co.uk> wrote:
> Silvia Pfeiffer wrote:
>> On Mon, Dec 21, 2009 at 2:49 PM, Michael Dale <mdale at wikimedia.org> wrote:
>>> Conrad Parker wrote:
> [snip]
>>> Silvia did a good post outlining the major solutions to this problem of
>>> exposing the structure of a composite media source:
>>> http://blog.gingertech.net/2009/11/25/manifests-exposing-structure-of-a-composite-media-resource/
> Hello all,
>  [ I did quite a lot of work on this problem for OTT video delivery
> for an STB manufacturer a couple of years ago and I think we have
> customers actually using the feature, so I'll wade in :-) ]
>  Indeed; good post!
>>> michael
>> Not everyone is sold on that, but it is indeed a discussion that we
>> will need to continue to have at the WHATWG/W3C.
>> Also, I agree with Michael that we need to simplify ROE in that we
>> need to remove the "sequence" support. There will be a discussion
>> about SMIL vs ROE at some point and I really don't want SMIL or
>> anything similarly complex and unimplementable.
>  My experience with trying to do this in Windows Media land is that
> you can't. I think your aims are laudable (I once shared them), but
> structured video support is in fact a crawling horror and I think
> you just have to live with it.
>  MS's solution to this, which is actually the least worst I've come
> across, is to allow you to specify a structured video file as any
> video resource.

Are you talking about the Smooth Streaming solution for Silverlight
and IIS7? I actually quite like it too, since it doesn't require
changes to the way HTTP works and servers only need to add an
additional description file. Those SMIL-like files (ism files, see
http://msdn.microsoft.com/en-us/library/ee230810.aspx) look sane.
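
To illustrate, a client that received such a manifest could pick a
bitrate with a few lines of script. A rough sketch only - the element
and attribute names (QualityLevel, Bitrate) are my guess at the client
manifest schema rather than quoted from the spec:

    // Pick the highest advertised bitrate that fits our bandwidth
    // budget. Assumed layout: a manifest containing
    // <QualityLevel Bitrate="..."/> elements.
    function pickBitrate(manifestText, maxBitrate) {
      var doc = new DOMParser().parseFromString(manifestText, "text/xml");
      var levels = doc.getElementsByTagName("QualityLevel");
      var best = null;
      for (var i = 0; i < levels.length; i++) {
        var rate = parseInt(levels[i].getAttribute("Bitrate"), 10);
        if (rate <= maxBitrate && (best === null || rate > best)) {
          best = rate;
        }
      }
      return best; // null if nothing fits under the budget
    }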

>  That structured video file then specifies a fairly traditional DAG
> of video resources (in fact, MS's is a tree but there's no need for
> it to be), annotated with various information to help you choose
> which of the alternates to play.

DAG as in a directed acyclic graph?

>  ASX (the format I was using) is a ghastly nightmare of a format,
> but the approach is, I think, broadly correct.

Smooth Streaming seems to use the MPEG-4 file format as its basis
(as does Apple's approach). Are you talking about an older
specification of MS's?

> It has a few nice
> properties:
>  * It allows the client to discover just enough about a video resource
>   to play it.
>  * It allows the client to buffer video segment switches across
>   trickplay transitions (especially reverse trickplay)
>  * It allows servers to load-balance or do ad carousel substitutions
>    by forcing clients to re-request sub-playlists.
>  * If you use a DAG with global URIs, clients can reuse previously
>   buffered objects multiple times (so: only store advert X once, but
>    play it in every YouTube stream)
>  * The playlist has a fairly conventional DOM which can be walked by
>   Javascript to influence the media player - so you can add
>   application-specific tags and URLs for particular web apps.
>  * The informal separation is that:
>    -> The browser reads the playlist.
>    -> The video player reads the metadata in the video resource.
>    -> The video decoder decodes the video.
>   Video information flows down that list, and user actions
>   (pause, play, getting to the end of a file) up it, so:
>    * The browser decides which streams to play.
>    * The video player decides which codecs it needs to play them.
>    * The decoder decodes the video elements.
>    * The user hits 'pause'.
>    * Decoder pauses, notifies player.
>    * Player pauses fetch. Tells browser.
>    * Browser fires its JS events.
>  The client-side processing for these playlists is _horrible_, but
> it is at least possible, in a way that it simply isn't with many other
> systems.
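
That "walkable DOM" property seems to be the crucial one. For
concreteness, a minimal sketch of script choosing among annotated
alternates - the tag and attribute names here are invented for
illustration, not taken from ASX:

    // Assumed playlist fragment: an <entry> whose <alt> children are
    // sorted by ascending "bitrate"; pick the best one that fits.
    function chooseAlternate(entry, measuredBitrate) {
      var alts = entry.getElementsByTagName("alt");
      if (alts.length === 0) return null;
      var chosen = alts[0]; // lowest bitrate as the fallback
      for (var i = 1; i < alts.length; i++) {
        var rate = parseInt(alts[i].getAttribute("bitrate"), 10);
        if (rate <= measuredBitrate) chosen = alts[i];
      }
      return chosen.getAttribute("src");
    }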
>  Another thing you will want to do is to allow the _server_ to
> switch streams mid-delivery - this is a matter for the transport
> property, but could possibly be done by returning an
> X-Stream-URI: <...>
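
Client handling of such a server-initiated switch could be as simple
as checking each response - a sketch only, since the X-Stream-URI
header is itself hypothetical:

    // After each segment request, see whether the server asked us
    // to move the remainder of the stream to a different URI.
    function fetchSegment(segmentUri, onData, onSwitch) {
      var xhr = new XMLHttpRequest();
      xhr.open("GET", segmentUri, true);
      xhr.onreadystatechange = function () {
        if (xhr.readyState !== 4) return;
        var redirect = xhr.getResponseHeader("X-Stream-URI");
        if (redirect) onSwitch(redirect); // switch future requests over
        onData(xhr.responseText);
      };
      xhr.send(null);
    }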
>  As with all TCP streaming, for acceptable quality, it is absolutely
> vital to stream at exactly the right bitrate lest you be bitten
> by TCP's poor reaction to even momentary congestion. Some routers
> can be very unforgiving: you need both time of flight and bulk data
> measurements (indeed, smoothed bulk data measurements) to keep your
> buffers sized correctly - too much buffer and your channel change times
> go through the roof, too little and you jerk.

Yeah, I think that has been the problem with traditional bitrate-based
switching approaches over HTTP.
Also, I am worried about the decoder setup time for the different
bitrate-encoded web resources - loading each for the first time will
create quite an overhead; subsequent loads should be faster.
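
On the measurement side, the smoothing Richard mentions could be as
little as an exponentially weighted moving average over per-segment
throughput samples - a sketch, with an arbitrarily chosen smoothing
factor:

    // Blend each new per-segment throughput sample into a running
    // estimate, so one congested segment doesn't immediately force
    // a downswitch.
    var ALPHA = 0.2; // weight of the newest sample (picked arbitrarily)
    var smoothedBps = 0;
    function recordSample(bytes, millis) {
      var sampleBps = (bytes * 8 * 1000) / millis;
      smoothedBps = smoothedBps === 0
          ? sampleBps
          : ALPHA * sampleBps + (1 - ALPHA) * smoothedBps;
      return smoothedBps; // feed this into buffer sizing / switching
    }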

>> The multitrack audio and video discussion has started happening at
>> W3C/WHATWG (can't remember which group it was now), but I have seen
>> huge push-back from browser developers, in particular where the files
>> come from different servers. It seems it's just too complex at this
>> point in time.
>  Tell them to try harder :-). It's a bit nasty, but by no means
> impossible if you think about the problem in a fairly structured way.
> Particularly if you have both CPU and memory, which modern browser
> developers do.

There are at least two developers from browser vendors on this list. :)

>> Also, we need to be careful about mixing too many things into ROE: I
>> would advise against doing dynamically switched content *and*
>> specification of stream composition (text tracks and how they relate
>> to each other and to the a/v) through ROE
>  So would I, but I think it's unavoidable. Apart from anything else,
> if I'm going to have my video communicate with my HTML, the only sane
> way to do it is to put an 'event track' on the video and I am going
> to need to know which event track goes with which bit of my video.

What do you mean by an "event track"? We have no such thing in Ogg.

>  If HTML5 doesn't specify a simple way to do this, people will
> simply do it in Javascript and that is a recipe for (a) disaster,
> and (b) immense webpages - apart from anything else, much of this
> discussion is about getting your timing right, and timing is one thing
> Javascript just does not do.
>  (on which note, you will notice that getting video-embedded events
> to fire Javascript events in a useful way is ghastly in several
> dimensions :-))
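
The script-only fallback Richard warns about would look roughly like
the following, and its weakness is visible immediately: "timeupdate"
fires only a few times per second, so cue timing is at the mercy of
the browser. (The cue list format here is made up for illustration.)

    // Approximate an event track in script by watching currentTime.
    // cues: [{time: seconds, data: ...}], sorted by time.
    // (Ignores seeking backwards, for brevity.)
    function attachCues(video, cues) {
      var next = 0;
      video.addEventListener("timeupdate", function () {
        while (next < cues.length && cues[next].time <= video.currentTime) {
          var ev = document.createEvent("Event");
          ev.initEvent("cue", true, true);
          ev.cueData = cues[next].data;
          video.dispatchEvent(ev); // listeners read ev.cueData
          next++;
        }
      }, false);
    }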
>>  I am continuing to think
>> hard about what could be a solution for accessibility for HTML5 video,
>> because there are so many interests pushing in different directions. I
>> only know it has to be done in a really simple way, otherwise we won't
>> get it implemented.
>  I think you should be able to get away with most of your accessibility
> in audio/subtitle tracks and stick anything else in event tracks and
> delegate the rest to JS?
>> As for dynamically switched content: What speaks against using Apple's
>> live streaming draft
>> (http://tools.ietf.org/html/draft-pantos-http-live-streaming-01)?
>  My first objection is the $2.50/unit I need to pay to the MPEG-LA
> for the MPEG 2 Systems licence to package my video as TS/PS.

Ah yes, that would be a big problem. Let's instead do something that
every codec can follow and that isn't covered by license fees yet.

>  My second is that many hardware demuxers find it extremely
> difficult to cope with PATs and PMTs following each other directly
> and with them immediately preceding an I-frame. It's going to make
> life hard for STBs and anyone else who doesn't have the computing
> power to do everything in software. In practice, you will get a
> frame skip every time you go over a file transition.

Yeah - that's the reason why I think I like the Silverlight Smooth
Streaming approach better.

>  My third is that it introduces yet another file format to a program
> that really doesn't need one (and it's not visible to JS either).
> What's wrong with XML?

M3U? Yeah, I suppose an XML-based one would be nicer, but OTOH M3U
files are simple and thus conversion/parsing isn't hard.
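
For what it's worth, parsing the draft's variant playlists is indeed
straightforward - a sketch keyed off the #EXT-X-STREAM-INF tag from
the draft:

    // Parse an HTTP Live Streaming variant playlist into
    // {bandwidth, uri} pairs.
    function parseVariants(m3u8Text) {
      var lines = m3u8Text.split(/\r?\n/);
      var variants = [];
      var pending = null; // bandwidth from the last #EXT-X-STREAM-INF
      for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        if (line.indexOf("#EXT-X-STREAM-INF:") === 0) {
          var m = line.match(/BANDWIDTH=(\d+)/);
          pending = m ? parseInt(m[1], 10) : 0;
        } else if (pending !== null && line && line.charAt(0) !== "#") {
          variants.push({ bandwidth: pending, uri: line });
          pending = null;
        }
      }
      return variants;
    }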

>  S6.1: dividing streams into trickplayable segments - bear in mind
> that it is an extremely difficult problem for H.264 streams. Best
> left to the client (or supply a separate track indicating the trickplay
> points) - since the client almost by definition has the code and the
> server probably doesn't.
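
A separate trickplay-point track could be as simple as a sorted
time-to-byte-offset index that the client binary-searches before a
seek - the layout below is purely illustrative:

    // index: sorted array of {time: seconds, byteOffset: bytes}.
    // Find the last keyframe at or before the requested time.
    function findTrickplayPoint(index, seconds) {
      var lo = 0, hi = index.length - 1, best = null;
      while (lo <= hi) {
        var mid = (lo + hi) >> 1;
        if (index[mid].time <= seconds) {
          best = index[mid];
          lo = mid + 1;
        } else {
          hi = mid - 1;
        }
      }
      return best; // null if the seek lands before the first keyframe
    }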
>  How do you name your alternate playlists?
>  Other than that, it doesn't seem offensively bad. I'd invent
> something else if it became a standard though.
> [snip]
>> I also think we need a playlist file format for HTML5. It should be
>> acceptable as a resource sequence for the video or audio element -
>> i.e. a playlist should be either an audio playlist or a video playlist,
>> but not mixed. Also, I think it would be preferable if all the videos
>> in the playlist were encoded with the same codec, i.e. all Ogg
>> Theora/Vorbis or all MP4 H.264/AAC. Further, it would be preferable if
>> all the videos in a playlist had the same track composition. But this
>> is where it becomes really difficult and unrealistic. So, I worry
>> about how to expose the track composition of videos for a playlist.
>> Wouldn't really want to load an XSPF playlist that requires a ROE file
>> for each video to be loaded to understand what's in the resource. That
>> would be a double loading need. Maybe the playlist functionality needs
>> to become a native part of HTML5, too?
>  Again, laudable aims, but given the multiplicity of video formats out
> there and what people will actually do with them I seriously doubt that
> anyone will keep to that kind of a spec - the extensions will then
> become de-facto standards and you might as well not have burdened them
> with the original standard.

So, what is your suggestion instead?

