[theora] Http adaptive streaming for html5

Mon Dec 28 09:50:46 PST 2009

Silvia Pfeiffer wrote:
> On Mon, Dec 21, 2009 at 2:49 PM, Michael Dale <mdale at wikimedia.org> wrote:
>> Conrad Parker wrote:
[snip]
>>
>> Silvia did a good post outlining the major solutions to this problem of
>> exposing the structure of a composite media source:
>> http://blog.gingertech.net/2009/11/25/manifests-exposing-structure-of-a-composite-media-resource/

Hello all,

  [ I did quite a lot of work on this problem for OTT video delivery
for an STB manufacturer a couple of years ago and I think we have
customers actually using the feature, so I'll wade in :-) ]

  Indeed; good post!

>>
>> michael
> 
> Not everyone is sold on that, but it is indeed a discussion that we
> will need to continue to have at the WHATWG/W3C.
> 
> Also, I agree with Michael that we need to simplify ROE in that we
> need to remove the "sequence" support. There will be a discussion
> about SMIL vs ROE at some point and I really don't want SMIL or
> anything similarly complex and unimplementable.

  My experience with trying to do this in Windows Media land is that
you can't. I think your aims are laudable (I once shared them), but
structured video support is in fact a crawling horror and I think
you just have to live with it.

  MS's solution to this, which is actually the least worst I've come
across, is to allow you to specify a structured video file as any
video resource.

  That structured video file then specifies a fairly traditional DAG
of video resources (in fact, MS's is a tree but there's no need for
it to be), annotated with various information to help you choose
which of the alternates to play.
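
  As a rough illustration of that structure (not ASX itself, and not
any real player's API), here is a minimal Python sketch of a playlist
graph with annotated alternates. The Node class, the "kind" values,
the bitrate_kbps annotation and the example.invalid URIs are all
assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str = "leaf"                  # "leaf", "sequence" or "alternates"
    uri: Optional[str] = None           # set on leaves only
    bitrate_kbps: Optional[int] = None  # annotation used to pick alternates
    children: List["Node"] = field(default_factory=list)


def resolve(node: Node, available_kbps: int) -> List[str]:
    """Flatten the graph into the list of URIs to play, in order."""
    if node.kind == "leaf":
        return [node.uri]
    if node.kind == "sequence":
        uris = []
        for child in node.children:
            uris.extend(resolve(child, available_kbps))
        return uris
    if node.kind == "alternates":
        # Prefer the best alternate that fits; fall back to the cheapest.
        fitting = [c for c in node.children if c.bitrate_kbps <= available_kbps]
        if fitting:
            chosen = max(fitting, key=lambda c: c.bitrate_kbps)
        else:
            chosen = min(node.children, key=lambda c: c.bitrate_kbps)
        return resolve(chosen, available_kbps)
    raise ValueError(f"unknown node kind: {node.kind}")


# The same advert node is referenced twice, which is what makes this a
# DAG rather than a tree; a client could buffer it once and reuse it.
advert = Node(uri="http://example.invalid/advert.ogv")
feature = Node(kind="alternates", children=[
    Node(uri="http://example.invalid/feature-2000k.ogv", bitrate_kbps=2000),
    Node(uri="http://example.invalid/feature-500k.ogv", bitrate_kbps=500),
])
playlist = Node(kind="sequence", children=[advert, feature, advert])
```

  Calling resolve(playlist, 800) walks the sequence and picks the 500k
alternate for the feature; a real client would use richer annotations
than a single bitrate.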

  ASX (the format I was using) is a ghastly nightmare of a format,
but the approach is, I think, broadly correct. It has a few nice
properties:

  * It allows the client to discover just enough about a video resource
    to play it.

  * It allows the client to buffer video segment switches across
    trickplay transitions (especially reverse trickplay)

  * It allows servers to load-balance or do ad carousel substitutions
    by forcing clients to re-request sub-playlists.

  * If you use a DAG with global URIs, clients can reuse previously
    buffered objects multiple times (so: only store advert X once, but
    play it in every YouTube stream)

  * The playlist has a fairly conventional DOM which can be walked by
    Javascript to influence the media player - so you can add
    application-specific tags and URLs for particular web apps.

  * The informal separation is that:

     -> The browser reads the playlist.
     -> The video player reads the metadata in the video resource.
     -> The video decoder decodes the video.

    Video information flows down that list, and user actions
    (pause, play, got to the end of a file) up it, so:

     * The browser decides which streams to play.
     * The video player decides which codecs it needs to play them.
     * The decoder decodes the video elements.

     * The user hits 'pause'.
     * Decoder pauses, notifies player.
     * Player pauses fetch. Tells browser.
     * Browser fires its JS events.
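
    The pause path above can be sketched in Python as three layers
    that each notify the one above. The class and method names are
    hypothetical and only illustrate the direction of notification;
    no real browser or player is structured exactly like this.

```python
class Browser:
    """Top layer: owns the playlist and fires script-visible events."""

    def __init__(self):
        self.js_events = []

    def on_player_event(self, name):
        # Stand-in for dispatching a DOM event to page scripts.
        self.js_events.append(name)


class Player:
    """Middle layer: fetches data and reads container metadata."""

    def __init__(self, browser):
        self.browser = browser
        self.fetching = True

    def on_decoder_event(self, name):
        if name == "pause":
            self.fetching = False   # stop pulling data while paused
        elif name == "play":
            self.fetching = True
        self.browser.on_player_event(name)


class Decoder:
    """Bottom layer: decodes frames and sees the user action first."""

    def __init__(self, player):
        self.player = player
        self.paused = False

    def user_pause(self):
        self.paused = True
        self.player.on_decoder_event("pause")

    def user_play(self):
        self.paused = False
        self.player.on_decoder_event("play")


browser = Browser()
player = Player(browser)
decoder = Decoder(player)
decoder.user_pause()   # decoder pauses, player stops fetching, browser logs "pause"
```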

  The client-side processing for these playlists is _horrible_, but
it is at least possible, in a way that it simply isn't with many other
systems.

  Another thing you will want to do is to allow the _server_ to
switch streams mid-delivery - this is a matter for the transport
property, but could possibly be done by returning a header such as

X-Stream-URI: < .. .>

  As with all TCP streaming, for acceptable quality, it is absolutely
vital to stream at exactly the right bitrate lest you be bitten
by TCP's poor reaction to even momentary congestion. Some routers
can be very unforgiving: you need both time of flight and bulk data
measurements (indeed, smoothed bulk data measurements) to keep your
buffers sized correctly - too much buffer and your channel change times
go through the roof, too little and playback jerks.
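
  One way to picture "smoothed" measurements is an exponentially
weighted moving average over both round-trip and bulk-transfer
samples. The Python sketch below is an assumption-laden illustration,
not the scheme actually used on the STB: the LinkEstimator class, the
0.8 safety factor and the "four round trips of buffer" rule are
invented for the example, and alpha=0.125 is simply borrowed from
TCP's own RTT smoothing.

```python
class LinkEstimator:
    def __init__(self, alpha=0.125):
        self.alpha = alpha
        self.rtt_s = None            # smoothed time of flight, seconds
        self.throughput_bps = None   # smoothed bulk throughput, bits/s

    def _smooth(self, old, new):
        # First sample seeds the average; later samples are blended in.
        if old is None:
            return new
        return (1 - self.alpha) * old + self.alpha * new

    def add_rtt_sample(self, rtt_s):
        self.rtt_s = self._smooth(self.rtt_s, rtt_s)

    def add_transfer_sample(self, nbytes, seconds):
        self.throughput_bps = self._smooth(
            self.throughput_bps, 8 * nbytes / seconds)

    def pick_bitrate(self, ladder_bps, safety=0.8):
        """Highest offered bitrate under a fraction of measured throughput."""
        budget = self.throughput_bps * safety
        fitting = [b for b in ladder_bps if b <= budget]
        return max(fitting) if fitting else min(ladder_bps)

    def buffer_bytes(self, bitrate_bps, rtts_of_cover=4):
        """Buffer enough data to ride out a few round trips, and no more."""
        return int(bitrate_bps / 8 * self.rtt_s * rtts_of_cover)
```

  The trade-off in the paragraph above shows up in rtts_of_cover: a
larger value hides more congestion but lengthens channel changes.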

> 
> The multitrack audio and video discussion has started happening at
> W3C/WHATWG (can't remember which group it was now), but I have seen
> huge push-back from browser developers, in particular where the files
> come from different servers. It seems it's just too complex at this
> point in time.

  Tell them to try harder :-). It's a bit nasty, but by no means
impossible if you think about the problem in a fairly structured way.
Particularly if you have both CPU and memory, which modern browser
developers do.

> 
> Also, we need to be careful about mixing too many things into ROE: I
> would advise against doing dynamically switched content *and*
> specification of stream composition (text tracks and how they relate
> to each other and to the a/v) through ROE 

  So would I, but I think it's unavoidable. Apart from anything else,
if I'm going to have my video communicate with my HTML, the only sane
way to do it is to put an 'event track' on the video and I am going
to need to know which event track goes with which bit of my video.
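
  To make the 'event track' idea concrete, here is a small Python
sketch of a timed event list that a player could poll as the media
clock advances. The EventTrack class, its due() method and the event
names are hypothetical; nothing here is part of HTML5 or ROE.

```python
import bisect


class EventTrack:
    """Timed events attached to one piece of video."""

    def __init__(self, events):
        # events: iterable of (time_in_seconds, name) pairs
        self.events = sorted(events)
        self.times = [t for t, _ in self.events]

    def due(self, prev_s, now_s):
        """Names of events with prev_s < time <= now_s, in time order."""
        lo = bisect.bisect_right(self.times, prev_s)
        hi = bisect.bisect_right(self.times, now_s)
        return [name for _, name in self.events[lo:hi]]


track = EventTrack([(5.0, "show-caption"), (1.0, "ad-start"), (3.0, "ad-end")])
fired = track.due(0.0, 3.0)   # the two advert events, in time order
```

  Because the player drives due() from its own clock, the timing
stays with the component that actually knows the playback position.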

  If HTML5 doesn't specify a simple way to do this, people will
simply do it in Javascript and that is a recipe for (a) disaster,
and (b) immense webpages - apart from anything else, much of this
discussion is about getting your timing right, and timing is one thing
Javascript just does not do.

  (on which note, you will notice that getting video-embedded events
to fire Javascript events in a useful way is ghastly in several
dimensions :-))

>  I am continuing to think
> hard about what could be a solution for accessibility for HTML5 video,
> because there are so many interests pushing in different directions. I
> only know it has to be done in a really simple way, otherwise we won't
> get it implemented.

  I think you should be able to get away with most of your accessibility
in audio/subtitle tracks and stick anything else in event tracks and
delegate the rest to JS?

> 
> As for dynamically switched content: What speaks against using Apple's
> live streaming draft
> (http://tools.ietf.org/html/draft-pantos-http-live-streaming-01)?

  My first objection is the $2.50/unit I need to pay to the MPEG-LA
for the MPEG-2 Systems licence to package my video as TS/PS.

  My second is that many hardware demuxes find it extremely
difficult to cope with PATs and PMTs following each other directly
and with them immediately preceding an I-frame. It's going to make
life hard for STBs and anyone else who doesn't have the computing
power to do everything in software. In practice, you will get a
frame skip every time you go over a file transition.

  My third is that it introduces yet another file format to a program
that really doesn't need one (and it's not visible to JS either).
What's wrong with XML?

  Section 6.1, dividing streams into trickplayable segments - bear in mind
that it is an extremely difficult problem for H.264 streams. Best
left to the client (or supply a separate track indicating the trickplay
points) - since the client almost by definition has the code and the
server probably doesn't.
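
  If a separate track of trickplay points were supplied, the client
side of it could be as small as the Python sketch below. The function
name and the idea that the track is a sorted list of timestamps in
seconds are assumptions for illustration only; the draft does not
define such a track.

```python
import bisect


def next_trick_point(points_s, position_s, reverse=False):
    """Return the next trickplay point strictly after position_s, or the
    previous one strictly before it when reverse is True; None at the ends.

    points_s must be sorted in ascending order.
    """
    if reverse:
        i = bisect.bisect_left(points_s, position_s) - 1
    else:
        i = bisect.bisect_right(points_s, position_s)
    if 0 <= i < len(points_s):
        return points_s[i]
    return None


points = [0.0, 2.0, 4.0, 6.0]
forward = next_trick_point(points, 3.0)                  # 4.0
backward = next_trick_point(points, 3.0, reverse=True)   # 2.0
```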

  How do you name your alternate playlists?

  Other than that, it doesn't seem offensively bad. I'd invent
something else if it became a standard though.

[snip]
> I also think we need a playlist file format for HTML5. It should be
> acceptable as a resource sequence for the video or audio element -
> i.e. a playlist should be either a audio playlist or a video playlist,
> but not mixed. Also, I think it would be preferable if all the videos
> in the playlist were encoded with the same codec, i.e. all Ogg
> Theora/Vorbis or all MP4 H.264/AAC. Further, it would be preferable if
> all the videos in a playlist had the same track composition. But this
> is where it becomes really difficult and unrealistic. So, I worry
> about how to expose the track composition of videos for a playlist.
> Wouldn't really want to load a xspf playlist that requires a ROE file
> for each video to be loaded to understand what's in the resource. That
> would be a double loading need. Maybe the playlist functionality needs
> to become a native part of HTML5, too?

  Again, laudable aims, but given the multiplicity of video formats out
there and what people will actually do with them I seriously doubt that
anyone will keep to that kind of a spec - the extensions will then
become de-facto standards and you might as well not have burdened them
with the original standard.

  Um, anyway that was probably a bit vague of me - sorry if it was -
do yell if there's anything anyone would like to know,

Richard.