[theora] Http adaptive streaming for html5

Tue Dec 22 23:51:18 PST 2009

On Mon, Dec 21, 2009 at 2:49 PM, Michael Dale <mdale at wikimedia.org> wrote:
> Conrad Parker wrote:
>> Right, so I think the main thing to spec out first is the use cases. I
>> guess from Michael's point of view, the content site is both in charge
>> of the videos and the html, so it's ok for the web page to list all
>> the optional sources.
>>
>> However if the web page and the content are from different places (eg.
>> a random blog article embedding a youtube/tinyvid/whatever video) then
>> the blog author can't be expected to know all the options or be
>> authoritative for them, and it makes more sense to simply link to a
>> single media type which then describes the options.
>>
>> I think it's useful for people to be able to take video that is
>> published under the first scenario and link to it externally -- to
>> simply be able to take the contents of <video src="..."/> and copy
>> that into their blog. I think listing all the options in the markup
>> would make that difficult; I think the list of options should be under
>> the control of the video host, and they should be able to change them
>> over time (eg. offering more versions for more popular videos).
>>
>> Conrad.
>>
>
> True, same goes for subtitles which is why ROE might be nice since it
> takes timed text into consideration as well.
>
> Silvia did a good post outlining the major solutions to this problem of
> exposing the structure of a composite media source:
> http://blog.gingertech.net/2009/11/25/manifests-exposing-structure-of-a-composite-media-resource/
>
> michael

Not everyone is sold on that, but it is indeed a discussion that we
will need to continue to have at the WHATWG/W3C.

Also, I agree with Michael that we need to simplify ROE in that we
need to remove the "sequence" support. There will be a discussion
about SMIL vs ROE at some point and I really don't want SMIL or
anything similarly complex and unimplementable.

The multitrack audio and video discussion has started happening at
W3C/WHATWG (can't remember which group it was now), but I have seen
huge push-back from browser developers, in particular where the files
come from different servers. It seems it's just too complex at this
point in time.

Also, we need to be careful about mixing too many things into ROE: I
would advise against doing dynamically switched content *and*
specification of stream composition (text tracks and how they relate
to each other and to the a/v) through ROE. I am continuing to think
hard about what could be a solution for accessibility for HTML5 video,
because there are so many interests pushing in different directions. I
only know it has to be done in a really simple way, otherwise we won't
get it implemented.

As for dynamically switched content: What speaks against using Apple's
live streaming draft
(http://tools.ietf.org/html/draft-pantos-http-live-streaming-01)? Are
the client implementation needs in that proposal reasonable? Is that
something Mozilla might consider implementing? I must say that
personally I don't like the proposal because it piggybacks onto a
playlist specification. Playlists are sequences of separate resources.
Dynamically switched content, however, really relates only to a single
resource, but composed of segments at different bandwidths.
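
For illustration, the playlist-based approach can be sketched as follows.
The draft describes a variant playlist whose EXT-X-STREAM-INF tags carry a
BANDWIDTH attribute, each followed by the URI of one encoding. The playlist
contents, URIs and the two helper functions below are hypothetical and only
show the kind of work the client would have to do; this is a rough sketch,
not a conforming implementation of the draft.

```python
import re

# Hypothetical variant playlist; the URIs and bandwidth figures are made up.
VARIANT_PLAYLIST = """#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=200000
low/index.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=800000
mid/index.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2000000
high/index.m3u8
"""


def parse_variants(text):
    """Return a list of (bandwidth_in_bits_per_second, uri) pairs."""
    variants = []
    pending_bandwidth = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-STREAM-INF:"):
            # Remember the bandwidth until the URI line that follows the tag.
            match = re.search(r"BANDWIDTH=(\d+)", line)
            pending_bandwidth = int(match.group(1)) if match else None
        elif line and not line.startswith("#") and pending_bandwidth is not None:
            variants.append((pending_bandwidth, line))
            pending_bandwidth = None
    return variants


def choose_variant(variants, measured_bps):
    """Pick the highest-bandwidth variant that fits, else the lowest one."""
    fitting = [v for v in variants if v[0] <= measured_bps]
    if fitting:
        return max(fitting, key=lambda v: v[0])
    return min(variants, key=lambda v: v[0])
```

The client re-runs choose_variant whenever its bandwidth estimate changes,
which is exactly the switching logic that ends up living in a playlist
format.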

I guess, thinking about this pragmatically: it is important to keep
the sequential switching stuff separate from the description of
parallel tracks and to keep resources separate. And all of this is
different to what the <source> elements are today: alternative
resources. Further, I would rather not have it all specified
in a single file - because that would start looking like SMIL, mix all
these dimensions together and make it possible to overlap them, which
is where the complexity comes in.

So, here are some current thoughts of mine:

I think the sequential switching stuff should as much as possible be
hidden away from the client and be a server-side thing, exposed to the
Web as a single resource. Other than telling the server what bandwidth
it is currently receiving, the client should not need to do anything,
IMHO. Not sure how that can be done other than something like what
Apple proposed.
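
A minimal sketch of that server-side idea, under stated assumptions: the
class name, the directory layout and the segment naming are all
hypothetical, and the sketch ignores everything a real implementation
would need at the container level to make the switched segments play back
as one stream. It only shows the division of labour: the client reports
bandwidth, the server decides which encoding the next segment comes from.

```python
class SwitchedStream:
    """One logical resource backed by several encodings of the same segments."""

    def __init__(self, encodings, segment_count):
        # encodings maps bits-per-second to a directory name (assumed layout).
        self.encodings = dict(encodings)
        self.segment_count = segment_count
        self.position = 0
        # Start with the lowest bitrate until the client reports otherwise.
        self.current_bps = min(self.encodings)

    def report_bandwidth(self, measured_bps):
        """The only client input: the bandwidth it is currently receiving."""
        fitting = [bps for bps in self.encodings if bps <= measured_bps]
        self.current_bps = max(fitting) if fitting else min(self.encodings)

    def next_segment(self):
        """Return the path of the next segment, or None at the end."""
        if self.position >= self.segment_count:
            return None
        directory = self.encodings[self.current_bps]
        path = "%s/segment%03d.ogv" % (directory, self.position)
        self.position += 1
        return path
```

From the client's side there is still only one resource and one playback
position; which encoding each segment came from is never exposed.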

I think the track composition description should be part of HTML5,
because it describes what the resource is made up of and allows the
client to expose the tracks without needing to download anything. HTML
is good for describing external resources, such as images, objects,
stylesheets, scripts etc. And it should not need to download the media
file or anything else to simply tell the user what is available.

I also think we need a playlist file format for HTML5. It should be
acceptable as a resource sequence for the video or audio element -
i.e. a playlist should be either an audio playlist or a video playlist,
but not mixed. Also, I think it would be preferable if all the videos
in the playlist were encoded with the same codec, i.e. all Ogg
Theora/Vorbis or all MP4 H.264/AAC. Further, it would be preferable if
all the videos in a playlist had the same track composition. But this
is where it becomes really difficult and unrealistic. So, I worry
about how to expose the track composition of videos for a playlist.
I wouldn't really want to load an XSPF playlist that requires a ROE file
for each video to be loaded to understand what's in the resource. That
would mean two loads per video. Maybe the playlist functionality needs
to become a native part of HTML5, too?
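
The playlist constraints suggested above (audio or video but not mixed,
and one codec combination throughout) could be checked along these lines.
This is only a sketch: the Entry type, its field names and the wording of
the messages are assumptions made for illustration, not part of any
playlist format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    uri: str
    kind: str      # "audio" or "video"
    codecs: tuple  # e.g. ("theora", "vorbis")


def playlist_problems(entries):
    """Return a list of human-readable problems; empty if the playlist is uniform."""
    problems = []
    kinds = {e.kind for e in entries}
    if len(kinds) > 1:
        problems.append("mixed audio and video entries")
    # Sort each codec tuple so that listing order does not matter.
    codec_sets = {tuple(sorted(e.codecs)) for e in entries}
    if len(codec_sets) > 1:
        problems.append("entries use different codecs")
    return problems
```

Track composition is deliberately left out of the check, since that is
the part that would need extra information per video.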

Cheers,
Silvia.