[Ogg a11y] architecture for video and accessibility on the Web
silviapfeiffer1 at gmail.com
Tue Oct 28 18:31:57 PDT 2008
It has been a bit quiet here recently. Probably mainly because I have
not thrown any strange suggestions in your direction yet. :-)
The main reason why I was so quiet is that I was learning about other
existing standards (TimedText, SMIL, sub, etc.) and wondering how
they all fit together.
I have also attended the recent W3C working group meeting and had many
discussions with people one-on-one there.
One thing is confirmed at this stage: the list of requirements that we
made at https://wiki.mozilla.org/Accessibility/Video_a11y_requirements
seems to be fairly complete. This is good, so now we can move on.
So, today I would like to propose an architectural idea for how to
deal with video, audio, and annotations (in the greater sense of the
word) for an accessible Web.
Let me start by putting down a few side conditions, so we all think
along similar lines. If you disagree with any of these, let's
discuss, because they have a strong impact on the solution.
* Our focus for video (and audio) accessibility here stems from a need
for the Web; accessibility will also be required in applications
other than the Web Browser, so some thought should also be spent on
how to enable offline video accessibility without much loss of
functionality (e.g. reduced styling capabilities)
* Our focus is on getting accessibility to work with HTML5's video and
audio elements; this means that the way in which accessibility
features are exposed to Web Browsers should be similar to the way they
currently work with HTML4 pages
* Also, Web Browsers already implement a lot of functionality for
styling text; we should try to re-use that functionality without
enforcing a complex styling model on offline video applications
* Audio and video files may be found anywhere on the Web and may
contain annotation tracks (such as captions, subtitles or audio
descriptions - let's call them "text codecs"); alternatively, these
annotation tracks may be stored in companion resources (files)
that may reside on the same server or on a different server
* There will be Web services that multiplex audio/video files
together with their annotation tracks (as text codecs) to make more
self-contained media resources that can be stored to disk, viewed
offline, and shared with friends
* Web Browsers may receive annotation tracks for media resources
either within the resource or from a different resource, possibly
even from a different server; they have to cope with both situations
and be able to play back multiplexed and companion resources. This
could, however, be hidden from the Web Browser through the media
framework. [I am not 100% sure about this decision - it would be
easier, but less flexible, to just focus on multiplexed resources.]
* There will always be a multitude of media formats and codecs to deal
with - whatever scheme we develop for dealing with annotations has to
work across different encapsulation formats and services, though Ogg
is the main target for our solution here
* There will also be a need to keep innovation potential open for
subtitles, captions, audio annotations and similar schemes, so we
cannot settle outright on one particular format but should rather
deal with an abstract model of time-aligned annotations.
* The Web Browser needs to know what type of tracks it is dealing with
and be able to select them for display or route them to the right
device, depending on Browser settings provided by the user.
So, given these conditions, here are some thoughts on architecture:
* The User generally uses online video and audio for one of two purposes:
* Download requires the Browser to provide one file of multiplexed
video or audio with text codecs.
* Playback can work via one connection to a multiplexed media
resource, or via multiple connections to the media data and the text
data separately. Assuming the user can only provide one URL for the
media resource, the latter case would need a URL to a resource
description (such as SMIL or ROE) through which the player can then
re-issue requests for the several separate resources. I am not sure
this is desirable and would like to discuss this.
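To make this concrete, such a resource description might look roughly
like the following - the element and attribute names here are made up
purely for illustration and are not actual SMIL or ROE syntax:

<description>
  <track type="video" src="http://example.com/talk.ogv"/>
  <track type="captions" lang="en"
         src="http://example.com/captions.en.xml"/>
  <track type="audiodescription" lang="en"
         src="http://another.example.com/ad.en.ogg"/>
</description>

The player would fetch this description first and then open one
connection per listed track.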
* For the multiplexed media resource case, one can imagine having
partial resources available from one or more Web servers, plus a Web
service that can combine them into a multiplexed media resource based
on the demands of the Browser's request, which in turn is based on
the user's preferences.
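For illustration, a Browser's request to such a service could look
something like this (the service name and parameters are entirely
made up):

http://example.com/mux?video=talk.ogv&captions=captions.en.xml

The service would fetch the listed partial resources, multiplex the
text codecs into the media stream, and return one self-contained
file matching the user's preference settings.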
* In either case, the media decoding subsystem of the Browser will
need to hand audio, video and text data over to the Browser. If the
text data is already in some XML format, the Browser's internal XML
parser could be re-used to create a DOM in a nested browsing context
for the Web page that the video or audio tag is part of. It would in
general be easier and more flexible to provide a nested browsing
context DOM for each text codec of a media resource rather than
defining a new interface for this purpose.
* A text codec is then an XML format following a standard structure
that can be temporally multiplexed into a media resource. An example
would be:
<head> ... </head>
<div start="t1" end="t2"> ... </div>
<div start="tx" end="ty"> ... </div>
<div start="tz1" end="tz2"> ... </div>
Further tags in the head and in the divs could then be defined freely,
but the file would still generically map into a media resource by
taking the <head> as header data and each <div> as a codec packet.
* Since it's XML (or maybe, even better: HTML-like), the HTML means of
attaching style information to elements can also be applied to these
elements and provide styling commands to the Browser.
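For example (the class name and style rules are made up purely for
illustration), a caption <div> could carry a class attribute that a
stylesheet then addresses like any other element:

<div start="t1" end="t2" class="caption">Hello world!</div>

.caption { color: white; background: black; font-family: sans-serif; }

This way the Browser's existing CSS engine does all the styling work,
while an offline player without CSS support can simply ignore the
attribute.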
* We would further define a media mapping of such a resource into Ogg
and implement this in a little library such that it can be used to
create multiplexed media resources for the download case.
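As a rough sketch of the core mapping such a library would perform -
this assumes the <head> and <div> elements are wrapped in a single
root element (as well-formed XML requires) and that start/end times
are given in seconds; the actual Ogg page encapsulation (granule
positions, page boundaries) is left out:

import xml.etree.ElementTree as ET

def text_codec_to_packets(xml_source):
    """Split a text codec file into header data and timed packets."""
    root = ET.parse(xml_source).getroot()

    # The <head> element becomes the codec's header packet.
    head = root.find('head')
    header = ET.tostring(head) if head is not None else b''

    # Each <div> becomes one codec packet; its start/end attributes
    # determine where it is interleaved into the media stream.
    packets = []
    for div in root.findall('div'):
        packets.append((float(div.get('start')),
                        float(div.get('end')),
                        ET.tostring(div)))
    return header, packets

A muxer built on top of this would then place each packet into the
Ogg stream at a position corresponding to its start time.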
* Having defined this generic interface for text codecs, it will be
easy afterwards to write an XSLT (XML transformation) that can take
e.g. a 3GPP TimedText file or a CMML file - or, with a small parser
in front, even a non-XML srt file - and convert it into the text
codec format, map it into Ogg, and allow the Browser to create a
nested browsing context DOM for it that can be accessed by
accessibility devices and by the user.
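As a sketch of such a transformation for 3GPP TimedText / DFXP input
- the tt namespace shown is the 2006 DFXP draft namespace, and the
<text> root element name is made up to match the example structure
above:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tt="http://www.w3.org/2006/10/ttaf1">
  <!-- Map each timed paragraph of the TimedText file onto a
       text codec <div>, keeping its begin/end times. -->
  <xsl:template match="/tt:tt">
    <text>
      <head>
        <xsl:copy-of select="tt:head/node()"/>
      </head>
      <xsl:for-each select="tt:body//tt:p">
        <div start="{@begin}" end="{@end}">
          <xsl:value-of select="."/>
        </div>
      </xsl:for-each>
    </text>
  </xsl:template>
</xsl:stylesheet>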
* I would further offer to get involved myself in the newly re-opened
TimedText Working Group at the W3C to work out the best XML format
template for text codecs.
I would really, really love to get some feedback on these ideas. What
do you like about it, and what do you dislike? What do you think is
impossible to do? Why will it work or not work?
And please don't hesitate to ask for clarifications where I am being
unclear. I will try to put this all into a wiki page for starters.