[theora-dev] Extension to Skeleton for multi-track media

Tue Mar 23 17:41:07 PDT 2010

Hi Ben,

Thanks for the feedback.

On Wed, Mar 24, 2010 at 1:19 AM, Benjamin M. Schwartz
<bmschwar at fas.harvard.edu> wrote:
> Silvia Pfeiffer wrote:
>> "Language", "Role" and "Name" are fields that we want to introduce to
>> better expose "semantic" information about the tracks.
>
> These three are great.  Comments:
> 1. It is common for movies to list a series of languages, and it's not
> always the case that one is dominant.  To accommodate this, we should
> permit specifying the Language field multiple times, as allowed in RFC
> 2822.  The Javascript API should return an array of language codes.
> Conventionally, the first language code should be the dominant one if
> present.  A track with no language code should return an empty array.

OK, this might be overkill, but it's obviously possible.
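
To make the repeated-field idea concrete, here is a minimal sketch of how a parser could collect several Language fields into an array. It assumes the track's message header fields are already available as "Name: value" strings and that field names are matched case-insensitively. The function name and the sample values are made up for illustration and are not part of the draft.

```python
def languages(header_lines):
    """Collect every Language field value, in the order given.

    The first entry is treated as the dominant language by convention;
    a track without any Language field yields an empty list.
    """
    langs = []
    for line in header_lines:
        name, sep, value = line.partition(":")
        # Skip lines without a colon and fields other than Language.
        if sep and name.strip().lower() == "language":
            langs.append(value.strip())
    return langs


# Hypothetical header fields for two tracks.
example = ["Role: video/main", "Language: en", "Language: fr", "Name: main_video"]
print(languages(example))               # ['en', 'fr']
print(languages(["Role: video/main"]))  # []
```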

> 2. Some of the roles are unclear.  It would be good to add clarifying
> descriptions of their meaning and intended use.  For example, I don't know
> the motivation or use for: text/activeregion, text/annotation,
> text/transcript, text/linguistic, text/chapters, audio/music,
> audio/speech, audio/sfx.  Also, video/alpha needs to specify how a
> multichannel track (like Theora) can be rendered down to a single alpha
> channel, for example by using the unmodified bytes of Y as alpha.

Yup, will clarify. In fact, video/alpha is probably not necessary
as it's irrelevant to the semantic role of the track.

> 3. It seems that the name is meant to be only a semi-human-readable tag,
> not a fully user-facing title.  Perhaps a localized Title field would be a
> good addition at some point.

Hmm .. it may be a good idea to add a human-readable text field.

>> A further part of the wiki page is the proposal to impose an implicit
>> order on the tracks through the order in which their BOS pages are
>> given. This is nothing semantic, but only a convenience so we can
>> ascertain that different Web browsers will address the same track by
>> the same index number through JavaScript.
>
> I reiterate my preference for associative arrays, indexed by the Ogg track
> ID and name.  The BOS ordering is unstable, and provides no benefit that I
> can see over unique stream identifiers.

I can see where you're coming from, but building an associative array
is something that the application has to do. It will create an array
saying that serialno x maps to position i in the index array.
However, the order is still not specified by this. We have to create
an order that can be maintained between applications. If browser A
decides to order the associative array as {serial_1 => 1, serial_2 =>
2, serial_3 => 3} then all other browsers need to do that, too, because
otherwise the JavaScript programmer cannot address the track by index.
Thus, even if the application uses an associative array to manage the
mapping, there still has to be a fixed order.

Also, if you are suggesting to use the order of the serial numbers,
then that is just as unstable as the order of the BOS pages. All we
are interested in here is a consistent order for one and the same file.
It does not have to survive any transcoding, re-shuffling or other
manipulation. It is probably not even relevant to any other app than
Web browsers. But seeing as the order of BOS pages is the natural
parsing order for tracks, that order makes the most sense.
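
As a rough sketch of what I mean, assume a demuxer hands the application the serial numbers of the BOS pages in the order they appear in the file. The function name and the serial numbers below are invented, and I use 0-based positions here purely for illustration. Any application that builds its map this way gets the same indices for the same file, whereas ordering by serial number value gives a different but equally arbitrary result.

```python
def track_index(bos_serials):
    """Map each serial number to its position in the given order."""
    index = {}
    for serial in bos_serials:
        # Keep the first position if a serial number is seen again.
        if serial not in index:
            index[serial] = len(index)
    return index


# Hypothetical serial numbers, listed in BOS page order.
bos_serials = [0x5A17, 0x0003, 0x91C2]

by_bos_order = track_index(bos_serials)
by_serial_value = track_index(sorted(bos_serials))

print(by_bos_order[0x5A17])     # 0: first BOS page in the file
print(by_serial_value[0x5A17])  # 1: position changes when sorting by value
```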

>> Finally there are two rendering related fields that we propose
>> introducing: Display-hint and Altitude (their names could of course
>> still be changed).
>
> Altitude seems fine.  I have more problems with Display-hint:
>
> pip:
> Specifying that a track can be shown as PIP might be a good thing.  This
> mechanism seems very rigid, though.  Television sets that provide PIP
> usually let the user control the positioning, because they may want to see
> different parts of the underlying frame.  I'm not convinced that
> specifying a position or size along with the PIP hint is necessary at all.
>  If it is, the text should say "may be displayed" instead of "should be
> displayed" to indicate that the player should give the user control.
> Content producers who want exact control of overlay positioning should use
> Altitude and video/alpha.

It's a display HINT, therefore it's always just a suggestion to the
player. Whether a player offers the flexibility to the user to move a
pip overlay is not up to the media format to define.

> Where are the zero coordinates of the display area?

Ah good find: top left corner. Will add.

> If w and h are percentages, what are they percentages of?

Of the full display width and height. Will add.
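
For illustration, this is roughly how a player could turn such a hint into a pixel rectangle. It is only a sketch: I am assuming here that x and y are also given as percentages of the display size, measured from the top left corner, and the function name is made up.

```python
def pip_rect(display_w, display_h, x_pct, y_pct, w_pct, h_pct):
    """Convert percentage-based PIP values into a pixel rectangle.

    The origin (0, 0) is the top left corner of the display area.
    Returns (left, top, width, height) in pixels.
    """
    left = round(display_w * x_pct / 100)
    top = round(display_h * y_pct / 100)
    width = round(display_w * w_pct / 100)
    height = round(display_h * h_pct / 100)
    return left, top, width, height


# A quarter-size overlay near the top right of a 640x480 display.
print(pip_rect(640, 480, 70, 5, 25, 25))  # (448, 24, 160, 120)
```

Since it is only a hint, the player remains free to ignore the result or let the user move the overlay.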

> 2. mask:
> Ogg files are self-contained.  This proposal breaks that in a huge way,
> and I think it's terrible.  The right way to do this is in CSS in the
> webpage, a la
> http://labs.silverorange.com/files/video-demo/ambient.xhtml
> http://webkit.org/blog/181/css-masks/
>
> Please remove mask from the draft.

Yes, that is another train of thought. We indeed do not need the
functionality for the Web. But what about media players? Other media
formats allow for inclusion of such a mask inside the media resource to
allow masking the video display. This is an attempt at introducing
this functionality into Ogg. I won't fight for it if the general
consensus is: we don't need it. But I have had discussions in which it
was pointed out that e.g. Flash and MPEG are capable of this and Ogg
isn't. This would be a
relatively simple way to introduce it.

> 3. transparentcolor.
> This will not work.  Lossy video codecs do not reproduce exact colors.  I
> am not aware of any continuous-tone image or video coding system that
> employs this approach, because it doesn't work.  Please remove it from the
> draft.  People who want transparency will have to use the video/alpha system.

Interesting ... it would have been a simple way with little overhead.
But I see the problem.

Now, how would you cut out a person from the video? Would you need to
create a new track (the "video/alpha" video track) that provides the
continuing mask over the person and makes everything around that mask
transparent? Since we don't have alpha channels in Ogg, this would be
a means to introduce alpha channels.
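
To check my understanding, here is a small sketch of how a player might blend a foreground track over a background using the Y plane of a separate "video/alpha" track as the alpha value, as you suggested. It assumes 8-bit samples, planes of identical size flattened into lists, and that the full range 0..255 of Y is used as alpha (255 is opaque). The function names are made up.

```python
def composite_sample(fg, bg, alpha_y):
    """Blend one 8-bit sample: alpha_y = 255 keeps fg, 0 shows bg."""
    # Integer blend with rounding; the result stays within 0..255.
    return (fg * alpha_y + bg * (255 - alpha_y) + 127) // 255


def composite_plane(fg_plane, bg_plane, alpha_plane):
    """Blend two planes sample by sample using the alpha track's Y plane."""
    return [
        composite_sample(f, b, a)
        for f, b, a in zip(fg_plane, bg_plane, alpha_plane)
    ]


fg = [200, 200, 200, 200]   # person to cut out
bg = [50, 50, 50, 50]       # whatever is underneath
alpha = [255, 128, 0, 64]   # Y bytes of the alpha track
print(composite_plane(fg, bg, alpha))  # [200, 125, 50, 88]
```

Because the alpha values are blended rather than compared for equality, small coding errors in the alpha track only shift the result slightly, which is where the transparentcolor approach falls down.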

> Further improvements:
> As currently stated, the video/alpha label cannot actually be used to
> blend multiple tracks together.  For example, if I want an exactly
> controlled optional overlay, I would create 3 Theora tracks labeled as
> video/main, video/alpha, and video/alternate (or maybe video/additional),
> all the same size.  The altitude of the additional track would be higher
> than the main, to indicate that it goes on top.  There are now at least
> three possibilities:
> 1. The alpha track applies to the additional track.
> 2. The alpha track applies to the main track (before compositing)
> 3. The alpha track applies to the whole video (after compositing)
>
> At present, there is no way to distinguish these cases, and the situation
> is even more underspecified in the case of multiple additional tracks.  To
> remedy this, I recommend an additional header field "Applies-to: [name]".
>  This indicates the name of the track to which a track applies.  For
> example, a text track may apply to the audio track of which it is a
> transcription, and the video onto which it should be overlaid.  A
> video/sign track Applies-to the audio track of which it is a translation.
>  A video/alpha track Applies-to each track it is supposed to mask (before
> compositing).

This is exactly the dependency discussion that is at the end of the wiki page.

I agree, in the case of the alpha channel, the dependency between the
alpha track and a particular video track is really high and it needs
to be specified that this alpha channel does not make sense without
the video channel.

However, in the case of transcripts, this dependency isn't actually as
high. The transcript still makes sense without the audio track.

Also, the dependency of a sign track on an audio track isn't as high
either - they can still exist in isolation and the "Applies-to" only
indicates for display purposes which audio and sign language tracks
should be displayed together.

I can see different needs for dependencies here.
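
As a sketch of the lookup an application would do with such a field, assume each track is represented by a record holding its name, its role and the list of names it applies to. The record layout, the track names and the function name are hypothetical. Note that this only resolves the references and says nothing yet about how strong each dependency is.

```python
# Hypothetical track records; "applies_to" lists target track names.
tracks = [
    {"name": "main_video", "role": "video/main", "applies_to": []},
    {"name": "overlay", "role": "video/alternate", "applies_to": []},
    {"name": "overlay_mask", "role": "video/alpha", "applies_to": ["overlay"]},
]


def applied_to(tracks, target_name):
    """Return the names of all tracks that declare they apply to target_name."""
    return [t["name"] for t in tracks if target_name in t["applies_to"]]


print(applied_to(tracks, "overlay"))     # ['overlay_mask']
print(applied_to(tracks, "main_video"))  # []
```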

> For video/alpha, this is still insufficient, because masking a video and
> an overlay before compositing them is not the same as masking after
> compositing.  To permit masking after compositing, video/alpha tracks
> should optionally have one or more Altitudes.  For each Altitude held by a
> video/alpha track, it applies to the composited result of all visible
> higher tracks.

Yes, I agree - the "video/alpha" approach is a hack and not a feature.
Is this even the best way to go about it? Would it make more sense to
change Theora to include the possibility of an alpha channel?

Cheers,
Silvia.