[vorbis-dev] granulepos start/end revisited

illiminable ogg at illiminable.com
Sat May 22 23:38:34 PDT 2004



----- Original Message -----
From: "Arc Riley" <arc at xiph.org>
To: <vorbis-dev at xiph.org>
Sent: Sunday, May 23, 2004 12:29 PM
Subject: Re: [vorbis-dev] granulepos start/end revisited

> On Sun, May 23, 2004 at 01:39:12PM +1000, Conrad Parker wrote:
> >
> > The only relevant difference between the two schemes is that the
> > state-change style injects an extra packet at the end-time of the
> > subtitle's presentation.  All subtitle buffering features that you
> > discuss are identical; however, by going from a state-change style to
> > a duration-only style you've lost end-confirmation packets.
>
> Since you've effectively argued the exact same point my previous reply
> was to, and failed to respond to the examples, I'm going to give you the
> exact same reply again.  I've appended more below, but please read
> through this first.
>
> On Fri, May 21, 2004 at 10:53:45AM -0400, Arc Riley wrote:
> >
> > Subtitle A lasts 0 to 31
> > Subtitle B lasts 5 to 20
> > Subtitle C lasts 28 to 35
> >
> > Putting this on a chart:
> > 00  02  04  06  08  10  12  14  16  18  20  22  24  26  28  30  32  34
> > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> >           BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >                                                         CCCCCCCCCCCCCCCC
> >
> > Long phrases are redefined every few seconds, so it's considered OK for
> > an unknown phrase to not be shown for a few seconds after seek.
> > However, if a phrase was already known before seek, it should still be
> > shown after seek if it's still valid.
> >
> > Say you played the above stream until 10, then seeked to 25; in the
> > current method you would get A, know immediately that A lasts until
> > granule 31, then get B and know immediately that it lasts until granule
> > 20.  When you seek, the new granulepos will drop B (because it's
> > expired) but A is still valid for a while so it's kept.  C will be
> > received further down and display properly.  Nice and tidy.
> >
> > Now say this were to be replaced by the start-stop method you described.
> > A would "turn on" at granule 0, then B would "turn on" at granule 5,
> > then you seek.  Uh-oh!  Did the seek skip A getting turned off?  B
> > getting turned off?  Better destroy all known phrases, just to be sure.
> > When seeking is completed, even though we would have otherwise known
> > that A is still valid, A would not be shown.  If it's especially long
> > you may see it get redefined later down the stream, but it won't be
> > displayed until then.
> >
> > No, it's FAR better to know when things are going to stop ahead of time.
> > In either case the decoder has to keep track of phrases, because in the
> > first you keep track of them so they're turned off at the right time, in
> > the second you keep track of them so they can be turned off when the
> > "off" packet for them is received.  Complexity-wise, within the code,
> > neither is really more complicated if we don't consider the seeking
> > issue.  In the first the codec is responsible for clean-up, in the
> > second the stream is responsible for clean-up.  The former is much
> > better, since you should never trust the Ogg stream to be without
> > errors, and once you throw the seeking issue in, the way we're doing
> > things now becomes the obvious optimal solution.
>
>
> Now, do you understand the problem at hand?
>
> In the implementation of Writ, switching to end-pos doesn't work.  The
> seek mechanism needs to know which granules are provided by a given
> page, so in either case the granule of the page needs to be the start
> granule of that phrase, not the end.

My personal opinion is that all codecs could benefit from having both start
and stop times in the framing data.
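
To make that concrete, here is a rough sketch of what per-phrase framing
data carrying both timestamps might look like, and the seek-time check it
enables.  All names are invented for illustration; this isn't any
existing codec's layout:

    #include <stdint.h>

    typedef struct {
        int64_t start_granule;  /* granule at which the phrase appears */
        int64_t end_granule;    /* granule at which the phrase expires */
        char   *text;           /* the phrase itself */
    } phrase_t;

    /* After a seek to `pos`, a decoder that knows both timestamps can
     * keep the phrases that are still live and drop only the expired
     * ones, rather than clearing its whole phrase table. */
    static int phrase_still_valid(const phrase_t *p, int64_t pos)
    {
        return p->start_granule <= pos && pos < p->end_granule;
    }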

>
> In the start-stop mechanism you describe, to repeat myself, you lose the
> ability to know which phrases are "still on" after a seek.  In order to
> do the start-stop method that you're recommending, you have to clear
> your entire phrase table during a seek, because you don't know if you've
> skipped a "stop" packet or not.
>
>

I think this is a general limitation of most subtitling systems...
Realistically, what percentage of seeks start and end within a phrase's
duration?  I would guess not many.  Having stop times only solves a very
small part of the larger problem.

In terms of user experience, what you really want is for all subtitles to
be shown whenever they are supposed to be.  Neither method really solves
the lost-subtitle problem; both just take different approaches to
mitigating it.  Considering that subtitle phrases are generally short,
short-range seeks are better done with a linear forward scan: in the
average case you can scan forward a hundred pages or so faster than a
binary search can even sync to a page boundary for a single iteration.

Most subtitles would have a duration of less than 20 seconds, and for
seeks over such short distances a linear forward scan would be faster
than a binary search in most cases.  And if you scan forward linearly, it
doesn't matter which approach you use, as you see all intervening
subtitle pages and can act on them accordingly.
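
As a sketch of that decision (the threshold and helper names are
invented, and granule-to-seconds conversion is assumed to have happened
already):

    #include <stdint.h>
    #include <stdio.h>

    #define LINEAR_SCAN_THRESHOLD 20  /* seconds; assumed max phrase length */

    /* Stubs standing in for the real page-level seek primitives. */
    static void scan_forward_to(int64_t t)
    {
        printf("linear scan to %lld\n", (long long)t);
    }
    static void bisect_to(int64_t t)
    {
        printf("binary search to %lld\n", (long long)t);
    }

    static void seek_subtitles(int64_t current_sec, int64_t target_sec)
    {
        if (target_sec >= current_sec &&
            target_sec - current_sec <= LINEAR_SCAN_THRESHOLD) {
            /* Short forward seek: walk the pages in order, acting on
             * every intervening subtitle packet, so no state is lost. */
            scan_forward_to(target_sec);
        } else {
            /* Long or backward seek: bisect, accepting that phrases
             * already in flight may be missed. */
            bisect_to(target_sec);
        }
    }

    int main(void)
    {
        seek_subtitles(10, 25);   /* short jump: linear scan */
        seek_subtitles(10, 500);  /* long jump: binary search */
        return 0;
    }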

For subtitles longer than 20 seconds, and given that the missing-subtitle
problem is inevitable in some cases, realistically how time-critical can
a subtitle of such long duration be?  If it lasts that long, it probably
isn't strongly tied to a specific time frame.  Where subtitles represent
dialogue, the duration is limited by the time to utter the phrase, which
is inevitably short, and by the amount of subtitle text that can be
displayed at one time, which is also short.  These are the subtitles that
most need to be always correctly displayed.

Neither method solves the case where a short (important) dialogue
subtitle packet appears just before the point you seek to... which is the
worst case.  I.e. a subtitle starts at 5 seconds (with a duration of 1.5
seconds) and you seek to 5.01 seconds: in the refreshable-subtitle
scenario you get the maximum span of missing subtitle, and with such a
short display duration the subtitle expires before any refresh, which
means a completely lost subtitle.

> It also results in more overhead.  Not only would phrases need IDs
> (perhaps referencing them by start granule, though?) and stop packets,
> but you lose functionality by doing so.
>

But having phrase IDs also adds functionality that is not possible
without them.

Realistically, how many different phrases can be displayed at one time
(assuming overlays)?  Probably not more than two: one at the top, one at
the bottom.

Consider the case where you have 10 tracks: the subtitles in each
language.  Obviously you only want to display one of these at a time.
Without phrase/track IDs, each one needs its own stream so they can be
differentiated by the serial numbers on Ogg pages.  I would contend that
the framing overhead of having 10 streams is much greater than that of
having phrase IDs within a page.

Now, what if there are two tracks in each language: the dialogue
subtitles and, say, director's notes?  Using 20 logical streams, is there
any way to express that there are pairs which correspond to each other,
i.e. the director's notes and the dialogue in each language?
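
Something like the following per-packet tagging would express that
pairing directly; the field names are invented, purely to illustrate:

    #include <stdint.h>

    typedef enum { TRACK_DIALOGUE = 0, TRACK_NOTES = 1 } track_kind_t;

    typedef struct {
        uint8_t      language;  /* index into a language table carried
                                 * in the setup header */
        track_kind_t kind;      /* lets a player pair director's notes
                                 * with their dialogue track */
        /* ...granule fields and the phrase text would follow... */
    } subtitle_packet_header_t;

With this, selecting "German dialogue plus German director's notes" is
just a match on the language field, something 20 unrelated logical
streams have no way to express.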

> Writ isn't the only issue that the start-time rule applies to.  Take,
> for instance, MNG.  Each frame has a variable delay between them, and
> the frames can be widely separated.  In some cases (using it as a video
> codec) it's continuous, but in other cases (as a subtitle codec) it's
> discontinuous.  Using start-time in the latter case eliminates a lot of
> issues.
>
> Yet another example is MIDI.  It defines notes.  Said notes are not
> simple "turn on, turn off" either; a lot of them have elements of
> sustain-release and so on.  Each note has a start granule and an end
> granule.  The notes can overlap.  The entire end-granule scheme falls
> apart in this method, and trying to make the end-granule scheme work by
> forcing MIDI into the rigid rules of "start-stop only" limits its
> functionality for no reason.  Rillian has already implemented
> start-granule times in OggMIDI, which has been available for some time
> (though his implementation orders by end-time, which breaks muxing and
> needs to be changed).
>

As I mentioned before, I don't see why both timestamps shouldn't be
included in the framing data.  They are both useful for their own
purposes, and even for the codecs that currently use end-time, there are
often cases where start times have to be determined in not-so-efficient
ways, which providing both times would alleviate.

> Now apply this to a subtitle format.  Say you wanted a text codec which
> fades in/out.  With state changes, you would have to redefine the color
> (or something) every granule, a separate packet for each, to ensure that
> it can be muxed properly.  With the current system, you could simply say
> "this phrase starts at this granule and fades out for this duration".
> Not that such a system is being put into Writ anytime soon, but that's
> another good example.
>

Not necessarily true; there's no reason why a state-change packet can't
indicate the beginning of a fade, i.e.

START, START FADE, END

where END could be optional if START FADE fades to nothing.  It would
just be effectively signalling the state change a few seconds earlier
than it finalises.
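
As a sketch, the event vocabulary and the fade it implies could be as
simple as this (all names invented):

    #include <stdint.h>

    typedef enum {
        PHRASE_START,       /* phrase becomes visible */
        PHRASE_START_FADE,  /* begin fading out from this granule */
        PHRASE_END          /* optional if the fade ends at zero opacity */
    } phrase_event_t;

    /* Opacity at granule `pos` after a START FADE at `fade_start`
     * lasting `fade_len` granules; purely illustrative. */
    static double fade_alpha(int64_t pos, int64_t fade_start,
                             int64_t fade_len)
    {
        if (pos <= fade_start)            return 1.0;
        if (pos >= fade_start + fade_len) return 0.0;
        return 1.0 - (double)(pos - fade_start) / (double)fade_len;
    }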

You are also assuming that subtitles should be "images", which is a pretty
inefficient way to transmit them.

In my opinion, subtitles should be text, for three main reasons:
1) It is a more efficient way to transmit them.
2) It allows search engines to use text searches.
3) It allows the player more flexibility to use different fonts or
colours if the user so desires.

I think a better approach is for setup headers to contain "font/display"
information; this way the "image" data is only defined once per physical
stream.  This means a mechanism for associating a track or phrase with a
display characteristic is needed.  Also, if no display characteristics
are offered, the player can use a default one.

Furthermore, for end users: if they decide they don't like the fancy
script font the author thought was so cool, they can just tell their
player to display the text using their preferred display characteristics.
Similarly for text size: someone with not-so-good vision may not like the
cool micro font the author decided to use, and may want a bigger font
that is easier to read.

This mechanism provides many advantages: it lets authors create artistic
subtitling as the default display characteristics for a stream, but it
also allows the user to override that choice and choose display
characteristics suitable to them.
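
The override chain itself is trivial; a rough sketch, with the structure
and names invented:

    #include <stdint.h>

    typedef struct {
        const char *font;
        int         size_pt;
        uint32_t    colour_rgb;
    } display_style_t;

    static const display_style_t builtin_default =
        { "sans-serif", 18, 0xFFFFFF };

    /* User preference beats the stream's author-supplied default, which
     * beats the player's built-in fallback. */
    static const display_style_t *
    effective_style(const display_style_t *user,
                    const display_style_t *stream)
    {
        if (user)   return user;
        if (stream) return stream;
        return &builtin_default;
    }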

Another alternative, which would completely solve the missing-subtitle
problem (though it does introduce its own issues), is for subtitles to
carry a pointer back to their predecessor.  As muxing is a forward
operation, this is not overly complex for a muxer to implement, as all
data is known at the specified time point.  It does introduce some
special requirements in seeking, but assuming we are resigned to the
periodic-refresh strategy anyway, this should not be overly taxing.  It
also introduces new issues for splitting and remuxing streams.
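
A rough sketch of how a decoder might use such back-pointers after a
seek.  read_page_at() and show_phrase() stand in for the real page fetch
and renderer, first_offset is assumed to be the most recent subtitle
page at or before the seek point, and an assumed upper bound on phrase
length terminates the walk:

    #include <stdint.h>

    #define MAX_PHRASE_GRANULES 600  /* assumed bound on phrase length */

    typedef struct {
        int64_t granule_start;
        int64_t granule_end;
        int64_t prev_offset;  /* byte offset of the previous subtitle
                               * page, or -1 at the chain's start */
    } subtitle_page_t;

    extern subtitle_page_t read_page_at(int64_t offset);
    extern void show_phrase(const subtitle_page_t *pg);

    /* After seeking to `pos`, walk the predecessor chain backwards and
     * re-display any phrase that is still live at `pos`. */
    void recover_active(int64_t first_offset, int64_t pos)
    {
        int64_t off = first_offset;
        while (off >= 0) {
            subtitle_page_t pg = read_page_at(off);
            if (pg.granule_start <= pos && pos < pg.granule_end)
                show_phrase(&pg);  /* survived the seek */
            if (pos - pg.granule_start > MAX_PHRASE_GRANULES)
                break;  /* nothing earlier can still be live */
            off = pg.prev_offset;
        }
    }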

Zen.

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-dev-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.


