[theora-dev] Keyframe seeking in Ogg and spec
Monty
xiphmont at xiph.org
Fri Sep 13 11:50:09 PDT 2002
Folks have noticed that the documentation is semi-silent about how to
properly encode the granule position and interleave synchronization of
keyframe-based video. The primary reasons for this:
a) we at Xiph hadn't had to do it yet
b) there are several easy possibilities, and the longer we had to
think about it before mandating One True Spec, the better that spec
would likely be.
The lack of a painfully explicit spec has led to the theory that it's
not possible; that's not true, there are a few ways to do it. Several
require no extension to Ogg stream v 0. A last way requires an extra
field (a point against it), but does not actually break any stream
that currently exists.
The time has come to lay down the spec as we're currently building the
real abstraction layers in a concrete Ogg framework now where the Ogg
engine, the codecs, and the overarching Ogg control layers are neatly
put into boxes connected in formalized ways. Below I go into detail
about each scheme in a 'thinking aloud' sort of way. This is not
because I haven't already given the matter sufficient thought, it is
because I wish to give the reader sufficient background information to
understand why one way is better than the others. This is not a call
for input so much as an educational effort (and a public sanity check
of my thinking; please do pipe up if it appears I missed a salient
point).
Starting Assumptions:
1) Ogg is not a non-linear format. It is not a replacement for the
scripting system of a DVD player. It is a media transport format
designed to do nothing more than deliver content, in a stream, and
have all the pieces arrive on time and in sync. It is not designed to
*prevent* more complex use of content, it merely does not implement
anything beyond a linear representation of the data contained within.
If you want to build a real non-linear format, build it *from* Ogg,
not *into* Ogg. This has been the intent from day 1.
2) The Ogg layer does not know specifics of the codec data it's
multiplexing into a stream. It knows nothing beyond 'Oooo, packets!',
that the packets belong to different buckets, that the packets go in
order, and that packets have position markers. Ogg does not even have
a concept of 'time'; it only knows about the sequentially increasing,
unitless position markers. It is up to higher layers which have
access to the codec APIs to assign and convert units of framing or
time.
3) Given pre-cached decode headers, a player may seek into a stream at
any point and begin decode. It may be the case that audio may start
after video by a fraction of a second, or video might be blank until
the stream hits the next keyframe, but this simplest case must just
work, and there will be sufficient information to maintain perfect
cross-media sync.
4) (This departs from current reality, but it will be the reality very
soon; vorbisfile currently blurs the careful abstraction I'm about to
describe) Seeking at an arbitrary level of precision is a distributed
abstraction in the larger Ogg picture. At the lowest-level Ogg stream
abstraction, seeking is one operation: "find me the page from logical
stream 'n' with granule position 'x'". All more complex seeking
operations are a function of a higher-level layer (with knowledge of
the media types and codec in use) making intelligent use of this
lowest Ogg abstraction. The Ogg stream abstraction need deal with
nothing more complex than 'find this page'.
The various granulepos strategies for keyframes concern this last point.
The basic issue with video from which complexity arises is that frames
often depend on previous and possibly future frames. This happens in
a larger, general category of codecs whose streams may not begin
decode from just any packet as well as packets that may not represent
an entire frame, or even a fixed-time sampling algorithm. It is a
mistake to design a seeking system tied to an exact set of very
specific cases. While one could implement an explicit keyframe
mechanism at the Ogg level, this mechanism would not cover any of the
other interesting seeking cases while, as I'll show below, the
mechanism would not actually be necessary.
There will be a few complaints that Ogg is being unnecessarily subtle
and shifts a great deal of complexity into software which a few extra
page header fields could eliminate. Consider the following:
1) Ogg was designed to impose a roughly .5-1% over the raw packet data
over a wide range of packet usage patterns. 'A few extra fields'
begins inflating that figure for specific special cases that only
apply to a few stream types. Right now there is no header field that
is not general to every stream. There is no fat in the page headers.
2) The Ogg-level seeking algorithm is exceptionally simple and can be
described in a single sentence: "Find the earliest page with a
granulepos less than but closest to 'x'". This shifts the onus of
assembling more complex seeking operation requiring knowledge of a
specific media type into a higher layer that has knowledge of that
media type. The higher layer becomes responsible for determining for
what 'x' Ogg should search. The division of labor is clear and
sensible.
3) Complex, precise seeking operations are still contained entirely
within the framework, just at a higher layer than Ogg-stream. At no
time is an application developer required to deal with seeking
mechanisms within an Ogg stream or to manually maintain stream
synchronization.
High level handwaving- How seeking really works:
The granulepos is intended to mean, roughly, 'If I stop decode at the
end of this page, I will get data from my decoder up to position
'granulepos'. The granulepos simultaneously provides seeking
information and a 'length-of-stream' indicator. Depending on the
codec, it can also usually be used to indicate a timebase, but that
isn't our problem right now.
By inference, the granulepos is also used to construct a value 'y'
such that 'if I begin decode *from* point 'y', I will get data
beginning at position 'granulepos'. Although in some codecs, y ==
granulepos, that is not necessarily the case when decode can't begin
at any arbitrary packet. The granulepos encoding method candidates I
will now describe affect exactly the 'granulepos' to 'y' conversion
process. Note also that none of these affect Ogg, only the higher
decision-making layers... Different circumstanced necessitated by
different codecs can lead to different valid choices, all of which
work as far as Ogg is concerned. However, for our I-/P-/B-frame video
case, there is a pretty clear winner.
Strategy 1: Straight Granulepos, Keyframes Are Not Our Problem.
In this scheme, the granulepos is a simple frame counter. The
seeking decision-maker in the codec's framework plugin is
responsible for determining if a frame is a keyframe or not, and if
it can't begin decode from a given frame, it must request another
earlier frame until it finds a keyframe. If the codec so desires,
it can store 'what is my keyframe?' information in the stream
packets.
This case means that each seek to a *specific* frame in a video
stream will generally result in two Ogg seeks; a first seek to the
the requested frame, then a second seek backwards to find that
frame's keyframe.
A larger concern is the semantic accuracy of the granulepos; it's
intended to reflect position accurately when decoding forward. In
this scheme, it's fine for a P-frame to update the counter (as it
can be decoded going strictly forward), but B frames will also
advance the counter; they can't be decoded without subsequent P or I
frames. Thus, the semantic value of granulepos no longer strictly
represents 'we can decode up to 'granulepos' at the end of this
frame'.
Strategy 2: Granulepos Represents Keyframes Only
In this scheme, only keyframes update the granulepos (monotonically
or non-monotonically). It simplifies the seeking process to a
keyframe as an Ogg-level seek to page 'x' will always yield a page
with a keyframe. In addition, granulepos will also always mean 'we
can decode up to *at least* this point in the stream. If the stream
is truncated at P or B frames past granulepos, the extra frames can
be discarded. (A special case would need to be defined to terminate
a stream that doesn't end on an I frame).
The difficulty with this scheme is that it presents slightly more
for the software level decoder to track; a proper frame number could
not be determined internally without tracking from an I frame.
Also, the granulepos an Ogg page would not necessarily map to the
last packet on the page, or even any packet on that page; multiple
sequential pages could have the same granulepos. It is conceptually
slightly messy, although the 'messiness' does not make it at all
impractical.
Strategy 3: Granulepos Encodes Some State
In some ways, this strategy is the most semantically 'over clever',
but also the easiest to implement and the one that gives the most
correct, up to date sync information. Pending comments, it is the
I/P/B video strategy I currently favor.
The granulepos is 64 bits, a size that is absolutely necessary if,
for example, it represents the PCM sample count in an audio codec.
When being used to encode video frame number, however, it is
comparatively absurdly large*.
* note that although granulepos is not permitted to wrap around, we
can simply begin a new logical stream segment with a new serial
number should a 30fps video stream ever hit the ten-billion year
mark.
Thus we clearly have room to skim a few bits off the bottom of
granulepos to represent I, P or B frame. These bits are not used as
flags, but rather, frame representation becomes a counting problem;
We do this such that the count is still always strictly increasing.
For example, we know that I frames will never be more than 256
frames apart and P frames no more than 31 B frames apart, the
granulepos of an I frame can be defined to always be granulepos |
0xff == 0. If we can have up to seven intervening P frames, they
could be numbered in granulepos-of-iframe + 0x20, 0x40,
0x60... 0xe0. B frames between the I and P frames would use the
remaining five bits and be numbers as sub-I and sub-P frames 1
through 31. Thus, starting from zero, the frames/packets in the
pattern IPBBPBBI would be numbered 0x000, 0x020, 0x021, 0x022,
0x040, 0x041, 0x042, 0x100.
If we wish to preserve the ability to represent a timebase, the
granulepos number for I frames need not be increased monotonically
and shifted; it can be used to represent the frame number. The
above example becomes 0x000, 0x020, 0x021, 0x022, 0x040, 0x041,
0x042, 0x700. To get real frame number (from an I frame), we just
shift granulepos >> 8. This scheme can be taken further or modified
to get frame number from any video frame.
In this way, we can always seek, first time, to a desired key frame
page (by seeking to Ogg page 'x' where x | 0xff == 0). In
addition, each frame still has a unique frame number and also a
clear 'group' number, potentially useful information to the decoder.
Lastly, granulepos is still semantically correct, although it is
now, in a sense, representing a whole.fractional frame number for
buffering purposes.
Scheme Four: Extra 'Seekpos' Field / Straw Man
Another possibility requires extension of the current Ogg page
format. Although older players would reject any such extended pages
as invalid, we do have versioning and typing fields, so there's not
actually any compatibility problems with current Ogg pages... in the
future.
The idea in this scheme is to keep the current granulepos as a frame
number field (ala scheme 1), but also add a new field 'seekpos' that
is used, rather than granulepos, in seeking. The seekpos would
represent the number of the last keyframe that passed by.
advantages:
1) The net effect of this strategy is to modify scheme 1 to only
require one bisection seek rather than two. Some amount of code
simplification (over scheme 1) at the decision-making level.
disadvantages:
1) The Ogg format will need to be revved. No current (ala 1.0) Ogg
code will understand the new pages.
2) The header becomes larger, from a minimum size of 27 bytes to a
minimum size of 35.
3) This strategy only enhances keyframes; it is of no use in other
odd seeking cases.
4) Gives no more information than scheme 3, but is still more
complicated, both in code and API (Ogg would have to understand
keyframes).
Thus, there's no substantial reason to prefer extending the format
over a scheme that's possible within the existing framework. Note
that schemes 1-3 can all be implemented within the Ogg stream today.
Monty
--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'theora-dev-request at xiph.org'
containing only the word 'unsubscribe' in the body. No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.
More information about the Theora-dev
mailing list