[ogg-dev] New Ogg Dirac mapping draft

Wed Aug 13 13:08:15 PDT 2008

On 2008-08-12, Ralph Giles <giles at xiph.org> wrote:
> David Flynn has proposed a new Ogg Dirac mapping.

I thought it'd be a good idea to explain some of the rationale for why we
want to change the definition of granulepos in the ogg-dirac mapping.

Terms used in this document:
 - GP64 = The 64-bit granule_position as found in the page header.
 - GPH+L = Granule pos high + low as split by granule_shift.
 - ST = System Time; this is the monotonically increasing decoder clock
 - PT = Presentation Time; Picture is displayed when PT = ST, which implies AV sync.
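
To illustrate the GP64 and GPH+L terms, here is a minimal Python sketch.
It assumes the usual keyframe-granuleshift convention (the upper bits are
the high part, the bottom granule_shift bits are the low part). The
function names are mine and not taken from any library.

```python
def split_granulepos(gp64, granule_shift):
    """Split a 64-bit granule_position into (high, low) parts.

    Assumption: 'high' is the upper bits and 'low' is the bottom
    granule_shift bits, as in the keyframe-granuleshift scheme.
    """
    high = gp64 >> granule_shift
    low = gp64 & ((1 << granule_shift) - 1)
    return high, low


def gph_plus_l(gp64, granule_shift):
    """GPH+L: the sum of the high and low parts."""
    high, low = split_granulepos(gp64, granule_shift)
    return high + low
```

For example, with granule_shift = 6, the value (6 << 6) | 3 splits into
high = 6 and low = 3, so GPH+L = 9.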

NOTE, we will not use the terms I,P,B -- they are mpeg2 terms which do not
map to constructs in dirac or h264 properly.

Properties of an Out-of-order video codec (dirac,h264,vc-1,mpeg2)
 - Each picture has a unique PT.
 - Pictures in the stream are not in PT order.
 - The decoder reorders pictures at output into PT order.
 - ST != PT in stream order (ie, input to decoder).

De facto rules of ogg (I've not found these actually written down anywhere):
 [A1] One of GP64 or GPH+L must increase for each packet
      For in-order codecs using keyframe-granuleshift, both are true.
 [A2] GPH+L == time.
   All codecs so far are in-order, so ST=PT=time.
 [A3] Page flushes are NOT invariant across remuxes.

 The ogg RFC does state that GP64 is codec specific without any
 restriction.

What is needed to decode & display Out-of-order coded video?
 Each picture must have a unique & accurate(correct) PT.
 ST needs to be derived from the stream correctly:
   - Can interpolate ST for a particular picture
   - Cannot determine the starting value of ST from the first picture.
   - This happens in streaming, example:
      PT: 14 10 11 12 13
      ST: 10 11 12 13 14

What is problematic with the xiph mapping?
  - Here is an example using the xiph mapping:
      Sync point: V       V                 V
      PT(actual): 0 3 1 2 6 4 5 9 7 8 c a b d
      GP_high:    0 0 0 0 6 6 6 6 6 6 6 6 6 d
      GP_low:     1 1 2 3 1 1 1 1 2 3 3 4 5 1
      GPH+L-1:    0 0 1 2 6 6 6 6 7 8 8 9 a d

  - Each picture does not have a unique value for granulepos
     => Cannot determine unique&correct PT
     => Cannot determine correct ST
     => If (due to paging) no GP64 is available for a frame,
        it is impossible to correctly interpolate the value of PT.
     => Don't know when to display pictures

  - Seeking is difficult:
     -  Want to seek to frame N
     -  GPH+L is non-unique (don't know if the right one has been found)
     => Some values of GPH+L do not exist (searches may fail)
     -  and GPH+L != N (ie, may find the wrong frame)

  - Locating the sync point (eg, after seek) is irritating
     - GP_low != number of packets(pictures) since sync point.
     => Have to search backwards until GP_high changes

  - Copes badly with open gop:
    To correctly decode picture(PT=4) in above example, the sync point it
    depends upon is picture(PT=0).
    However, this would violate the property of GP64(n) > GP64(n-1).

  - It requires that a page is flushed before transmitting a sync point
    so that a syncpoint is guaranteed to have a valid GP64.
    This violates axiom A3
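
The example table above can be checked mechanically. The following is a
minimal Python sketch using only the PT, GP_high and GP_low rows from that
table (hex digits converted to integers); the helper names are
hypothetical. It recomputes the GPH+L-1 row, lists the values shared by
more than one picture, lists the frame numbers that never appear, and
finds a sync point by walking backwards until GP_high changes.

```python
from collections import Counter

# Rows copied from the example table (hex digits -> integers).
PT      = [0x0, 0x3, 0x1, 0x2, 0x6, 0x4, 0x5, 0x9, 0x7, 0x8, 0xc, 0xa, 0xb, 0xd]
GP_HIGH = [0x0, 0x0, 0x0, 0x0, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0xd]
GP_LOW  = [1, 1, 2, 3, 1, 1, 1, 1, 2, 3, 3, 4, 5, 1]

# The "GPH+L-1" row: what a demuxer would take as the frame number.
gphl = [h + l - 1 for h, l in zip(GP_HIGH, GP_LOW)]

# Values carried by more than one picture (non-unique granulepos).
duplicates = sorted(v for v, n in Counter(gphl).items() if n > 1)

# Frame numbers that exist as a PT but never appear as GPH+L-1.
missing = sorted(set(PT) - set(gphl))


def find_sync_point(index):
    """Walk backwards from `index` until GP_high changes."""
    while index > 0 and GP_HIGH[index - 1] == GP_HIGH[index]:
        index -= 1
    return index


print("GPH+L-1:   ", gphl)
print("duplicates:", duplicates)
print("missing:   ", missing)
print("sync point for packet 10 is packet", find_sync_point(10))
```

For packet 10 (counting from 0) the sync point is packet 4, six packets
back, while GP_low for that packet is 3.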

Some comments on choice of GP64 in bbc mapping:
  Consider axiom A2 (GPH+L == time), assume this is PT.
  - PT (not ST) makes sense for AV sync
  - PT (not ST) makes sense for locating pictures (seek)
    although a naive sync will find the wrong picture.
  - The stream is in the order required to satisfy decoding
    dependencies, ie PT jumps around.
    This violates axiom A1 (GPH+L(n) > GPH+L(n-1)).
    This violates axiom A1 (GP64(n) > GP64(n-1)).

  Consider axiom A2 (GPH+L == time), assume this is ST.
  - Complies with axiom A1 (GPH+L(n) > GPH+L(n-1)).
  - Complies with axiom A1 (GP64(n) > GP64(n-1)).
  - Is not useful for AV sync.
  - Is not useful for seeking (you will end up with the wrong picture).

  => No good reason for GPH+L == ST
  .'. choose GPH+L = PT.
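
To make the trade-off concrete, here is a small Python sketch that checks
axiom A1 against both candidates. The PT values are taken from the earlier
xiph-mapping example, in stream order. ST is assumed, for illustration
only, to tick once per packet.

```python
def strictly_increasing(values):
    """True if every value is greater than its predecessor (axiom A1)."""
    return all(b > a for a, b in zip(values, values[1:]))


# PT in stream (decode) order, from the earlier example.
pt_stream = [0, 3, 1, 2, 6, 4, 5, 9, 7, 8, 12, 10, 11, 13]
# Assumption for illustration: ST advances by one per packet.
st_stream = list(range(len(pt_stream)))

print("GPH+L = PT satisfies A1:", strictly_increasing(pt_stream))
print("GPH+L = ST satisfies A1:", strictly_increasing(st_stream))
# Each PT still occurs exactly once, so it identifies a picture uniquely.
print("PT values unique:", len(set(pt_stream)) == len(pt_stream))
```

So PT fails the ordering rule but keeps a unique, meaningful value per
picture, while ST passes the ordering rule and tells you nothing about
which picture you have.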

Some interactions with skeleton:
  >  ... allowing to map a granule position [GPH+L] to time by calculating
  >  "granulepos [GPH+L] / granulerate"
    -- http://wiki.xiph.org/OggSkeleton

 '.' the only time useful for decoding is the PT
  => GPH+L = PT.

  Ie, you can seek based upon presentation time, however a binary
  search can hit a reordered picture and therefore choose the wrong
  picture at the end.  The error is +/- one GOP.
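
  A minimal Python sketch of that failure mode, assuming GPH+L = PT and a
  plain lower-bound binary search over packets in stream order. The
  function name is hypothetical; the PT values are from the earlier
  xiph-mapping example.

```python
def naive_seek(pt_stream, target):
    """Lower-bound binary search that wrongly assumes PT is sorted."""
    lo, hi = 0, len(pt_stream)
    while lo < hi:
        mid = (lo + hi) // 2
        if pt_stream[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo


# PT in stream (decode) order, from the earlier example.
pt_stream = [0, 3, 1, 2, 6, 4, 5, 9, 7, 8, 12, 10, 11, 13]

hit = naive_seek(pt_stream, 10)
print("search for PT=10 stops at packet", hit, "with PT", pt_stream[hit])
print("PT=10 is actually packet", pt_stream.index(10))
```

  Here the search for PT=10 stops on the picture with PT=12, one packet
  early. The miss stays inside the same GOP, which is consistent with the
  +/- one GOP bound.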

 ---
  > Restart after seek still requires new code; that part of skeleton
  > doesn't work.
    -- http://article.gmane.org/gmane.comp.multimedia.ogg.devel/1118

  Actually, ogg skeleton does not provide such information to any GOP
  based video codec.
    - It only has Preroll, which in a GOP based video
      codec is constantly varying.
    - Preroll only makes sense for video when using something such as
      Ponly-with-intra-slice-refresh, where there are no keyframes.

Some final remarks:
  - one-packet-per-page:
    It has been said that one-packet-per-page (ie, a page flush per
    packet) upsets remuxing due to axiom A3.  However, it is a requirement
    of the xiph mapping that a page flush occurs before a sync point.
  - To resolve the above contradiction, I assume that axiom A3 is invalid.
  - The bbc mapping allows reconstruction of PT, ST and distance to
    syncpoint without any a priori information.
  - The bbc mapping does not require peeking into the packet payload to
    fill in the blanks
  - If GPH+L is to be useful, it is not possible to comply with
    axiom A1 (GPH+L(n) > GPH+L(n-1)).

Stop press:
  - I've realised that it is possible to rearrange GP64 in such a way that:
    + Complies with axiom A1 (GP64(n) > GP64(n-1)).
    + violates axiom A1 (GPH+L(n) > GPH+L(n-1)).
    However, I doubt that is of any use, since I hope any sane demuxer
    searches based upon GPH+L.

Regards,
..david