[theora] Indexing Ogg files for faster seeking

Wed Oct 7 17:48:30 PDT 2009

Below is another version of the index track spec with one index packet 
per stream.

The index format is still quite simple, though not as compact as the 
previous "one merged index per file" approach. I estimate that indexing 
two tracks, with one key point every two seconds in each track, will 
take approximately 70KB per hour of video (11.6KB per 10 minutes). 
That's about 20 bytes of index per second of video.
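
As a rough check of those numbers, here is a small Python sketch of the arithmetic. It assumes the 20-byte key point from the format quoted further down (8 byte offset + 4 byte checksum + 8 byte time) and ignores the per-packet fields and Ogg page overhead; the function name is just for illustration.

```python
# Size of one key point in the proposed format:
# 8-byte page offset + 4-byte checksum + 8-byte presentation time.
KEY_POINT_BYTES = 8 + 4 + 8

def index_bytes(duration_s, tracks=2, interval_s=2.0):
    """Approximate key point payload in bytes.

    Ignores the serialno/count fields of each index packet and any
    Ogg page framing overhead, so real indexes will be slightly larger.
    """
    key_points_per_track = int(duration_s / interval_s)
    return tracks * key_points_per_track * KEY_POINT_BYTES

print(index_bytes(3600))  # one hour of two-track video: 72000 bytes
print(index_bytes(600))   # ten minutes: 12000 bytes
```

With two tracks and one key point every two seconds per track, that works out to one 20-byte key point per second of video, which is where the 20 bytes/second figure comes from.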

With the original "one merged index per file" approach it's about half 
that, but I think the added size is an acceptable trade-off. I imagine 
the majority of video out there on the internet is under 10 minutes long 
anyway (requiring a 12KB index...), and when playing files over a 
network, most reasonable quality videos will require about 100KB/s of 
bandwidth to play back smoothly. If you've got a connection fast 
enough for streaming video, you won't notice downloading an index.

You can also increase the interval between indexed keyframes to reduce 
the index size, though that erodes the benefit of the index for network 
playback.

I've implemented this in my indexer on a new branch on my GitHub account:
http://github.com/cpearce/OggIndex/tree/index-per-stream

New spec here:
http://github.com/cpearce/OggIndex/blob/index-per-stream/IndexSpecificationVersion1.txt 

Firefox builds which can handle the new index format here:
https://build.mozilla.org/tryserver-builds/cpearce@mozilla.com-try-4768e6238638/

Demo here:
http://pearce.org.nz/video/indexed-seek-demo.html

New Proposed Index Track Format:
<quote>

An Ogg index track starts with an identifier header packet which
contains the following data, in the following order:

   * The identifier "index\0".
   * The index version format number, as a 1 byte unsigned integer. This
     specification describes version 1, so this field should have the
     value 0x01.
   * The playback start time, in milliseconds, as an 8 byte unsigned
     integer, this is the presentation time of the first frame.
   * The playback end time, in milliseconds, as an 8 byte unsigned
     integer, this is the end time of the last frame.
   * The length of the indexed segment, in bytes, as an 8 byte unsigned
     integer.

The track then contains secondary header packets, which contain the
actual indexes. These are the "index packets", and each must begin on a
new page, but they may span multiple pages. There is one index packet
for each content stream in the Ogg segment, and they appear in
increasing order of the streams' serialno. Each index packet contains
the following:

   * The serialno of the stream as a 4 byte field.
   * The number of key points in the index packet, 'n', as a 4 byte
     unsigned integer.

   * 'n' key points, each of which contain, in the following order:
     - the page's byte offset as an 8 byte unsigned integer, followed by
     - the checksum of the page found at the offset, as a 4 byte field,
       followed by
     - the presentation time in milliseconds of the key point, as an 8
       byte unsigned integer.

The key points are stored in increasing order by offset. The
presentation time of the key point is calculated from the granulepos.
[...]

The last packet in the track is an empty EOS packet, which must start on
a new page.

</quote>
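
To make the layout concrete, here's a minimal Python sketch of packing and unpacking the packets described in the quoted text. It assumes little-endian fields (the excerpt above doesn't state a byte order) and treats the serialno and checksum as unsigned 32-bit integers. The function names are made up for illustration, and the sketch doesn't deal with Ogg page framing or with anything in the elided part of the spec.

```python
import struct

# ASSUMPTION: all fields are little-endian; the quoted text does not say.
HEADER = struct.Struct("<6sBQQQ")    # "index\0", version, start ms, end ms, segment length
INDEX_PREFIX = struct.Struct("<II")  # stream serialno, number of key points 'n'
KEY_POINT = struct.Struct("<QIQ")    # page offset, page checksum, presentation time ms

def pack_header(start_ms, end_ms, segment_length):
    """Build the identifier header packet payload (version 1)."""
    return HEADER.pack(b"index\0", 1, start_ms, end_ms, segment_length)

def pack_index_packet(serialno, key_points):
    """Build one index packet payload.

    key_points is an iterable of (offset, checksum, time_ms) tuples.
    They are sorted by offset, as the format requires.
    """
    ordered = sorted(key_points, key=lambda kp: kp[0])
    body = b"".join(KEY_POINT.pack(*kp) for kp in ordered)
    return INDEX_PREFIX.pack(serialno, len(ordered)) + body

def unpack_index_packet(data):
    """Parse an index packet payload into (serialno, [key points])."""
    serialno, n = INDEX_PREFIX.unpack_from(data, 0)
    key_points = [
        KEY_POINT.unpack_from(data, INDEX_PREFIX.size + i * KEY_POINT.size)
        for i in range(n)
    ]
    return serialno, key_points

packet = pack_index_packet(42, [(5000, 2, 2000), (100, 1, 0)])
print(unpack_index_packet(packet))
```

Under these assumptions the header payload is 31 bytes, and each index packet is 8 bytes plus 20 bytes per key point.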

Note that this format can be encoded in one pass. If you know the 
duration of the media, you can decide on the keyframe interval (say one 
every 2 seconds, which is roughly ffmpeg2theora's default for Theora 
anyway), allocate the required space in the index packets up front, and 
come back and fill it in once you've encoded the media.
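
A rough sketch of that reserve-then-fill idea in Python, under the same assumptions as before (little-endian fields, hypothetical function names). It only covers the index packet payload, and it doesn't decide what to do if the encoder ends up producing fewer key points than were reserved.

```python
import struct

# ASSUMPTION: little-endian fields; the byte order is not stated in this post.
PREFIX = struct.Struct("<II")      # stream serialno, number of key points
KEY_POINT = struct.Struct("<QIQ")  # page offset, page checksum, time in ms

def reserve_index_packet(serialno, duration_ms, interval_ms=2000):
    """Allocate a zero-filled index packet payload sized from the duration.

    Reserves one slot per interval, plus one for the key point at time 0.
    """
    n = duration_ms // interval_ms + 1
    buf = bytearray(PREFIX.size + n * KEY_POINT.size)
    PREFIX.pack_into(buf, 0, serialno, n)
    return buf

def fill_key_point(buf, i, offset, checksum, time_ms):
    """Write key point number i into a previously reserved payload."""
    KEY_POINT.pack_into(buf, PREFIX.size + i * KEY_POINT.size,
                        offset, checksum, time_ms)

# Reserve space for a 10 second stream, then fill in the first key point
# once the encoder knows where that page ended up.
buf = reserve_index_packet(7, 10_000)
fill_key_point(buf, 0, 100, 0xDEADBEEF, 0)
print(len(buf))
```

In a real file the reserved packet sits inside Ogg pages, so the checksums of those pages would need recomputing after the key points are filled in; the sketch ignores that.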

Comments? Questions etc?

Chris P.