[xiph-cvs] r6144 - theora/trunk/doc

giles at xiph.org giles at xiph.org
Sat Mar 20 21:24:20 PST 2004



Author: giles
Date: 2004-03-21 00:24:19 -0500 (Sun, 21 Mar 2004)
New Revision: 6144

Added:
   theora/trunk/doc/spec.bib
   theora/trunk/doc/spec.tex
Log:
Initial specification-in-progress by Timothy Terriberry.

<p>Added: theora/trunk/doc/spec.bib
===================================================================
--- theora/trunk/doc/spec.bib	2004-03-21 02:01:40 UTC (rev 6143)
+++ theora/trunk/doc/spec.bib	2004-03-21 05:24:19 UTC (rev 6144)
@@ -0,0 +1,7 @@
+ at MISC{Mel04,
+  author="Mike Melanson",
+  title="{VP3} Bitstream Format and Decoding Process",
+  howpublished="\url{http://home.pcisys.net/~melanson/codecs/vp3-format.txt}",
+  month="Mar.",
+  year=2004
+}

<p>Property changes on: theora/trunk/doc/spec.bib
___________________________________________________________________
Name: svn:executable
   + *

Added: theora/trunk/doc/spec.tex
===================================================================
--- theora/trunk/doc/spec.tex	2004-03-21 02:01:40 UTC (rev 6143)
+++ theora/trunk/doc/spec.tex	2004-03-21 05:24:19 UTC (rev 6144)
@@ -0,0 +1,493 @@
+\documentclass[11pt,letterpaper]{article}
+
+\usepackage{latexsym}
+\usepackage{amssymb}
+\usepackage{amsmath}
+\usepackage{graphicx}
+\usepackage[pdfpagemode=None,pdfstartview=FitH,pdfview=FitH,colorlinks=true]%
+ {hyperref}
+
+\newtheorem{theorem}{Theorem}[section]
+\newcommand{\qi}{\ensuremath{\mathit{qi}}}
+
+\pagestyle{headings}
+\bibliographystyle{alpha}
+
+\title{Theora I Specification}
+\author{Xiph.org Foundation}
+\date{\today}
+
+\begin{document}
+
+\maketitle
+\tableofcontents
+\newpage
+
+\section{Introduction and Description}
+
+This section provides a high level description of the Theora codec's
+ construction.
+A bit-by-bit specification appears beginning in Section~\ref{sec:bitpacking}.
+The later sections assume a high-level understanding of the Theora decode
+ process, which is provided below.
+
+\subsection{Overview}
+
+Theora is a general purpose, lossy video codec.
+It is based off the VP3.2 video codec produced by On2 Technologies
+ (\url{http://www.on2.com/}).
+On2 Technologies has released the VP3.2 source code under a BSD license, along
+ with an irrevocable, royalty-free license to any patent claims it might have
+ over the software and any derivatives.
+No formal specification exists for the VP3.2 format beyond this source code,
+ though Mike Melanson maintains a detailed description \cite{Mel04}.
+Portions of this specification were adopted from his text with permission.
+
+\subsubsection{VP3.2 and Theora}
+
+Theora contains a superset of the features that were available in the original
+ VP3.2 codec.
+Content encoded with VP3.2 can be losslessly transcoded into the Theora format.
+Theora content cannot, in general, be losslessly transcoded into the VP3.2
+ format.
+If a feature is not available in the original VP3.2 format, this will be
+ mentioned when that feature is defined.
+A complete list of these features appears in Appendix~REF.
+
+\subsubsection{Video Formats}
+
+Theora I currently supports progressive video data of arbitrary dimensions in
+ one of several 8-bit $Y'C_bC_r$ color spaces.
+The precise definition the color spaces supported appears in Section~REF.
+Three different chroma subsampling formats are supported: 4:2:0, 4:2:2,
+ and 4:4:4.
+The precise details of each of these formats and their sampling locations are
+ described in Section~REF.
+
+The Theora I format does not support interlaced material, bit-depths larger
+ than 8 bits per component, nor alternate color spaces such as RGB or
+ arbitrary multi-channel spaces.
+Support for interlaced material is planned for a future version.
+Support for increased bit depths or additional color spaces is not being
+ considered as of the time of this writing.
+
+\subsubsection{Classification}
+
+Theora I is a block-based lossy transform codec that utilizes an
+ $8\times 8$ Type-II Discrete Cosine Transform and block-based motion
+ compensation.
+This places it in the same class of codecs as MPEG-1, -2, -4, and H.263.
+The details of how individual blocks are organized and how DCT coefficients are
+ organized in the bitstream differ from these codecs substantially, however.
+Theora supports only intra frames (I frames in MPEG) and inter frames (P frames
+ in MPEG).
+In the current version of Theora, there is no equivalent to the bi-predictive
+ frames (B frames) found in MPEG codecs.
+
+\subsubsection{Assumptions}
+
+The Theora codec design assumes a complex, psychovisually-aware encoder and a
+ simple, low-complexity decoder.
+%TODO: Talk more about implementation complexity.
+
+Theora provides none of its own framing, synchronization, or protection against
+ errors; it is solely a method of accepting input video frames and compressing
+ these frames into raw, unformatted `packets'.
+The decoder then accepts these raw packets in sequence, decodes them,
+ and synthesizes a fascimile of the original video frames from them.
+Theora is a free-form variable bit rate (VBR) codec, and packets have no
+ minimum size, maximum size, or fixed/expected size.
+
+Theora packets are thus intended to be used with a transport mechanism that
+ provides free-form framing, synchronization, positioning, and error correction
+ in accordance with these design assumptions, such as Ogg (for file transport)
+ or RTP (for network multicast).
+For the purposes of a few examples in this document, we will assume that Theora
+ is to embedded in an Ogg stream specifically, although this is by no means a
+ requirement or fundamental assumption in the Theora design.
+
+The specification for embedding Theora into an Ogg transport stream is in
+ Appendix~REF.
+
+\subsubsection{Codec Setup and Probability Model}
+
+Theora's heritage is the proprietary commerical codec VP3.2, and thus maintains
+ a fair amount of inflexibility when compared to the first Xiph.org codec,
+ Vorbis, which was designed as a research codec.
+However, in order to allow some room for encoder improvement, Theora adopts
+ some of the configurable aspects of codec setup that are present in Vorbis.
+
+Theora makes the same controversial design decision that Vorbis made to include
+ the entire probability model for the DCT coefficients and all the quantization
+ parameters in the bitstream headers.
+This is often several hundred fields.
+This makes it impossible to begin decoding at any frame in the stream without
+ having previously fetched the codec info and codec setup headers.
+
+\begin{verse}
+{\bf Note:} Theora {\em can} initiate decode at an arbitrary intra-frame packet
+ within a bitstream so long as the codec has been initialized with the setup
+ headers.
+\end{verse}
+
+Thus, Theora headers are both required for decode to begin and relatively large
+ as bitstream headers go.
+The header size is unbouded, although for stream a rule-of-thumb of 16kB or
+ less is recommended, and Xiph.org's Theora encoder follows this suggestion.
+%TODO: Is 8kB enough? My setup header is 7.4kB, that doesn't leave much room
+% for comments.
+
+Our own design work indicates that the primary liability of the required header
+ is in mindshare; it is an unusual design and thus causes some amount of
+ complaint among engineers as this runs against current design trends, and also
+ points out limitations in some existing software/interface designs.
+However, we find that it does not fundamentally limit Theora's suitable
+ application space.
+
+\subsubsection{Format Specification}
+
+The Theora format is well-defined by its decode specification; any encoder that
+ produces packets that are correctly decoded by an implementation following
+ this specification may be considered a proper Theora encoder.
+A decoder must faithfully and completely implement the specification defined
+ herein %, except where noted,
+ to be considered a proper Theora decoder.
+Where appropriate, a non-normative description of encoder processes may be
+ described.
+These sections will be marked as such, and a proper Theora encoder is not
+ bound to follow them.
+
+%TODO: \subsubsection{Hardware Profile}
+
+\subsection{Decoder Configuration}
+
+Decoder setup consists of configuration of the quantization matrices and the
+ Huffman codebooks for the DCT coefficients.
+The remainder of the decoding pipeline is not configurable.
+
+\subsubsection{Global Configuration}
+
+The global codec configuration consists of a few video related fields, such as
+ frame rate, frame size, picture size and offset, aspect ratio, color space,
+ pixel format, and a version number.
+The version number is divided into a major version, a minor version, amd a
+ minor revision number.
+These are `3', `2', and `0', respectively, due to Theora's origin as the VP3.2
+ codec.
+
+\subsubsection{Quantization Matrices}
+
+Theora allows up to 384 different quantization matrices to be defined, one for
+ each {\em quantization type} (intra or inter), {\em color plane} ($Y'$, $C_b$,
+ or $C_r$), and {\em quantization index}, \qi, which ranges from zero to 63,
+ inclusive.
+The quantization index generally represents a progressive range of quality
+ levels, from low quality near zero to high quality near 63.
+However, the interpretation is arbitrary, and it is possible, for example, to
+ partition the scale into two completely separate ranges with 32 levels each
+ that are meant to represent different classes of source material.
+
+Each quantization matrix is an $8\times 8$ matrix of 16-bit values, which is
+ used to quantize the output of the $8\times 8$ DCT.
+Quantization matrices are specified using three components: a {\em base matrix}
+ and two {\em scale values}.
+The first scale value is the {\em DC scale}, which is applied to the DC
+ component of the base matrix.
+The second scale value is the {\em AC scale}, which is applied to all the other
+ components of the base matrix.
+
+There are 64 DC scale values and 64 AC scale values, one for each \qi value.
+There is a set of base matrices for each quantization type and each color
+ plane.
+The bitstream specifies this set by defining a base matrix for a sparse subset
+ of the possible \qi values, including at least zero and 63.
+The base matrices for the remainder of the \qi values are computed using linear
+ interpolation.
+This configuration allows the quantization matrices to approximate the complex,
+ non-linear processes of the human visual system as the \qi value varies.
+
+Finally, because the in-loop deblocking filter strength depends on the strength
+ of the quantization matrices defined in this header, a table of 64 {\em loop
+ filter limit values} is defined, one for each \qi value.
+
+The precise specification of how all of this information is decoded appears in
+ Section~REF.
+
+\subsubsection{Huffman Codebooks}
+
+Theora uses 80 configurable binary Huffman codes to represent the 32 tokens
+ used to encode DCT coefficients.
+Each of the 32 token values has a different semantic meaning and is used to
+ represent single coefficient values, zero runs, combinations of the two, and
+ `end-of-block' markers.
+
+The 80 codes are divided up into five groups of 16, with each group
+ corresponding to a set of DCT coefficient indices.
+The first group corresponds to the DC coefficient, while the remaining groups
+ correspond to different subsets of the AC coefficients.
+Within each frame, two 4-bit codebook indices are stored.
+The first selects which codebook to use from the DC coefficient group, while
+ the second selects which codebook to use from {\em all} of the AC coefficient
+ groups.
+
+The precise specification of how the codebooks are decoded appears in
+ Section~REF.
+
+\section{Bitpacking Convention}
+\label{sec:bitpacking}
+
+\subsection{Overview}
+
+The Theora codec uses relatively unstructured raw packets containing
+ binary integer fields of arbitrary width.
+Logically, each packet is a bitstream in which bits are written one-by-one by
+ the encoder and then read one-by-one in the same order by the decoder.
+Most current binary storage arrangements group bits into a native storage unit
+ of eight bits (octets), sixteen bits, thirty-two bits, or less commonly other
+ fixed sizes.
+The Theora bitpacking convention specifies the correct mapping of the logical
+ packet bitstream into an actual representation in fixed-width units.
+
+\subsubsection{Octets and Bytes}
+
+In most contemporary architectures, a `byte' is synonymous with an `octect',
+ that is, eight bits.
+This has not always been the case; seven, ten, eleven, and sixteen bit `bytes'
+ have been used.
+For purposes of the bitpacking convention, a byte implies the smallest native
+ integer storage representation offered by a platform.
+On modern platforms, this is generally assumed to be eight bits, not
+ necessarily because of the processor, but because of the file system/memory
+ architecture.
+Modern file systems invariably offer bytes as the fundamental atom of storage.
+
+The most ubiquitous architectures today consider a `byte' to be an octet.
+Note, however, that the Theora bitpacking convention is still well defined for
+ any native byte size;
+Theora uses the native bit-width of a given storage system.
+This document assumes that a byte is one octet for purposes of example only.
+
+\subsubsection{Words and Byte Order}
+
+A `word' is an integer size that is a grouped multiple of the byte size.
+Most architectures consider a word to be a group of two, four, or eight bytes.
+Each byte in the word can be ranked by order of `significance', e.g. the
+ significance of the bits in each byte when storing a binary integer in the
+ word.
+Several byte orderings are possible in a word.
+The common ones are
+\begin{itemize}
+\item{Big-endian:}
+in which the most significant byte comes first, e.g. 3-2-1-0,
+\item{Little-endian:}
+in which the least significant byte comes first, e.g. 0-1-2-3, and less
+ commonly
+\item{Mixed-endian:}
+e.g. 3-1-2-0 or 0-2-1-3.
+\end{itemize}
+
+The Theora bitpacking convention specifies storage and bitstream manipulation
+ at the byte, not word, level.
+Thus host word ordering is of a concern only during optimization, when writing
+ code that operates on a word of storage at a time rather than a byte.
+Logically, bytes are always encoded and decoded in order from byte zero through
+ byte $n$.
+
+\subsubsection{Bit Order}
+
+A byte has a well-defined `least significant' bit (LSb), which is the only bit
+ set when the byte is storing the two's complement integer value $+1$.
+A byte's `most significant' bit (MSb) is at the opposite end of the byte.
+Bits in a byte are numbered from zero at the LSb to $n$ for the MSb, where
+ $n=7$ in an octet.
+
+\subsection{Coding Bits into Bytes}
+
+The Theora codec needs to encode arbitrary bit-width integers from zero to 32
+ bits wide into packets.
+These integer fields are not aligned to the boundaries of the byte
+ representation; the next field is read at the bit position immediately
+ after the end of the previous field.
+
+The decoder logically unpacks integers by first reading the MSb of a binary
+ integer from the logical bitstream, followed by the next least significant
+ bit, etc., until the requested number of bits have been read.
+When unpacking the bytes into bits, the decoder begins by reading the MSb of
+ the integer to be read from the most significant unread bit position of the
+ source byte, followed by the next-most significant bit position of the
+ destination integer, and so on up to the requested number of bits.
+Note that this is a change from the earlier Xiph.org codec, Vorbis I, which
+ begins encoding from the LSb of the source integer, reading it from the LSb of
+ the source byte.
+When all the bits of the current source byte are read, decoding continues with
+ the MSb of the next byte.
+Any unfilled bits in the last byte of the packet must be cleared to zero by the
+ encoder.
+
+\subsubsection{Signedness}
+
+The binary integers decoded by the above process may be either signed or
+ unsigned.
+This may change from integer to integer, and this document will specify how
+ each should be interpreted as it is read.
+That is, depending on context, the three bit binary pattern `b111' can be taken
+ to represent either `$7$' as an unsigned integer or `$-1$' as a signed, two's
+ complement integer.
+
+\subsubsection{Encoding Example}
+
+The following example shows the state of an (8-bit) byte stream after several
+ binary integers are encoded, including the location of the put pointer for the
+ next bit to write to and the total length of the stream in bytes.
+
+Encode the 4 bit unsigned integer value `12' (b1100) into an empty byte stream.
+
+\begin{tabular}{rccccccccl}
+                    &   &   &   &   &$\downarrow$&&&&            \\
+        \hfill\vline& 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 &            \\\cline{1-9}
+byte 0  \hfill\vline& 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 &$\leftarrow$\\
+byte 1  \hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &            \\
+byte 2  \hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &            \\
+byte 3  \hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &            \\
+\hfill$\vdots$\hfill\vline&\multicolumn{8}{c}{$\vdots$}&         \\
+byte $n$\hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &
+byte stream length: 1 byte
+\end{tabular}
+\vspace{\baselineskip}
+
+Continue by encoding the 3 bit signed integer value `-1' (b111).
+
+\begin{tabular}{rccccccccl}
+                    &   &   &   &   &   &   &   &$\downarrow$&   \\
+        \hfill\vline& 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 &            \\\cline{1-9}
+byte 0  \hfill\vline& 1 & 1 & 0 & 0 & 1 & 1 & 1 & 0 &$\leftarrow$\\
+byte 1  \hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &            \\
+byte 2  \hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &            \\
+byte 3  \hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &            \\
+\hfill$\vdots$\hfill\vline&\multicolumn{8}{c}{$\vdots$}&         \\
+byte $n$\hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &
+byte stream length: 1 byte
+\end{tabular}
+\vspace{\baselineskip}
+
+Continue by encoding the 7 bit integer value `17' (b0010001).
+
+\begin{tabular}{rccccccccl}
+                    &   &   &   &   &   &   &$\downarrow$&&      \\
+        \hfill\vline& 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 &            \\\cline{1-9}
+byte 0  \hfill\vline& 1 & 1 & 0 & 0 & 1 & 1 & 1 & 0 &            \\
+byte 1  \hfill\vline& 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 &$\leftarrow$\\
+byte 2  \hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &            \\
+byte 3  \hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &            \\
+\hfill$\vdots$\hfill\vline&\multicolumn{8}{c}{$\vdots$}&         \\
+byte $n$\hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &
+byte stream length: 2 bytes
+\end{tabular}
+\vspace{\baselineskip}
+
+Continue by encoding the 13 bit integer value `6969' (b11011 00111001).
+
+\begin{tabular}{rccccccccl}
+                    &   &   &   &$\downarrow$&&&&   &            \\
+        \hfill\vline& 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 &            \\\cline{1-9}
+byte 0  \hfill\vline& 1 & 1 & 0 & 0 & 1 & 1 & 1 & 0 &            \\
+byte 1  \hfill\vline& 0 & 1 & 0 & 0 & 0 & 1 & 1 & 1 &            \\
+byte 2  \hfill\vline& 0 & 1 & 1 & 0 & 0 & 1 & 1 & 1 &            \\
+byte 3  \hfill\vline& 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 &$\leftarrow$\\
+\hfill$\vdots$\hfill\vline&\multicolumn{8}{c}{$\vdots$}&         \\
+byte $n$\hfill\vline& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &
+byte stream length: 4 bytes
+\end{tabular}
+\vspace{\baselineskip}
+
+\subsubsection{Decoding Example}
+
+The following example shows the state of the (8-bit) byte stream encoded in the
+ previous example after several binary integers are decoded, including the
+ location of the get pointer for the next bit to read.
+
+Read a two bit unsigned integer from the example encoded above.
+
+\begin{tabular}{rccccccccl}
+                    &   &   &$\downarrow$&&&&   &   &            \\
+        \hfill\vline& 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 &            \\\cline{1-9}
+byte 0  \hfill\vline& 1 & 1 & 0 & 0 & 1 & 1 & 1 & 0 &$\leftarrow$\\
+byte 1  \hfill\vline& 0 & 1 & 0 & 0 & 0 & 1 & 1 & 1 &            \\
+byte 2  \hfill\vline& 0 & 1 & 1 & 0 & 0 & 1 & 1 & 1 &            \\
+byte 3  \hfill\vline& 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 &
+byte stream length: 4 bytes
+\end{tabular}
+\vspace{\baselineskip}
+
+Value read: 3 (b11).
+
+Read another two bit unsigned integer from the example encoded above.
+
+\begin{tabular}{rccccccccl}
+                    &   &   &   &   &$\downarrow$&&&            &            \\
+        \hfill\vline& 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 &            \\\cline{1-9}
+byte 0  \hfill\vline& 1 & 1 & 0 & 0 & 1 & 1 & 1 & 0 &$\leftarrow$\\
+byte 1  \hfill\vline& 0 & 1 & 0 & 0 & 0 & 1 & 1 & 1 &            \\
+byte 2  \hfill\vline& 0 & 1 & 1 & 0 & 0 & 1 & 1 & 1 &            \\
+byte 3  \hfill\vline& 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 &
+byte stream length: 4 bytes
+\end{tabular}
+\vspace{\baselineskip}
+
+Value read: 0 (b00).
+
+Two things are worth noting here.
+\begin{itemize}
+\item
+Although these four bits were originally written as a single four-bit integer,
+ reading some other combination of bit-widths from the bitstream is well
+ defined.
+There are no artificial alignment boundaries maintained in the bitstream.
+\item
+The first value is the integer `$3$' only because the context stated we were
+ reading an unsigned integer.
+Had the context stated we were reading a signed integer, the returned value
+ would have been the integer `$-1$'.
+\end{itemize}
+
+\subsubsection{End-of-Packet Alignment}
+
+The typical use of bitpacking is to produce many independent byte-aligned
+ packets which are embedded into a larger byte-aligned container structure,
+ such as an Ogg transport bitstream.
+Externally, each bitstream encoded as a byte stream must begin and end on a
+ byte boundary.
+Often, the encoded bitstream is not an integer number of bytes, and so there is
+ unused space in the last byte of a packet.
+
+Unused space in the last byte of a packet is always zeroed during the encoding
+ process.
+Thus, should this unused space be read, it will return binary zeroes.
+There is no marker pattern or stuffing bits that will allow the decoder to
+ obtain the exact size, in bits, of the original bitstream.
+This knowledge is not required for decoding.
+
+Attempting to read past the end of an encoded packet results in an
+ `end-of-packet' condition.
+Any further read operations after an `end-of-packet' condition shall also
+ return `end-of-packet'.
+Unlike Vorbis, Theora does not use truncated packets as a normal mode of
+ operation.
+Therefore if a decoder encounters the `end-of-packet' condition during normal
+ decoding, it may attempt to use the bits that were read to recover as much of
+ encoded data as possible, signal a warning or error, or both.
+
+\subsubsection{Reading Zero Bit Integers}
+
+Reading a zero bit integer returns the value `$0$' and does not increment
+ the stream pointer.
+Reading to the end of the packet, but not past the end, so that an
+ `end-of-packet' condition is not triggered, and then reading a zero bit
+ integer shall succeed, returning `$0$', and not trigger an `end-of-packet'
+ condition.
+Reading a zero bit integer after a previous read sets the `end-of-packet'
+ condition shall fail, also returning `end-of-packet'.
+
+\bibliography{spec}
+
+\end{document}

<p>Property changes on: theora/trunk/doc/spec.tex
___________________________________________________________________
Name: svn:executable
   + *

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'cvs-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.



More information about the commits mailing list