[vorbis] Video codec

Jelle Foks jelle-foks at list.dedris.nl
Mon Sep 11 13:58:15 PDT 2000



Thomas Marshall Eubanks wrote:
> 
> Dear Jelle;
> 
>    Very nice summary. I have two comments (in -line).

I guess it would be good to find some set of limits/requirements of the
video codec per application area, such as:

Videoconferencing: xx..yy bits per pixel, <nnn ms delay, Q-level
subjective image quality, comparable to or better than standard XYZ at
xxx bps.

Broadcast TV: xx..yy bits per pixel, fixed bit-rate comparable to or
better than standard XYZ at xxx bps, variable bit-rate with a ceiling at
NNN % of the nominal bit-rate.

Recorded Video streams:  ....

Low frame rate webcams: .....

Email attached videos: .....

Moving parts of web pages: ....

etc, etc...

That may help the developers see if they've satisfied any potential
use(r)s yet.

More comments below, in-line.

>                                    Marshall Eubanks
> 
> Jelle Foks wrote:
> 
> > Just for clarity, so that we have the correct terminology and numbers,
> > and I'll raise some issues that I think should be considered when
> > designing Ogg Video.
> >
> > Digital Broadcast Quality Video is described in CCIR601/656, which is
> > basically the following:
> >
> > Active Frame Size | Frame Rate   | Subsampling | Active pixels/second
> > ------------------+--------------+-------------+---------------------
> > NTSC: 720x480     | 1000/1001*30 |  4:2:2      | ~10.3M
> > PAL:  720x576     | 25           |  4:2:2      | ~10.3M
> >
> > A 'frame' is a full image of video. In interlaced video, a frame
> > consists of two fields, the even field and the odd field.
> >
> > The video signals are encoded in the YCbCr color space (Luminance +
> > Chrominance-Blue + Chrominance-Red). Each of the color components Y,
> > Cb, or Cr is called a 'subpixel'. A subpixel in CCIR601/656 has a
> > precision of 8 bits.
> > The subsampling of CCIR601/656 is called '4:2:2' subsampling in 'MPEG
> > terms', and means that the chrominance pixels are decimated by a
> > factor of two in the horizontal direction. The result is that color
> > has only half resolution in the horizontal direction (360x480 NTSC /
> > 360x576 PAL). To be honest, this subsampling is already the first step
> > of lossy compression, by a factor of ((1+1+1)*8)/((1+0.5+0.5)*8)=1.5,
> > because a 24bpp image is described with an average of 16 bits per
> > pixel after reduction of the chrominance resolution.
> >
> > The number of 216Mbit/s mentioned here is CCIR601/656 video data
> > including the blanking and retrace interval overhead (a CCIR601/656
> > video stream also contains non-active pixels, because it also contains
> > the timing so that the video data can easily be transformed to and
> > from the analog domain).
> >
> > My opinion is that, when discussing video compression, it is confusing
> > to speak of 'compression ratios', because it is never clear whether
> > compression ratio before or after subsampling is meant, and whether or
> > not non-active pixels were counted in the non-compressed stream.
> >
> > A factor of 100 compression of the 216Mbit/s stream would result in a
> > 2.16Mbit/s stream. However, a factor of 100 compression of the active
> > CCIR601/656 video pixels would result in a 10.3M*16/100 = 1.65Mbit/s
> > stream. There is a 24% difference between the two numbers.
> >
> > I suggest using the term 'bits per pixel' to quantify the compression
> > ratio. With that number there are no ambiguities, and it's easy to
> > calculate the resulting video bit-rate given the video image resolution.
> > 'D1 at 1.5Mbps' is approx 0.15 bits per pixel, 'D1 at 3Mbps' is approx
> > 0.3 bits per pixel.
> >
> > Rough numbers: With JPEG compression, you get between 1-5 bits per
> > pixel; JPEG is mostly used in the range of 1-2 bits per pixel. JPEG2000
> > claims to get 4-8x better compression than JPEG; if that is true, it's
> > about the range of 0.15-1.25 bits per pixel. With MPEG compression, you
> > can get between 0.15-1.5 bits per pixel, depending on the encoder and
> > image quality of course (and the MPEG version: MPEG1, MPEG2, or MPEG4).
> > When counting uncompressed video as 24 bits per pixel, this explains the
> > claimed 100x compression of MPEG video at 0.24 bits per pixel. Below
> > 0.15 bits per pixel is often very aggressive coding for applications
> > such as video conferencing, in which case large parts of the image are
> > left completely unchanged (H.263/H.26L).
> >
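As a sanity check on the bits-per-pixel arithmetic above, here is a small
sketch in Python (my own numbers-only check, not from any spec or codec):

```python
# Relate active pixel rate, bits per pixel, and stream bit-rate for
# CCIR601/656-style video.

def active_pixels_per_second(width, height, frame_rate):
    """Active pixels per second, ignoring blanking/retrace overhead."""
    return width * height * frame_rate

ntsc = active_pixels_per_second(720, 480, 30 * 1000 / 1001)  # ~10.36M
pal = active_pixels_per_second(720, 576, 25)                 # ~10.37M

# 4:2:2 subsampling: Y at full resolution, Cb and Cr halved horizontally,
# 8 bits per subpixel -> (1 + 0.5 + 0.5) * 8 = 16 bits per pixel.
bpp_422 = (1 + 0.5 + 0.5) * 8

# 'D1 at 1.5Mbps' expressed as bits per active pixel:
print(1.5e6 / ntsc)   # ~0.145
# The 1.44Mbit/s CD-ROM budget mentioned below:
print(1.44e6 / ntsc)  # ~0.139
```

Note how the ~0.145 figure matches the 'D1 at 1.5Mbps is approx 0.15 bits
per pixel' rule of thumb above.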
> 
> I think that the end of this should read
> 
> "in which case large parts of the image are left
> completely unchanged FROM FRAME TO FRAME"
> 
> The thing missing from this discussion is that aggressive compression of
> video always depends in some fashion on only encoding the difference
> between frames, not doing each frame from scratch. The efficiency of
> this depends on what signal is being encoded - good for talking heads on
> NPR, not so good for The Talking Heads in concert, much less NBA
> basketball or world cup soccer. The difference can easily be a factor of
> 3 to 6, even with MPEG type codecs that allow for blocks or objects to
> move from frame to frame. This means that some sort of "bit bucket"
> would be very useful in a full motion video codec, where more time is
> spent sending active scenes than passive ones. To do this means that you
> will run behind real time.

Either that, or variable bit-rate with a high ceiling (hmm, 'basketball
needs a high ceiling') and a large bit-pipe between transmitter and
receiver to allow that high ceiling, which in most cases is only
feasible when playing back from DVD, CDROM, or hard drive.

> Question : Is the Vorbis Video Codec to be used for video conferences or
> NBA basketball ?
> 
> (I would argue for NBA).

I would argue for both though.

> If for video conferencing, the time delay MUST be kept below about 200
> milliseconds, but the need for motion detection is reduced.

I suspect that negotiable options such as those found in the H.26x codecs
should be used to choose between 'low delay video conferencing' and 'high
quality TV'. Possibly different modes of the video stream: 'This is a
broadcast Ogg Video stream for decoders of Class N and higher only',
versus 'Let's use these Q settings for this video link'.

> If for the NBA, you should decide how far behind real time you are
> willing to run (I would argue for at least 1 second), and provide at
> least the hooks for a bit bucket.

And, for point-to-point transmissions, never forget to ask the receiver
what its maximum supported bit-bucket size is, and tell the receiver how
much to fill the bucket before it starts playing. We don't want the
'realaudio' effect on slow links, where sound and video are choppy at the
beginning of the transmission due to 'bit bucket underruns' (at least,
I've encountered that annoying effect sometimes with realaudio).
For multipoint broadcasting, use something comparable to the MPEG
'levels' to limit the bit-bucket size per transmission type.
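A toy sketch of that prebuffer calculation (my own model, not any real Ogg
or MPEG mechanism): given the compressed size of each one-second chunk of
video and a constant channel bit-rate, how many bits must the receiver
buffer before it starts playing so the bucket never underruns?

```python
def min_prebuffer_bits(chunk_bits, channel_bps):
    """Minimum startup buffer, in bits, to avoid underrun.

    After prebuffering B bits, by the end of played second t the decoder
    has consumed sum(chunk_bits[:t]) bits and received B + t * channel_bps
    bits.  B must cover the worst deficit over the whole stream.
    """
    need = 0
    consumed = 0
    for t, bits in enumerate(chunk_bits, start=1):
        consumed += bits
        need = max(need, consumed - t * channel_bps)
    return need

# Easy scene, then a hard-to-encode action scene, then easy again,
# over a 1.5Mbit/s channel:
chunks = [1_000_000, 1_000_000, 3_000_000, 3_000_000, 1_000_000]
print(min_prebuffer_bits(chunks, channel_bps=1_500_000))  # 2000000
```

So with these made-up numbers the receiver would have to buffer 2Mbit
(over a second of channel time) before playing - exactly the kind of
figure the transmitter would need to announce up front.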

> If you say "both", then IMHO you need to think about what you are trying
> to accomplish.
> 
> >
> > I think if we want to compete based on compression ratio, then we should
> > somehow get at 0.1 bits per pixel or below. A CDROM is approx
> > 650x8=5.2Gbits, so for an hour of video you have 5200/3600=1.44Mbits/s,
> > which would dictate a compression to below approx (1.44/10.3)=0.14 bits
> > per pixel if there is to be any room left for audio etc.
> >
> > Of course it's easy to get 0.14 bits per pixel if there is no quality
> > requirement... When comparing compression methods, image quality is
> > often measured in PSNR (dB) or MSE (mean squared error). A compression
> > method can be considered better if it achieves better PSNR/MSE at
> > similar bit rates, or lower bit rates at similar PSNR/MSE. So, when
> > introducing a video compression method with amazing bit-rates, it can be
> > proven to have better quality than the alternatives by comparing the
> > PSNR/MSE at various bit-rates. Of course, the effectiveness of PSNR or
> > MSE as image quality measure is a point of discussion, so there is
> > always still room for interpretation of the numbers (note that there are
> > other measurement methods that attempt to give better numbers, there's
> > even an expert group (www.crc.ca/vqeg)).
> 
> When you start doing compression methods based on our vision system, the
> eye then becomes the best tool to measure performance. MSE or RMS type
> error metrics, although a routine metric for the performance of typical
> thermal-noise-based transmission channels, can give ridiculous results
> in this case.
> 
> Here is a simple example :
> 
> Suppose you have a black and white TV system with 3.5 million pixels
> and 256 gray levels per pixel, and an average gray level value of 128.
> Now, suppose you compress by one method, which causes every pixel value
> to be off by one, randomly. Your eye would hardly notice this, and the
> MSE is 1. If a different compression sets a block of 10 by 10 pixels to
> all black or all white, right in the center of the screen, but
> perfectly renders all the other pixels, then the MSE is
> 
> (10^2 x 128^2) / (3.5 x 10^6) = 0.468 (RMS pixel error is 0.684)
> 
> The MSE prefers the second compression method, the eye would strongly
> prefer the first.

I guess we all know the 'fool the eyes' images that demonstrate a number
of such effects that are sometimes even harder to pinpoint. Perhaps the
effects behind those images can be exploited to get better compression.

> My deeper point here is that you cannot escape listening / viewing
> trials when you are talking about compression methods tuned to our
> physical sensors (ears/eyes and brains). Done the conventional way,
> these are expensive. If the open source movement could extend to open
> evaluation of codecs by large numbers of people, then it could develop
> an incredible advantage here - evaluating new codec improvements in
> days, not months.

I agree. As long as we find a large enough group of people ready to
evaluate, and to collect/supply video footage of 'problem scenes', we may
be able to build the 'golden eyes' that help achieve this.

> > Ok, then there is the issue of variable or fixed bit-rate and variable
> > or fixed quality and encoder and buffering latency. If you have a
> > variable bit-rate encoder for a fixed quality stream, or a fixed
> > bit-rate encoder for a variable quality stream, then you can keep the
> > buffers small to reduce the latency. However, if you put a maximum on
> > the bit-rate, and don't want to accept occasionally reduced image
> > quality of the video, then you will need buffering to even out the
> > bit-rate on the hard-to-encode pieces of video, which of course
> > introduces latency. When buffering is needed, the decoder must know how
> > much to fill the buffer before starting to display to ensure that later
> > on, during display it never has to wait for compressed data to be
> > received during the hard-to-compress video scenes. Additionally, there
> > may be a limitation on the buffer size that is economical in the decoder
> > (especially in hardware, RAM=money). The MPEG standards include a scheme
> > to control this, centered around the 'video buffer verifier (VBV)'.  I
> > think Ogg video should address this issue as well.
> >
> 
> This requires some idea of how far behind real time you are running (see above),

Which may be annoying for live transmissions and two-way communications,
but much less of a problem for pre-recorded stuff.

It's all a matter of choice. I think that delay/latency, bit-rate
ceiling, image size and quality, frame rate, etc. all should be
changeable parameters (some users may find delay more annoying, others
may really like good video quality, others may accept a very low frame
rate, others may be ok with small resolutions). There are a lot of
factors here dependent on user preference and situation (maybe the
tradeoff between latency and compression is different on a modem than it
is on an ADSL, or wireless?). I think these issues can be kept out of
the way of developing the compression algorithm by making them tunable
afterwards. I think that using some example settings that are preferred
by the people most involved with development will be just fine for a
while. But I think there will always be people with different
preferences. I think it's best (read: easiest for the developers) to
let the high-level application writers, or the users themselves choose
which settings they like best for their use of the video codec. Just
remember to allow for the choice when developing the codec.
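To make that concrete, here is a purely hypothetical sketch of such a
tunable parameter set (none of these names come from any real Ogg video
spec; they just illustrate keeping the tradeoffs outside the core
algorithm):

```python
from dataclasses import dataclass

@dataclass
class CodecSettings:
    """Per-use tradeoffs, chosen by the application or user, not the codec."""
    max_bitrate_bps: int   # bit-rate ceiling (0 = unconstrained VBR)
    max_latency_ms: int    # end-to-end delay budget
    frame_rate: float
    width: int
    height: int

# Two very different uses of the same hypothetical codec:
conferencing = CodecSettings(max_bitrate_bps=128_000, max_latency_ms=200,
                             frame_rate=15.0, width=176, height=144)
broadcast = CodecSettings(max_bitrate_bps=4_000_000, max_latency_ms=2_000,
                          frame_rate=25.0, width=720, height=576)
```

The point is only that the compression algorithm itself need not know
which of these profiles it is running under.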

Cya,

Jelle.

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.


