[theora] Fixed Quantizer - Fixed Quality

Tue Mar 25 20:45:05 PST 2003

> From: Stan Seibert [mailto:volsung at mailsnare.net]
...
> Is there a reasonable "psycho-visual" model to work with?
> 
(in booming narrator voice:) "Well Stan, that's an excellent question!!"  

I'm just starting to review the present state of research (see my link in a previous post to the 'ITS' objective measurement stuff for instance -- I'm pretty impressed with their work so far).  In my own research, I've looked at frequency-banded PSNR, as well as modifications to PSNR to account for the fact that low-contrast scenes will have a much lower MSE for a given perceived error (presumably because the eye/brain is doing contrast adjustments on a region basis).  This is a big issue -- more on that later.  Quick point: PSNR is usually calculated with a presumed pixel value range of 0-255 [20 * log10(255 / sqrt(mse))].  What if the image only has a range of 50 to 200?  Shouldn't the formula then be 20 * log10(150 / sqrt(mse))?
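To make the quick point concrete, here is a minimal Python sketch of the two formulas side by side.  The function names are mine, and taking the peak from the min/max of the reference image is just one assumed way to pick the "actual" range; treat it as an illustration, not a settled metric.

```python
import math


def mse(ref, test):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(ref) != len(test) or not ref:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)


def psnr(ref, test, peak=255.0):
    """Conventional PSNR in dB: 20 * log10(peak / sqrt(mse))."""
    err = mse(ref, test)
    if err == 0:
        return math.inf  # identical images
    return 20.0 * math.log10(peak / math.sqrt(err))


def range_adjusted_psnr(ref, test):
    """PSNR with the peak replaced by the reference's own value range.

    Assumption: the range is max(ref) - min(ref); a robust percentile
    range might behave better on real images.
    """
    span = max(ref) - min(ref)
    if span == 0:
        raise ValueError("reference has zero range")
    return psnr(ref, test, peak=float(span))
```

For the same MSE the two numbers differ by the constant 20 * log10(255 / range), so an image spanning 50 to 200 scores about 4.6 dB lower under the adjusted formula.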

All of this raises the question: what exactly does the eye/brain do with an image?  One big problem that makes the video side harder than audio is that viewing conditions can vary so widely: everything from a movie theater (dark room with a large, hi-res screen) to looking at some multimedia on your iPAQ outside on a sunny day.

My general impression is that most people agree we perceive images through some sort of wavelet-like combined spatial/frequency decomposition.  Obviously, we have circuits to do feature extraction at various levels (edge detectors, etc.).  So my guess would be that we need to break the image down into reasonably sized areas (the size of the regions is very dependent on viewing conditions; the optimum probably corresponds to a specific angle of vision).  We also have to consider how to segment an image into regions without problems arising at the region boundaries.  Then, within these regions, we need to do some sort of frequency domain analysis, and empirically learn what the JNDs (Just Noticeable Differences) are for various types of distortion (noise, low-pass, phase distortion, quantization...), all normalized to the overall energy of the region.
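A rough sketch of what that region-by-region analysis could look like, in plain Python.  Everything here is an assumption on my part: the non-overlapping square tiles (which ignore the boundary problem entirely), the DCT as the frequency transform, and the grouping of coefficients into bands by u + v.  It only shows the shape of the computation, i.e. error energy per frequency band, normalized to the energy of the region.

```python
import math


def dct_1d(v):
    """Orthonormal DCT-II of a list of numbers (naive O(n^2) version)."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def regions(image, size):
    """Yield (row, col, tile) for non-overlapping size x size tiles.

    Partial tiles at the right and bottom edges are skipped.
    """
    h, w = len(image), len(image[0])
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            yield r, c, [row[c:c + size] for row in image[r:r + size]]


def band_error_ratios(ref_block, test_block, n_bands=4):
    """Error energy in each frequency band divided by the region's energy."""
    n = len(ref_block)
    ref_c = dct_2d(ref_block)
    diff = [[t - r for r, t in zip(rr, tr)] for rr, tr in zip(ref_block, test_block)]
    err_c = dct_2d(diff)
    # Total energy of the reference region; floored to avoid dividing by zero.
    energy = max(sum(c * c for row in ref_c for c in row), 1e-12)
    bands = [0.0] * n_bands
    for u in range(n):
        for v in range(n):
            # u + v runs from 0 to 2n - 2; map it onto n_bands equal-ish bands.
            b = (u + v) * n_bands // (2 * n - 1)
            bands[b] += err_c[u][v] ** 2
    return [e / energy for e in bands]
```

The ratios that come out are the quantities you would compare against empirically measured JNDs, one per band and per distortion type.  Whether the DC term belongs in the normalizing energy is an open choice; including it makes bright flat regions look very tolerant of error.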

In other words, we need a comprehensive model of allowable threshold distortions (as a function of total energy) in a combined spatial/frequency domain.  Then we can tune our codecs to produce errors that fall within those thresholds, allocating bits accordingly.
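As a toy illustration of "tune the codec to stay within the thresholds", here is a sketch that picks the coarsest quantizer step whose error energy, relative to the energy of the coefficients, stays under a given threshold.  The uniform quantizer, the candidate step list and the single scalar threshold are all hypothetical stand-ins; a real model would supply a threshold per band and per region.

```python
def quantize(coeffs, step):
    """Uniform quantization: snap each coefficient to the nearest multiple of step."""
    return [round(c / step) * step for c in coeffs]


def coarsest_step_within(coeffs, threshold, candidate_steps):
    """Return the largest candidate step whose relative error energy is
    at or below threshold, or None if no candidate qualifies.

    threshold is a hypothetical JND-style limit on
    (quantization error energy) / (coefficient energy).
    """
    energy = sum(c * c for c in coeffs)
    if energy == 0:
        return max(candidate_steps)  # nothing to preserve
    best = None
    for step in sorted(candidate_steps):
        quantized = quantize(coeffs, step)
        err = sum((c - q) ** 2 for c, q in zip(coeffs, quantized))
        # Error is not strictly monotonic in step, so test every candidate.
        if err / energy <= threshold:
            best = step
    return best
```

A coarser step means fewer bits, so running this per region gives a crude "fixed quality" allocation: regions that can hide more error get the larger step.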

Yeah, something like that sounds nice.

-dan