[vorbis-dev] Thought for the new year
Segher Boessenkool
segher at wanadoo.nl
Wed Dec 27 17:33:32 PST 2000
Gregory Maxwell wrote:
>
> On Thu, Dec 28, 2000 at 01:57:27AM +0100, Segher Boessenkool wrote:
> > I disagree. Look at what an MDCT does to an attack... It gets a very
> > flat spectrum. That's no good. You want smaller (effective) windows
> > at higher frequencies to adjust to the highly dynamic range of sound.
> > For video, you don't care that much, as in the presence of bright
> > pixels, the dim pixels will be invisible; in audio, this is true only
> > _sometimes_.
>
> But what does an attack 'sound' like?
The start of a sound. And the start (especially when tonal) is most
important for the perception of the sound.
> A rigid block size MDCT is not the perfect transform for human perception,
> but I don't think it's that bad. An adaptive spectrogram with suitable
> backend processing would be better, but is FAR too slow for our purposes.
But we can get closer than MDCT without too much cost.
Example:
Say the outputs of your MDCT are b_j, 0 <= j < 2N.
We calculate b1_j = b_{2j} + b_{2j+1} and
             b2_j = b_{2j} - b_{2j+1}
Now b1_j and b2_j have less frequency resolution, but b1_j has a window
which is effectively only the first half of the window of the b_j, and
b2_j the second half. You can do this for only part of the frequencies
as well, and do it recursively. This is not expensive (O(n), where the
MDCT itself is O(n log n)), and gives better results when quantized
(see the sketch below). Unfortunately, this exact implementation is
probably patented (the generalized lapped biorthogonal transform, or
something like that). But not everything between the FFT and the
wavelet transforms is patented...
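To make that concrete, here is a minimal C sketch of just the
sum/difference step (the function name, the float buffers, and the
1/sqrt(2) scaling that keeps the step orthogonal are my own choices,
not anything taken from an existing codec):

  /*
   * Pairwise sum/difference on 2n MDCT outputs b_0 .. b_{2n-1}.
   * b1[] and b2[] each receive n coarser-frequency coefficients;
   * b1 is effectively windowed over the first half of the frame,
   * b2 over the second half.  The 1/sqrt(2) factor is an assumed
   * normalization to keep the step orthogonal.
   */
  static void split_time_halves(const float *b, float *b1, float *b2, int n)
  {
      int j;
      for (j = 0; j < n; j++) {
          b1[j] = 0.70710678f * (b[2*j] + b[2*j + 1]);
          b2[j] = 0.70710678f * (b[2*j] - b[2*j + 1]);
      }
  }

Each pass costs only n additions and n subtractions, and you can apply
it again to b1 and b2 (or to just a band of coefficients) to get
progressively finer effective time resolution where you want it.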
> I think the greater problem is that when you quantize the MDCT of a sample
> with transients, the result sounds more dissimilar than you would expect
> from the level of quantization, mostly due to de-localization.
That's what I (am trying to) say.
> Ideally you would analyze the signal in a multiscale frequency power over time
> space (ideally an adaptive windowed spectrogram), but compress the audio
> using a transform that does better with respect to localization.
This sounds similar to what I'm proposing :-)
> > > Ideally what you want to model is the human perceptual response to signal.
> > > All we need to do is take a living human ear, and the appropriate 'chunk of
> > > brain', plug its output back into the computer to create a 'human ear
> > > transform'. :)
> >
> > That would be great, as you would get _very_ low bitrate; but the
> > problem would be the inverse transform :-(
>
> Not really, if you were able to do this, you could create an approximate
> set of transforms. I think what you would find is that one ear+brain differs
> too much from another, and after sufficient generalizing you wouldn't be much
> better off than we are now. :)
Not exactly. The frequency selectivity of the human ear + brain is so great
that we would need a transform of length 10^(very big) to approach it.
Our hearing is non-linear in the extreme (the brain actually mechanically
adjusts the cochlea to "pick out" tones), while our transforms are linear.
Cheers,
Segher