[vorbis-dev] DWIT

Sat Jan 13 18:46:36 PST 2001

Frank Hale wrote:

> > BUT, it's not clear to me what you want to do, and what
> > you really need
>
> Well I would just like to get an overview of what encoding audio is all
> about.

Ah, that much is straightforward, and requires no maths at all. Basically, you
compress audio in one of two ways. You either leave out what the ear cannot
detect, or you don't bother to code things which the source cannot generate. In
practice this generally leads to two classes of audio compressor. For general
coding of arbitrary sources of sound, such as music, you can only work on the
limitations of the ear. MP3 and Vorbis are like this. For many applications, such
as cellphones, you will only normally encode a single voice speaking. The voice
generates a much more limited range of sounds than, say, the London Symphony
Orchestra, so you can compress harder.

The majority of voice coders (vocoders) are actually based on a mechanical model
of the human vocal tract. This is then modelled mathematically. On its own, this
sounds rather robotic. It's the basis on which cheap children's toys synthesise
speech. To fix this, a number of elaborate error correction schemes have been
devised. These have names like RPE, or almost anything ending in ELP. Most of the
MIPS are burned in this error correction part, rather than the basic vocal tract.
Voice coding then becomes a form of analysis by synthesis. You start with a rough
estimate of how to stimulate the model to make the sound you are trying to
encode, and then iterate to refine it. The result can be quite good for rates
down to about 4kbps. The main disadvantage is that nothing but a single human
voice codes well. Using a digital cellphone in a noisy environment sounds awful,
as the background noise doesn't code well. Sing into your phone and it sounds OK.
Sing with a friend and it sounds awful, whether or not your singing really sounds
awful. There are other less popular techniques for voice coding, such as IMBE,
but the basic idea is the same - don't waste bits trying to code what a vocal
tract cannot produce.
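
As a rough illustration of analysis by synthesis, here is a minimal Python
sketch. It is not RPE, CELP or any real vocoder: the "vocal tract" is just a
fixed two-pole resonator, the excitation is a plain impulse train, and the names
synthesize and encode, the filter coefficients and the pitch search range are
all made up for the example. The encoder simply tries each candidate pitch
period, synthesizes the result, and keeps whichever comes closest to the target.

```python
def synthesize(period, gain, n, a1=1.3, a2=-0.8):
    """Drive a fixed two-pole resonator (a hypothetical stand-in for the
    vocal tract model) with an impulse train of the given period and gain."""
    out = []
    y1 = y2 = 0.0
    for i in range(n):
        excitation = gain if i % period == 0 else 0.0
        y = excitation + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def encode(target, periods=range(20, 81)):
    """Analysis by synthesis: synthesize every candidate pitch period,
    fit the best gain for it, and keep the closest match to the target.
    Returns (period, gain, squared_error)."""
    best = None
    for period in periods:
        unit = synthesize(period, 1.0, len(target))
        energy = sum(u * u for u in unit)
        # Least-squares gain for this candidate (projection onto its output).
        gain = sum(t * u for t, u in zip(target, unit)) / energy
        error = sum((t - gain * u) ** 2 for t, u in zip(target, unit))
        if best is None or error < best[2]:
            best = (period, gain, error)
    return best


# One "voice" that the model can generate, and two voices mixed together.
solo = synthesize(37, 0.5, 160)
duet = [a + b for a, b in zip(solo, synthesize(53, 0.5, 160))]

solo_period, solo_gain, solo_error = encode(solo)
duet_period, duet_gain, duet_error = encode(duet)
print("solo:", solo_period, solo_gain, solo_error)
print("duet:", duet_period, duet_gain, duet_error)
```

The single voice is something the model can produce, so the search recovers its
parameters and the residual error is negligible. The mixture of two voices is
not, so whatever single pitch the encoder settles on leaves a large error, which
is the "sing with a friend" problem in miniature.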

Naturally, systems like Vorbis or MP3 cannot have the limitations of a vocoder.
They have to be based on not wasting bits coding what the ear cannot detect.
People have been using simple forms of this since the 1950s. For example, you
can't hear quiet sounds well if they are masked by loud ones. The telephone
network compresses 96kbps down to 64kbps, by a pseudo-logarithmic compression of
the audio waveform, to end up with about 30dB signal-to-noise ratio regardless of
the volume of the sound. From the 1970s, NICAM does something similar. This is
not a very efficient technique, but it is extremely simple to do, and 1958
technology could do it. Now we have the compute power to be much more aggressive.
There are many things the ear cannot do well, and by allowing for all of them you
can compress good (if not high) quality music to a fairly modest bit rate. Of
course, being based on the limitations of our own ears, some other animals might
feel the result is very lo-fi! There are a number of well known characteristics
of the ear which can be exploited - we have much better frequency resolution at
low frequencies than at high frequencies; we have almost no direction sensing for
very low frequencies (not useful when compressing mono); we are far more annoyed
by unwanted even harmonics than unwanted odd harmonics; our absolute frequency
detection is poor, but our relative detection is good (some people seem to have
perfect pitch detection, so this doesn't always work). The list is quite long.
Current compressors exploit a number of limitations, but none is yet clever
enough to exploit them all. That leaves lots of room for fun research!
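
The telephone-style logarithmic compression mentioned above is simple enough to
sketch in a few lines of Python. This is only an illustration under some
assumptions: it uses the continuous mu-law formula with mu = 255 rather than the
segmented approximation real telephone codecs use, it assumes 8000 samples per
second (so the quoted 96kbps and 64kbps would correspond to 12 and 8 bits per
sample), and the function names are invented for the example. It compares the
signal-to-noise ratio of 8-bit companded and 8-bit linear coding for loud and
quiet sine waves.

```python
import math

MU = 255  # assumption: the usual mu-law constant


def compress(x, mu=MU):
    """Map a sample in [-1, 1] onto a logarithmic scale, still in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def expand(y, mu=MU):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize(y, bits=8):
    """Round a value in [-1, 1] to the centre of one of 2**bits equal steps."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(levels - 1, int((y + 1.0) / step))
    return -1.0 + (index + 0.5) * step


def codec(x, companded=True):
    """8-bit round trip, with or without the logarithmic compression."""
    if companded:
        return expand(quantize(compress(x)))
    return quantize(x)


def snr_db(amplitude, companded=True, n=1000):
    """Signal-to-noise ratio of a sine wave after the 8-bit round trip."""
    signal = noise = 0.0
    for i in range(n):
        x = amplitude * math.sin(2 * math.pi * 7 * i / n)
        e = codec(x, companded) - x
        signal += x * x
        noise += e * e
    return 10 * math.log10(signal / noise)


for amp in (1.0, 0.1, 0.01):
    print(f"amplitude {amp:<5} companded {snr_db(amp):5.1f} dB"
          f"  linear {snr_db(amp, companded=False):5.1f} dB")

SAMPLE_RATE = 8000  # assumption: telephone-band sampling rate
print(SAMPLE_RATE * 12, "bps ->", SAMPLE_RATE * 8, "bps")
```

With companding the signal-to-noise ratio should stay roughly level as the sine
gets quieter, whereas with plain linear 8-bit coding it falls away quickly. The
exact figures depend on the formula and test signal chosen here, so they need
not match the number quoted for the real network.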

Regards,
Steve
