[vorbis-dev] AMBISONIC critique

Mon Aug 14 21:47:30 PDT 2000

David Carter wrote:

>
> Thomas,
>
> Although Ambisonics allows for true 3D surround, it does not require it.  If
> you use the full four channels, you get 3D surround.  You can use three channels
> and get 2D (planar) surround, or you can use two channels and get M/S stereo.
> (The trivial case also gives mono with one channel, as you would expect.)  In
> Ambisonic B-format terms, spherical surround uses the WXYZ channels, planar
> uses WXY, M/S stereo uses WY, and mono-only uses W.  (M/S stereo is commonly
> used in FM radio transmission, as well as some MP3 joint stereo modes.)
>
> (This also allows for easy compatibility with lesser setups than the track
> was made for -- if you want to play a spherical track on stereo speakers,
> simply ignore the X and Z channels.)
>
> Since M/S stereo (the technique planned for channel coupling) is exactly the
> same as 2-channel Ambisonic B-format sound, there seems to be little reason
> to make a separate spec for M/S stereo.  I'm not sure whether Monty is
> intending for M/S tracks to be tagged as 2-channel Ambisonic or not, but that
> doesn't matter a whole lot.  As a surround format, Ambisonics seems to have
> considerable advantages, especially on the patent front.  (The relevant
> patents seem to have expired several years ago.)
>
> I'm not sure what you mean by "color blind" -- could you elaborate on that as
> well as why the WXYZ separation should be frequency-dependent?  If you would
> also elaborate on your other reasons (if any) for disliking Ambisonics, I'd
> like to discuss them further.
>
>         David
>
> Layout of Ambisonic channels:
>
> ---  Front of room ---
>   ^
>   |  <--Y-->   Z (up/down)
> X |
>   v            W (omnidirectional)
>
> ---  Back of room  ---
>
> --
> David Carter ** dcarter at sigfs.org ** dcarter at visi.com
> PGP Key 581CBE61: E07EE199C767C752 8A8B1A9F015BF2EA
> Key available by finger or www.keyserver.net
>
> --- >8 ----
> List archives:  http://www.xiph.org/archives/
> Ogg project homepage: http://www.xiph.org/ogg/

Dear David;

   (BTW, I go by Marshall...)

Since you asked (a dangerous thing to do to a physicist : )

Some preliminaries :

The speed of sound in dry air is 331.45 * (T / 273.15 K)^(1/2) m/sec, or
about 345 m/sec in a room.

This means that at 1 kHz, the wavelength is 1.1 feet (0.345 m), or about the size of your
head.
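
As a quick check of these numbers, here is a minimal Python sketch of the
square-root temperature scaling. The function names are mine, the reference
temperature is taken as 273.15 K (0 C), and the 296 K room temperature is an
assumption chosen to match the "about 345 m/sec" figure.

```python
import math

def speed_of_sound(temp_kelvin):
    """Approximate speed of sound in dry air (m/s), scaling as sqrt(T)."""
    return 331.45 * math.sqrt(temp_kelvin / 273.15)

def wavelength(freq_hz, temp_kelvin=296.0):
    """Wavelength in meters; 296 K (about 23 C) is an assumed room temperature."""
    return speed_of_sound(temp_kelvin) / freq_hz

if __name__ == "__main__":
    # Frequencies discussed in this message: 200 Hz, 1 kHz and 10 kHz.
    for f in (200.0, 1000.0, 10000.0):
        lam = wavelength(f)
        print(f"{f:7.0f} Hz: {lam:.3f} m ({lam / 0.3048:.2f} ft)")
```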

The ear/head system mostly relies on phase differences for localization (i.e., your
head is a phase interferometer, which is much more efficient than an
intensity interferometer, of which more later).

At 200 Hz (wavelength = 1.73 meters or 5.66 feet) and below,
the ear does not seem to be able to localize any more.
This implies an interferometric phase accuracy of
about 0.2 cycles at 200 Hz. VERY roughly, this phase accuracy seems to be
constant with frequency up to about 10 kHz, yielding an interferometric angular
accuracy at 5 kHz of about 3 degrees. What happens above 10 kHz?
At that frequency, the wavelength is only 3.5 cm (1.4 inches), much smaller
than your head, and the interferometer loses "phase coherence" (i.e., the difference
in phase between your ears from random ambient sounds
is simply too random to make use of).
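
A rough way to see where an angular figure of this size comes from: in the
small-angle limit, a phase error of dphi cycles over a baseline d corresponds
to an angular error of about dphi * wavelength / d radians. The sketch below
assumes a 0.2 m ear-to-ear baseline and a 345 m/sec sound speed; both are
illustrative assumptions, not measured values, so it only shows the order of
magnitude.

```python
import math

SPEED_OF_SOUND = 345.0   # m/s, assumed room value
EAR_BASELINE = 0.2       # m, assumed ear-to-ear distance

def angular_accuracy_deg(freq_hz, phase_accuracy_cycles=0.2,
                         baseline_m=EAR_BASELINE):
    """Small-angle estimate of interferometric angular accuracy, in degrees."""
    wavelength = SPEED_OF_SOUND / freq_hz
    path_error = phase_accuracy_cycles * wavelength   # meters
    return math.degrees(path_error / baseline_m)

if __name__ == "__main__":
    for f in (1000.0, 5000.0, 10000.0):
        print(f"{f:7.0f} Hz: about {angular_accuracy_deg(f):.1f} degrees")
```

With these assumptions the 5 kHz value comes out at a few degrees. At 200 Hz
the path error (about 0.35 m) exceeds the assumed baseline, so the small-angle
estimate no longer applies there, which fits the absence of localization at
low frequencies.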

When phase coherence is lost, an intensity interferometer is still possible, and that
is exactly what the ear/head system does - only relative amplitude differences
are used to localize. This is much less efficient (accuracy goes as the square
root of 1/SNR, versus 1/SNR itself), but it's what's possible at high frequencies.

So there are 3 rough frequency ranges for localization:

20 Hz  - 200 Hz         - no localization
200 Hz - 8 to 10 kHz    - phase interferometry
8 kHz  - 20 kHz         - intensity interferometry

This is not news, of course. That's why the woofer for my stereo is under my
couch - I can't really tell where it is. That's also why joint stereo exists for
MP3 - if you can't use phase, only amplitude, to localize, send one combined
signal and amplitude modulate it for the stereo channel. This takes fewer bits -
in principle, not much more than for one channel mono at those frequencies. As this
frequency band has a lot of bandwidth, this is a good thing.
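
The "one combined signal plus amplitude" idea can be sketched as follows. This
is a simplified illustration of the general principle for a single
high-frequency band, not the actual MP3 bitstream procedure; the function
names and the choice of (L + R) / 2 as the combined signal are my assumptions.

```python
import math

def intensity_encode(left, right):
    """Collapse one high-frequency band to a shared signal plus two gains."""
    mono = [(l + r) / 2.0 for l, r in zip(left, right)]
    e_mono = sum(m * m for m in mono)
    if e_mono == 0.0:
        # Silent (or fully cancelling) band: nothing to scale.
        return mono, 0.0, 0.0
    # Gains restore each channel's energy relative to the shared signal.
    gain_l = math.sqrt(sum(l * l for l in left) / e_mono)
    gain_r = math.sqrt(sum(r * r for r in right) / e_mono)
    return mono, gain_l, gain_r

def intensity_decode(mono, gain_l, gain_r):
    """Rebuild both channels by scaling the shared signal; phase is discarded."""
    return [gain_l * m for m in mono], [gain_r * m for m in mono]
```

Only the shared signal and two numbers per band are sent, which is why the
cost approaches that of one mono channel. Any phase difference between the
channels is lost, which is acceptable only where the ear is not using phase.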

Note that all of this is SEPARATE from the "masking" that depends on how our ears work. There
may be some stereo non-linearities (stereo masking) but I know of
no system that uses it.

Now let's look briefly at conventional practice. Your ears are located in a plane,
more or less, so most sound systems concern themselves
with a 2-D representation. In general,
most musical performances, from Vivaldi to Pearl Jam, plus plays, speeches, etc.,
come from the front, so stereo speakers are not at right angles, but moved towards the front.
In a real performance in a real place, there will be reflected sound from
above and the sides, etc.

Surround sound in theaters is intended for two purposes :
1.) So that the speech is better localized at the screen, even if you don't sit in the
center of the auditorium (i.e., the "sweet spot" is expanded).
2.) To have the occasional sound come from "behind" (like the creak of a door in a
thriller).

In a home stereo, #1 is not thought to be so important, but #2 (for
reflected sounds) is. Reflected sounds have lower SNR, so the "surround"
part of the sound does not take as many bits.

So, the conventional practice skews the coordinate system forward, and reserves
most of the bits for the conventional stereo.

Now - what is ambisonics ?
(See http://www.york.ac.uk/inst/mustech/3d_audio/ambis2.htm )

It's based on multipole expansions, which arise in electrostatics, gravitational field
expansions, etc. The pressure field (or any scalar field) can be expanded in terms
of spherical harmonics (AKA a multipole expansion).  Note that this is
really an expansion on the surface of a sphere (which might be at "infinity," or
the "unit sphere" at some nominal distance).
Now, the sounds we are presenting are varying in time, so there is really a separate
(complex valued) multipole expansion for each frequency. This can be lumped together
into time varying quantities W, X, Y and Z, which correspond to
the P_00, P_11_c, P_11_s and P_10 terms in the spherical harmonic
expansion, with P_00 being the monopole, the P_1x terms being the dipole, and
the P_10 term (the z term of the dipole) capturing the up-down part of the dipole
field. These terms together are the so-called B format. It seems that the
Z or P_10 term is often dropped.
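
To make the W, X, Y, Z quantities concrete, here is a minimal sketch of
encoding a single mono source arriving from a given direction. The angle
conventions (azimuth counterclockwise from straight ahead, elevation upward)
and the 1/sqrt(2) weight on W follow the traditional B-format convention as I
understand it; they are not spelled out in this thread, so treat them as
assumptions.

```python
import math

def encode_b_format(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample arriving from a direction into (W, X, Y, Z)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                # omnidirectional (monopole)
    x = sample * math.cos(az) * math.cos(el)   # front-back dipole
    y = sample * math.sin(az) * math.cos(el)   # left-right dipole
    z = sample * math.sin(el)                  # up-down dipole
    return w, x, y, z

if __name__ == "__main__":
    # A source straight ahead, one hard left, and one directly overhead.
    for az, el in ((0.0, 0.0), (90.0, 0.0), (0.0, 90.0)):
        print(az, el, encode_b_format(1.0, az, el))
```

Dropping the last element of the tuple gives the planar (WXY) case.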

Now, some comments :

1.) The expansion breaks down after the dipole. There are 5 second order multipole
("quadrupole") terms, and no
real reason to favor one over another (except that you might drop the P_20 term).
It seems excessive to have to add 4 or 5 channels to get any more functionality.

2.) In light of the above frequency division based on the size of our head :

at f < 200 Hz, you only need the W term, which is OK

at 200 Hz < f < 10 kHz, you need W, X and Y to obtain the same functionality
which you get from right and left stereo. The location of the W speaker is
problematic at these frequencies (where you place it DOES count). The phase
matters a lot here, as here is where we use phase to localize.

at f > 10 kHz, it is not clear how you are to implement "joint stereo" and save bits
accounting for the ear's intensity interferometry. The location of the W speaker
is still problematic, but not as badly.

3.) In NO case will you have the localization ability of the 5 channel or even 4 channel
Dolby scheme (for a similar total bit rate).

4.) The channels (W, X, Y and Z) are NOT loudspeakers located at a point,
but a particular sound distribution over space. How to get these from real
loudspeakers in general is very unclear to me. A particular type of
microphone or loudspeaker is being assumed (from the above web page) :

"These [X, Y and Z] signals are equivalent to
three figure-of-eight microphones at right angles to each other, together with an
omnidirectional unit, all of which have to be effectively coincident over the frequency range of
interest"

This reliance on a particular transducer is BAD. If you don't have these, what
sort of sweet spot will you have? (You can always make things work at a point.)
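
To put numbers on comment 1.): a spherical harmonic expansion has 2l + 1 terms
at order l, so going from first to second order adds 5 channels on top of the
existing 4. A minimal sketch (the function names are mine):

```python
def terms_at_order(order):
    """Number of spherical harmonic terms at a given order l: 2l + 1."""
    return 2 * order + 1

def channels_up_to(order):
    """Total channels for a full expansion through the given order."""
    return sum(terms_at_order(l) for l in range(order + 1))

if __name__ == "__main__":
    for l in range(4):
        print(f"order {l}: {terms_at_order(l)} new terms, "
              f"{channels_up_to(l)} channels in total")
```

The running total is (l + 1)^2, so each extra order costs more channels than
the one before it.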

In summary, I simply do not think that the Ambisonic scheme is particularly efficient, nor does
it scale. It seems way too mathematically rote to me, not tuned to the actual
physics of the situation.

                                   Regards
                                   Marshall Eubanks

P.S. I apologize for the length, but it's hard to get this kind of stuff
across in a short e-mail. I apologize for not sending this out in mid July. See
http://www.xiph.org/archives/vorbis-dev/0859.html
for the discussion back then.

   T.M. Eubanks
   Multicast Technologies, Inc.
   10301 Democracy Lane, Suite 410
   Fairfax, Virginia 22030
   Phone : 703-293-9624
   Fax     : 703-293-9609

   e-mail : tme at on-the-i.com     tme at multicasttech.com

 http://www.on-the-i.com         http://www.buzzwaves.com
