[ogg-dev] OggPCM2 : chunked vs interleaved data

Tue Nov 15 15:25:44 PST 2005

On 2005-11-16, Jean-Marc Valin wrote:

> Otherwise, what do you feel should be changed?

One obvious thing that seems to be lacking is the granulepos mapping. As 
suggested in the Ogg documentation, for audio a simple sample frame 
number ought to suffice, but I think the convention should still be 
spelled out.
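
To make the kind of convention meant here concrete, a minimal sketch in 
Python follows. It assumes (this is not taken from any OggPCM draft) that 
the granulepos of a page is the count of sample frames completed so far, 
in the style of Vorbis, and that packets hold whole interleaved frames. 
All function names are hypothetical.

```python
from fractions import Fraction

def frames_in_packet(payload_bytes, channels, bytes_per_sample):
    """Number of whole sample frames in an interleaved PCM payload."""
    frame_size = channels * bytes_per_sample
    if payload_bytes % frame_size:
        raise ValueError("payload does not hold a whole number of frames")
    return payload_bytes // frame_size

def granulepos_after(prev_granulepos, payload_bytes, channels, bytes_per_sample):
    """Assumed convention: granulepos = sample frames completed so far."""
    return prev_granulepos + frames_in_packet(
        payload_bytes, channels, bytes_per_sample)

def granulepos_to_seconds(granulepos, sample_rate):
    """Exact presentation time of a granulepos, as a rational number."""
    return Fraction(granulepos, sample_rate)
```

Under these assumptions a 4096-byte packet of 16-bit stereo advances the 
granulepos by 1024, and seeking to a time reduces to one multiplication 
by the sample rate.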

Secondly, I'd like to see the channel map fleshed out in more detail. 
(Beware of the pet peeve...) IMO the mapping should cover at least the 
channel assignments possible in WAVE files, the most common Ambisonic 
ones, and perhaps some added channel interpretations like "surround" 
which are commonly used but lacking in most file formats. (For example, 
THX does not treat surround as a directional source, so the correct 
semantics cannot be captured e.g. by WAVE files. Surprisingly neither 
can the fact that some pair of channels is Dolby Surround encoded, as 
opposed to some form of vanilla stereo.)

(As a further idea prompted by ambisonic compatibility encodings, I'd 
also like to explore the possibility of multiple tagging. For example, 
Dolby Surround, Circle Surround, Logic 7 and ambisonic BHJ are all 
designed to be stereo compatible so that a legacy decoder can play them 
as-is. But if they are tagged as something besides normal stereo, such a 
decoder will probably just ignore them. So, there's a case to be made 
for overlapping, preferential tags, one telling the decoder that the 
data *can* be played as stereo, another one telling that it *should* be 
interpreted as, say, BHJ, and so on. Object minded folks can think of 
this as type inheritance of a kind. But of course this is more 
food-for-thought than must-have-feature since nobody else is doing 
anything of the sort at the moment.)
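
The overlapping, preferential tags could be sketched roughly as below. 
This is only an illustration of the idea: the tag strings and the 
function name are made up, and nothing like this exists in any Ogg 
mapping.

```python
def pick_interpretation(tags, supported):
    """Return the first tag the decoder supports, or None.

    tags      -- interpretations of the same channel data, listed from
                 most preferred to least preferred (hypothetical names)
    supported -- set of interpretations this decoder implements
    """
    for tag in tags:
        if tag in supported:
            return tag
    return None

# A BHJ-encoded pair that is also declared playable as plain stereo.
stream_tags = ["ambisonic-bhj", "stereo"]

legacy_decoder = {"stereo"}
ambisonic_decoder = {"stereo", "ambisonic-bhj"}
```

Here `pick_interpretation(stream_tags, legacy_decoder)` falls back to 
`"stereo"`, while the ambisonic-aware decoder gets `"ambisonic-bhj"`. A 
decoder that knows neither tag gets `None` and can skip the stream.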

> Anyone wants to speak in support of chunked PCM?

Actually I'd like to add a general point against it. The chunked vs. 
interleaved question is an instance of the more general problem of 
efficiently linearizing a multidimensional structure. We want to do this 
so that typical access patterns (and in particular locality of access) 
translate gracefully and efficiently. Thus we group primarily by time 
(interleaving) when locality is by time (accessing a sample at a given 
sampling time makes it most likely that a sample with a nearby sampling 
time is accessed soon) and primarily by channel (chunking) when locality 
is by channel (accessing a channel makes it probable that the same 
channel is accessed again); we also try to preserve the rough order of 
access.
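
The two linearizations can be written down as index functions. The 
sketch below is illustrative only; the function names and the 
`chunk_frames` parameter (frames per chunk) are assumptions, not part of 
any proposal on this list.

```python
def interleaved_index(frame, channel, channels):
    """Offset (in samples) when grouping primarily by time."""
    return frame * channels + channel

def chunked_index(frame, channel, channels, chunk_frames):
    """Offset (in samples) when grouping by channel within each chunk."""
    chunk, within = divmod(frame, chunk_frames)
    return (chunk * channels + channel) * chunk_frames + within

def scan_order(index_fn, frames, channels):
    """Offsets touched by a time-ascending scan over all channels.

    index_fn takes (frame, channel) and returns a storage offset.
    """
    return [index_fn(f, c) for f in range(frames) for c in range(channels)]
```

For 2 channels and 4 frames, scanning the interleaved layout touches 
offsets 0, 1, 2, ..., 7 in order. With a chunk of 4 frames the same scan 
touches 0, 4, 1, 5, 2, 6, 3, 7, so a player has to jump between the two 
channel blocks on every frame.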

Ogg is primarily a streaming delivery application, so we usually access 
Ogg data by ascending time. Ogg does not support nonlinear space 
allocation or in-place modification, so editors, which are probably the 
most important application in need of independently accessible channels, 
will not be using it as an intermediate format in any case. We're also 
talking about multichannel audio delivery where the different channels 
are best thought of as part of a single multidimensional signal, not a 
library-in-a-file type collection of independent signals, so it can be 
argued that the individual channels do not really make sense in 
isolation. In this case access won't merely be localised in time, but in 
fact the natural access pattern for recorders, transmitters, players and 
even some filters is a dense, temporally ascending scan over some 
interleaved channel ordering.

If we think of Ogg as a line format, all this translates into lower 
packetization latency and memory requirements (buffer per multichannel 
stream vs. buffer per channel) for interleaved data; if we think of Ogg 
as a file format it translates into fewer seeks and less framing 
overhead while streaming from disk. In most cases a chunked layout has 
no countervailing benefits. Even interfaces which go with separate 
channels aren't such a good reason to offer a chunking option, because 
they were probably designed with some other application (like interactive 
gaming or offloading processing load onto a peripheral) in mind, or 
might simply be badly engineered (just about anything from MS).
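
The latency and buffering point above can be put in numbers. The 
following is a back-of-the-envelope sketch, assuming fixed-width samples 
and a chunked layout that stores `chunk_frames` samples of each channel 
back to back; the function name and parameters are hypothetical.

```python
def bytes_before_first_frame(channels, bytes_per_sample, chunk_frames=None):
    """Bytes a player must read before it has frame 0 of every channel.

    chunk_frames=None means interleaved; otherwise it is the number of
    frames each channel contributes to one chunk.
    """
    if chunk_frames is None:
        # Interleaved: the first frame is the first `channels` samples.
        return channels * bytes_per_sample
    # Chunked: the last channel's first sample sits behind a full block
    # of every preceding channel.
    return ((channels - 1) * chunk_frames + 1) * bytes_per_sample
```

For 16-bit 5.1 audio, an interleaved player can render after 12 bytes, 
while 1024-frame chunks require 10242 bytes to have arrived first. The 
recorder faces the mirror image: it has to hold back a block per channel 
before it can write anything out.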

Furthermore, if we really encounter an application which would benefit 
from grouping by channel (say, language variants of the same 
soundtrack), that can already be accomplished via multiple logical 
streams. In fact the multiplexing machinery is there for this precise 
purpose: the packet structure is a deliberate tradeoff between the 
temporal order always present in streaming files and the conflicting 
interest in limiting latency, error propagation and buffer consumption, 
brought on by parallelism, correlations and indivisibilities over 
dimensions other than time. If the channels are so independent of each 
other or so internally cohesive that chunking is justified, then they 
ought to be independent enough for standalone use and for placement in 
separate logical streams, or even separate files. Whatever 
interdependencies they might have ought to be exposed to the consumer 
via OggSkeleton or external metadata in any case. Thus whatever we want 
to accomplish by chunking is probably better accomplished by the broader 
Ogg framework, or by some mechanism besides Ogg altogether.

The only valid reason to chunk the data I can think of is bitrate 
peeling: chunking means that entire chunks/packets can be skipped to 
drop channels. But this clearly isn't the best way to go about peeling 
because, as I said, audio channels tend to be tightly coupled. We don't 
go from stereo to mono by cleaving off the right or left channel, but by 
summing, and if we simply drop a surround channel, we'll also break any 
multichannel panning law. Thus if we want to enable peeling, we have to 
use things akin to mid/side coding (like the UHJ hierarchy) or joint 
progressive coding over the entire set of channels (e.g. Vorbis's 
progressive vector quantization), and only then reorder and chunk the 
data. As a result this sort of stuff will always be encoding dependent 
and it shouldn't be specified at a higher level of generalization where 
the machinery could end up being used for the wrong sort of encoding 
(e.g. vanilla 5.1) and would impose its overheads (e.g. latency) 
indiscriminately.
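
A toy version of the stereo case shows why the reordering has to follow 
the coding. The sketch uses plain mid/side coding on lists of floats; it 
is not UHJ and not anything Vorbis does, and the function names are 
made up for illustration.

```python
def to_mid_side(left, right):
    """Re-express a stereo pair as sum (mid) and difference (side)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def from_mid_side(mid, side):
    """Inverse transform: recover left and right from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# A source panned hard left, one panned hard right, one in the centre.
left = [1.0, 0.0, 0.5]
right = [0.0, 1.0, 0.5]
mid, side = to_mid_side(left, right)
```

Peeling off `side` leaves `mid`, which is `[0.5, 0.5, 0.5]`: every source 
survives in the mono downmix. Peeling off `left` from the untransformed 
pair leaves `right`, in which the hard-left source is simply gone. Only 
after a transform like this does skipping a whole chunk become a sensible 
way to drop data.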

Not surprisingly, this is how it's already done in Ogg: at least Vorbis 
specifies that peeling is to be carried out by a codec-specific peeler 
operating within packets. The considerations which yielded this decision 
apply directly to an intermediate level abstraction like OggPCM (below 
Ogg multiplexing but also above a specific PCM coding like 16-bit big 
endian B-format), so I think incorporating a chunking option here would 
really represent a case of reinventing the wheel, square.

(Newbie intro: I'm a 27-year-old Finnish math/CS student and coder, with 
a long-term personal interest in both audio processing and external 
memory algorithms, yet without an open source implementation background. 
I joined the list after OggPCM was mentioned on sursound, so it's also 
safe to assume I'm an ambisonic bigot.)
-- 
Sampo Syreeni, aka decoy - mailto:decoy at iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2