[vorbis-dev] Ogg as container format

Tue Oct 2 01:18:08 PDT 2001

At 12:53 PM -0400 9/28/01, Monty wrote:
>On Thu, Sep 27, 2001 at 06:22:44PM -0700, Kevin Marks wrote:
>  > IFF-like formats have stood up very well over time because of the
>>  future compatibility built-in (the behaviour for unknown chunks is
>>  well-defined).
>
>IFF-like formats including more than one media type cannot be streamed
>as is.  All the media types are sequential.  Quicktime, at least at
>one time, was also like this.  I've not checked since just prior to QT
>4.0ish.

Not true. QT departed from IFF in separating the indexes into the 
data from the data itself, allowing the data to exist in the same or 
separate files, and allowing the data to be moved around. Even before 
1.0, QT had the notion of 'flattening' - re-interleaving the various 
media data into a new file in the order it was needed.

>The behavior of unknown types in Ogg is also well defined.

Good!

>  > > Quicktime was not intended for streaming use when it was invented.
>>  Remember, QuickTime is older than HTTP.   I can't fault them for not
>>  predicting the future.
>
>That's also perfectly fair. I wasn't faulting Apple; practically no
>one saw that coming.  I'm simply making the point that it wasn't.

As Steve says, a CD-ROM is a lot like an HTTP transfer. Streaming over 
RTP is different. On 1991-era CD-ROMs, a seek cost 200-500 ms, and 
the bitrate was limited to about 80 kBytes/second as async CD drivers 
didn't exist.

>  > QT's structure was picked up by the MPEG4 committee because of
>>  this robustness.
>
>It was chosen because of Apple's lobbying, mindshare, relative lack of
>licensing restrictions (only one entity has the patents) and because
>so much software already exists to support it.
>QuickTime, frankly, has more technical baggage than any other
>container format you can think of, and it has patent issues (just
>fewer than MPEG's own system streams).  There's no technically
>compelling reason to use Quicktime.

I wasn't at the relevant meetings myself, but my understanding is 
that the existing richness of QT was appreciated - most of the issues 
raised by a format as complex as MPEG-4 had already been covered and 
implemented in QT.

>  > QT defines the structure of a particular movie independently of the
>>  data - you can gain enough information to seek anywhere in the file
>>  by reading this movie header from the front.
>
>Quicktime until this year couldn't do VBR formats at all.  'Enough
>info in the header' simply means 'everything is the same size' and
>that's a liability, not a feature.  If you stick indexes in the
>header, the encoding must be two pass, also a liability if mandatory.

QT has handled VBR video since day one. The file format provides a 
way to represent VBR audio, though as noted it may not be efficient 
for low bitrates. However, the component nature of QT allows other 
ways to do things, such as a mediahandler or data handler to support 
other formats. An audio format like Vorbis that is unpredictable in 
length and duration of each chunk may require a new file format 
revision in future.

>  > In fact, this header can
>>  be completely independent of the media data, which is how QT is able
>>  to import so many other formats.
>
>..and yet there's no reason to do it this way.  That header has
>*nothing to do* with being able to import other formats.

Of course it does. The separation of index and data means that QT can 
import and edit DV files or AVI files or sequences of 2000 JPEG files 
in place, and read the actual data as needed.

>It also cannot really be streamed.  It has to be broken up and sent in
>multiple simultaneous, parallel streams with the sender seeking madly
>through the file to continue just-in-time delivery of the multiple
>media types.  Again, this is the way things were 1996-ish.  It may be
>different today.
>
>Quicktime was not intended for streaming use when it was invented.

No, but in 1.5 or so the flattening process described above meant it 
could be played from a CD-ROM at the full transfer rate of the device 
without seeking at all. The hint track notion (which I think is the 
patented part) provides a way to construct packets to stream over RTP 
by reading existing media data. You can have multiple alternative 
hint tracks to cover different bitrate or language versions. For 
example, you could do the mythical Vorbis packet truncation by having 
a full-rate hint track that points at the whole of each Vorbis 
packet, and another one that points at the first half of each packet.

There is a variant of this that has a copy of the media data in 
transmit order in the hint track - this is what you get when you 
'optimise for server'. In effect you are saying that Vorbis is like 
the 'optimise for server' variation only.

>  >  From what I can see of Ogg, everything is down in the stream
>>  structure, and the lacing values used for packet framing will
>>  introduce a lot of overhead for packets bigger than 1024 bytes.
>
>No, framing/paging is a constant .5%-1% overhead for large packets.
>The lacing/framing is designed the way it is for a reason (roughly
>constant overhead regardless of packet payload size).
>
>>  What is the point of making packets and pages independent, and having
>>  two parallel framing structures going on at once, with the
>>  concomitant problem of having to slice and dice the whole time?
>
>'Packets' are not a framing structure.  Only paging provides
>capture/framing.  If you look at the way things are set up, you'll
>notice that there's no duplicated functionality and that packets and
>pages are wholly orthogonal, asynchronous concepts in Ogg.
>
>Pages are a way of freezing packets of arbitrary size into a stream.
>
>You've obviously read the spec, you just need to think a little more
>about it.

Are the packets related to the encoding or not? For Vorbis, they seem 
to be, with a single packet corresponding to a frame of audio.

>  > You're going to have big trouble getting DV or uncompressed video
>>  into this structure.
>
>Bull.  Both were very much on my mind when I designed all this.
>
>>  Dv frames are 120000 bytes for NTSC and 144000
>>  for PAL. They are all the same size. To put these in Ogg you need 471
>>  & 565 lacing values per frame, and you need to add up these bytes to
>>  get the constant length.
>
>One would never put an entire frame in one packet.  One *could*, but
>that would be silly.  Think about why.  Others may feel free to chime in.

Here you seem to imply that the packets are to do with RTP networking 
units. For DV do you use the DV RFC to decide these? Or are they 
arbitrary? You have 2 size-limited structures overlaid on each other, 
neither of which necessarily corresponds to a fundamental data unit 
of the underlying media.
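
To make the arithmetic behind those lacing counts concrete, here is a minimal sketch in Python. It assumes the framing rule from the Ogg spec as I read it: each lacing value is one byte, a packet ends at the first value below 255, and a page header holds at most 255 lacing values. The function names are mine, and the page count assumes the packet starts on a fresh page.

```python
MAX_LACING = 255             # largest length one lacing byte can express
MAX_SEGMENTS_PER_PAGE = 255  # a page header carries at most 255 lacing values


def lacing_values(packet_len):
    """Return the lacing values that frame one packet of packet_len bytes."""
    full, rest = divmod(packet_len, MAX_LACING)
    # The final value is always below 255; for an exact multiple of 255
    # it is a terminating 0.
    return [MAX_LACING] * full + [rest]


def pages_spanned(packet_len):
    """Pages needed for one packet, assuming it starts on a fresh page."""
    count = len(lacing_values(packet_len))
    return -(-count // MAX_SEGMENTS_PER_PAGE)  # ceiling division


for name, size in (("NTSC", 120000), ("PAL", 144000)):
    values = lacing_values(size)
    # Summing the lacing values recovers the constant frame length.
    print(name, len(values), sum(values), pages_spanned(size))
```

Under these assumptions a whole DV frame in one packet cannot fit in a single page, which is the overlap between the two size-limited structures that I am asking about.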

<sarcasm>
You call QT old-fashioned, and then include hard-limited 64k page 
sizes? I'm having DOS flashbacks here. Looks like you have a great 
format for video, as long as it's 320x200 8-bit VGA Mode 13h.
</sarcasm>

>As for 471/565 lacing values, that's less than half a percent overhead
>for potentially very fine grained packetization (hint; if you're doing
>things right, each packet is about 200-500 bytes and the overhead is
>*still the same*).  Doing it any other way would kill us in overhead
>for *small* packets (like low bitrate audio) where packets are only
>40-50 bytes.

As Steve points out, big chunk DMA is very useful for high bitrate 
video, whereas most good networking implementations will do 
scatter/gather to construct packets. Where did you get those packet 
sizes? The usual bottleneck on RTP is the Ethernet frame size of 1460 
or so data bytes.
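
For comparing the packet sizes mentioned in this exchange, the framing overhead can be estimated with a rough sketch like the one below. It is only a model under stated assumptions: every packet is the same size, pages are filled to the full 255 lacing values, and the fixed part of a page header is 27 bytes, so each lacing value carries a 27/255 share of it. The function name is hypothetical, and real streams that flush pages early for latency will see more overhead than this.

```python
PAGE_HEADER_BYTES = 27       # fixed part of an Ogg page header
MAX_SEGMENTS_PER_PAGE = 255  # lacing values per page when pages are full


def framing_overhead(packet_len):
    """Approximate framing bytes per payload byte for equal-size packets."""
    lacing = packet_len // 255 + 1  # lacing bytes needed for one packet
    # Share of the fixed page header attributed to this packet, assuming
    # every page is packed with 255 lacing values.
    header_share = lacing * PAGE_HEADER_BYTES / MAX_SEGMENTS_PER_PAGE
    return (lacing + header_share) / packet_len


# Sizes taken from the thread: low-bitrate audio, the suggested 200-500
# byte range, an RTP-sized payload, and whole DV frames.
for size in (45, 300, 1460, 120000, 144000):
    print(f"{size:>6} bytes: {100 * framing_overhead(size):.2f}%")
```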

How does one seek in a Vorbis file with video in it and recover framing?

It looks like you skip to an arbitrary point and scan for 'OggS' then 
do a 64kB CRC to make sure this isn't a fluke. Then you have some 
packets that correspond to some part of a frame of video or audio. 
You recover a timestamp, and thus you can pick another random point 
and do a binary chop until you hit the timestamp before the one you 
wanted. Then you need to read pages until the timestamp changes and 
you have resynced that stream. Any other interleaved streams are 
presumably being resync'd in parallel so you can then get back to the 
read and skip framing. Try doing that from a CD-ROM.
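
The resync step of that procedure (scan for 'OggS', then check the CRC) can be sketched as follows. This is an illustration, not the libogg implementation. It assumes the 27-byte page header layout and the CRC parameters from the Ogg framing spec (polynomial 0x04C11DB7, MSB first, zero initial value, no final XOR, CRC field zeroed while summing). build_page and find_page are hypothetical helpers; the binary chop over timestamps would sit on top of find_page and is not shown.

```python
import struct

# capture, version, flags, granulepos, serial, sequence, crc, segment count
HEADER = struct.Struct("<4sBBqIIIB")
CRC_OFFSET = 22  # byte offset of the CRC field within the page header


def ogg_crc(data):
    """Bitwise CRC-32 with the parameters assumed above."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            crc = ((crc << 1) ^ 0x04C11DB7) if crc & 0x80000000 else (crc << 1)
            crc &= 0xFFFFFFFF
    return crc


def build_page(granulepos, serial, seq, payload):
    """Build a test page holding one packet small enough for a single page."""
    full, rest = divmod(len(payload), 255)
    lacing = bytes([255] * full + [rest])
    assert len(lacing) <= 255, "packet too large for one page"
    head = HEADER.pack(b"OggS", 0, 0, granulepos, serial, seq, 0, len(lacing))
    page = bytearray(head + lacing + payload)
    struct.pack_into("<I", page, CRC_OFFSET, ogg_crc(page))
    return bytes(page)


def find_page(buf, start=0):
    """Return (offset, granulepos, serial) of the first page whose CRC checks."""
    pos = buf.find(b"OggS", start)
    while pos != -1:
        if pos + HEADER.size <= len(buf):
            fields = HEADER.unpack_from(buf, pos)
            _, version, _, granulepos, serial, _, crc, nsegs = fields
            table_end = pos + HEADER.size + nsegs
            # The page body length is the sum of the lacing values.
            end = table_end + sum(buf[pos + HEADER.size:table_end])
            if version == 0 and end <= len(buf):
                page = bytearray(buf[pos:end])
                page[CRC_OFFSET:CRC_OFFSET + 4] = b"\0\0\0\0"
                if ogg_crc(page) == crc:
                    return pos, granulepos, serial
        # Not a real page (a fluke match); keep scanning.
        pos = buf.find(b"OggS", pos + 1)
    return None


prefix = b"junk OggS fake"  # contains a fluke capture pattern
data = prefix + build_page(12345, 7, 0, b"x" * 300) + b"tail"
print(find_page(data))
```

Note that every candidate match costs a header parse and, if the lengths are plausible, a CRC over up to a whole page before the reader knows whether it has really resynced.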

If you happen to chain 2 files that use the same stream serial 
number, you are hosed, as you'll get packets that belong to the wrong 
stream when doing this. To prevent this, a server will have to 
rewrite each page header with a unique stream number as it goes out.

>  > > There's nothing Quicktime does that Ogg cannot.  The difference is
>>  > that Ogg is doing it all at rev 0.
>  > QuickTime (and other formats like it) are very good at editing without
>  > moving lots of data around.  That indirection that you complain about
>  > saves lots of time.  Especially when it comes to video editing.
>
>Ah, but we're not arguing about general purpose editing/streaming
>container formats.  We're arguing about transport streams, designed
>specifically as 'streams frozen in place'.

That's known as bait and switch. I AM arguing about general purpose 
container formats in response to Michael's original comment:

At 8:34 PM +1000 9/26/01, Michael Smith wrote:
>rm goal is for ogg to be a generic media container format in
>the same way as riff (avi/wav), qt, and so on are. Except better,
>hopefully ;-)

Now you say:

>Ogg is designed to do one
>thing and do it well.  Quicktime was designed as a
>swiss-army-kitchen-sink, not the ultimate streaming container.  Ogg is
>a streaming container, period.

So there IS lots that QT can do that Ogg can't. Editing, skip to time 
with a single seek, DMA-friendly high-rate video, variable bitrate 
streaming.
>
>>  I think there is a great reason for keeping entire frames contiguous in
>>  the file format.  Hardware acceleration.  Disk controllers do DMA very
>>  well these days.  They understand things like put this big chunk of bytes
>>  over there.  One could even conceive that the DMA goes directly to the
>>  DV decoder on the PCI or other bus, completely bypassing main memory.
>
>Contiguous, yes, definitely.  But not in one packet.

How can they be contiguous if the lacing values are in between the 
packet data in the pages, and the pages can't be more than 64k?

>
>>  It comes down to this.  If your world is HTTP streaming
>
>General unicast and multicast streaming

RTP streaming then. HTTP does not need packet boundaries in the file. 
It is a stream in the filestream sense instead.

>  > either live or
>>  with very little interactive editing other than switching between streams,
>>  then Ogg has advantages over QuickTime.  Once you step out of that world,
>>  then QuickTime has advantages over Ogg.
>
>OK, we agree on this bottom line.  We just needed to agree vehemently
>first :-) Ogg transport streams are most definitely intended for
>finished-product streaming.

You mean only for streaming over a packet-based protocol.
