[theora-dev] My issues with ogg and directshow...

Sat May 8 12:14:37 PDT 2004

Listening to the meeting on granule pos tonight/today it became clear that
the issues everyone is concerned with for the most part don't affect my
implementations and the issues i have pretty much don't affect anyone
else... and in the cases where they overlap, the reasoning seems to be
different. And since everyone else has had a lot more time to consider all
these issues and i'm pretty new to this, it's a lot harder for me to make a
cogent argument on the fly. So i figure i'd spell out all the things i've
come across in my implementation, just to put them out there.

I'll just preface to say, that my experience in audio/video is probably
considerably less than most of the others working on this stuff, so if i
make false assumptions, am missing the point etc, then just tell me ! :)

DShow Background
==============
Directshow is a very structured media framework: there are specific
interfaces for communication and guidelines for how and when data can be
passed. It is however highly modular and flexible, which has let hundreds of
codecs be implemented with it.

Some background... there are a few major components: graphs, filters, pins,
samples and allocator pools. See www.illiminable.com/ogg/graphedit.html for a
look at what the graphs are like.

In order for two filters to connect, their pins need to offer certain media
types specifying the type of data and various parameters of the data (frame
rates, frame size, sample rate etc) depending on the media type.

Allocator pools exist between the connection of any two pins. An allocator
pool is a fixed number of fixed size samples. All data is passed through the
allocator pools. Before the user starts the graph (presses play), no data is
passed in the graph. When the user presses play the graph goes into pause
mode and data is pushed through the graph, filling up all the allocator
pools until all the threads are blocked; then the graph goes into play
mode. As the downstream end (renderers) pulls data out of the downstream
allocators it frees up a spot for an upstream filter, whose thread unblocks
and fills the space, and so on up the graph.
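
To make the blocking behaviour concrete, here is a rough sketch of an
allocator pool as a bounded queue of buffers. This is not the DirectShow API;
the class and method names (AllocatorPool, get_buffer, release) are made up
for illustration, and a real allocator hands out media samples rather than
bare byte buffers.

```python
import queue


class AllocatorPool:
    """Hypothetical stand-in: a fixed number of fixed-size sample buffers."""

    def __init__(self, sample_count, sample_size):
        self._free = queue.Queue(maxsize=sample_count)
        for _ in range(sample_count):
            self._free.put(bytearray(sample_size))

    def get_buffer(self, timeout=None):
        # The upstream thread blocks here while every buffer is in use.
        return self._free.get(timeout=timeout)

    def release(self, buf):
        # Called once the downstream side has consumed the sample.
        self._free.put(buf)


def pool_demo():
    pool = AllocatorPool(sample_count=2, sample_size=4096)
    first = pool.get_buffer()
    second = pool.get_buffer()
    try:
        pool.get_buffer(timeout=0.05)
        blocked = False
    except queue.Empty:
        blocked = True  # pool is full: upstream would sit blocked here
    pool.release(first)  # the renderer frees a slot...
    third = pool.get_buffer(timeout=0.05)  # ...and upstream can continue
    return blocked, len(third)
```

With a timeout of None the call simply blocks, which is the state every
upstream thread ends up in once the pools are full in pause mode.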

Directshow requires start and end times for all samples.

Demuxing
=======

Ok, so given that the graph has to be built before data is passed
downstream, there is a problem. How can the demuxer know what filters to
connect to (ie what the streams are) ? The demux needs to read ahead enough
to find the BOS pages. Now we know how many streams there are. How does it
know what kind of streams they are ? It has to be able to recognise the
capture patterns of every possible codec. So a "codec oblivious" demux is
already out of the question.
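
As a rough sketch of what that recognition step looks like, the demux ends up
carrying a table of the leading bytes of each codec's first packet. The three
patterns below are the usual ones for Vorbis, Theora and Speex; the table and
function names are made up for illustration.

```python
# Leading bytes of the first (BOS) packet for a few Xiph codecs.
CAPTURE_PATTERNS = {
    b"\x01vorbis": "vorbis",
    b"\x80theora": "theora",
    b"Speex   ": "speex",
}


def identify_codec(bos_packet):
    """Return a codec name for a BOS packet, or None if it is unknown."""
    for pattern, name in CAPTURE_PATTERNS.items():
        if bos_packet.startswith(pattern):
            return name
    return None
```

Any codec missing from the table comes back as None, so the demux has no
media type to offer for that stream until the table is updated.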

Let's look further downstream for the moment... we'll assume we have a vorbis
only stream. Now the directsound audio renderer won't connect to any decoder
unless it tells it the audio parameters, number of channels, sample rate etc
etc. Now if no data can flow in the graph yet, how can the decoder have seen
the header pages to know this ? It can't. This information is considered
part of the setup data. Hence the media parameters have to come from the
demux when it connected to the decoder, ie the media type the demux offers
is (Audio/Vorbis 2 channel 44100) for example.

So the demux has to be able to parse the BOS page headers to offer a useful
media type. So now the demux has to be able to not only identify the streams
but also know how to get at least the key information out of them. ie The
demux has to know how to parse the header of every possible codec header
format it will offer.
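
For Vorbis, pulling out the key parameters means knowing the layout of the
identification header: the \001vorbis tag, a 32 bit version, an 8 bit channel
count and a 32 bit sample rate, all little endian. A minimal sketch of that
parse follows; the function name and the returned dict are made up for
illustration, and it ignores the bitrate and blocksize fields entirely.

```python
import struct


def parse_vorbis_ident(packet):
    """Extract channel count and sample rate from a Vorbis ident header."""
    if len(packet) < 30 or packet[:7] != b"\x01vorbis":
        raise ValueError("not a Vorbis identification header")
    # version (uint32), channels (uint8), sample rate (uint32), no padding
    version, channels, rate = struct.unpack_from("<IBI", packet, 7)
    if version != 0 or channels == 0 or rate == 0:
        raise ValueError("unsupported or invalid Vorbis header")
    return {"channels": channels, "sample_rate": rate}


def example_ident_header():
    # Hand-built 30 byte header for a 2 channel 44100 Hz stream.
    return b"\x01vorbis" + struct.pack(
        "<IBIiiiBB", 0, 2, 44100, 0, 128000, 0, 0xB8, 1)
```

Every other codec needs its own version of this function inside the demux,
which is exactly the coupling being complained about.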

Now, why isn't this an issue with every other codec i assume you are
thinking ?

The main reason is that the header format of ogg codecs (ie vorbis headers,
speex headers etc) is completely arbitrary and defined completely by the
codec. That's good in a way that codecs can define whatever information they
want. But it's bad in the sense that your demux can't be as dumb as you'd
want. Other formats have at least portions of fixed header, where no matter
what the exact details of the codec, some core information can be guaranteed
to be found at fixed locations. And also codec identifiers are fixed (or at
least bounded) size and in fixed location. So you can do for example a
fourcc map of the identifier to a directshow media-type guid, and get the
key parameters from a fixed place. So all this information is available up
front, and the demux doesn't need to know any specifics of codec headers,
and it can handle new codecs without modification to the demux.
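
As an illustration of the fourcc approach, a four byte code can be turned
into a media subtype guid mechanically, with no per-codec knowledge. The
sketch below assumes the usual DirectShow convention of placing the fourcc,
read as a little endian 32 bit value, in front of a fixed guid tail; the
function name is made up.

```python
import struct

# Fixed tail shared by fourcc-derived media subtype guids.
FOURCC_GUID_TAIL = "0000-0010-8000-00AA00389B71"


def fourcc_to_guid(fourcc):
    """Map a 4 byte fourcc to its guid string, with no codec-specific code."""
    if len(fourcc) != 4:
        raise ValueError("a fourcc is exactly four bytes")
    (value,) = struct.unpack("<I", fourcc)
    return "{%08X-%s}" % (value, FOURCC_GUID_TAIL)
```

Because the identifier has a fixed size and position, a demux written this
way keeps working when a codec it has never heard of turns up.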

Incidentally this is all that OGM is, just an extra header before the codec
specific ones that contains this information. Similarly annodex for example
uses anxdata headers which preface each codec stream and contain
information like granule rate and codec identifiers in fixed locations of
bounded size.

The related issue is that of identifying streams... the codec identifier has
no bounds, there is no way to say this is the end of the identifier, and
this is the rest of the header. In other words \001vorbis is pretty much
indistinguishable from \001vorbis2. How can you tell if the 2 is part of the
identifier or the rest of the header ?
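
A small sketch of the ambiguity, using prefix matching the way a demux would.
The "vorbis2" codec is purely hypothetical, and the packets are cut-down
fragments rather than complete headers.

```python
def matches_identifier(bos_packet, identifier):
    # With unbounded identifiers, a prefix test is all the demux can do.
    return bos_packet.startswith(identifier)


# Start of a Vorbis header: tag, then the first bytes of the version field.
vorbis_packet = b"\x01vorbis" + b"\x00\x00\x00\x00"

# Start of a header from a hypothetical codec whose tag is "\x01vorbis2".
vorbis2_packet = b"\x01vorbis2" + b"\x00\x00\x00\x00"

# A header with the Vorbis tag whose next byte just happens to be 0x32,
# which is the ASCII code for "2".
odd_vorbis_packet = b"\x01vorbis" + b"\x32\x00\x00\x00"
```

Both vorbis_packet and vorbis2_packet match the \001vorbis prefix, and
odd_vorbis_packet matches the \001vorbis2 prefix, so neither test alone
settles which codec the stream belongs to.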

Time Stamps
=========

Directshow works in UNITS of 1/10,000,000 of a second, it knows nothing of
granule pos. When something like media player requests a seek or a position
request it wants these units. So the seek request comes into the graph. It
needs to be passed back to the demux being the only portion of the graph
with direct access to the data source. Now in order to seek in ogg, you need
granule pos, so again the demux needs to know how to make the conversion.
The decoding filters can't make this conversion, because they only know
about their granule pos, so even if they did convert and try to get the
demux to seek on this granule pos, it would restrict the available seeking
landmarks to only that codec. So again the demux needs that information
about "granule rate" in order to make the conversion for each codec it may
come across in its seek.
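
For a codec whose granule pos is a plain count (Vorbis, where it counts
samples), the conversion is a sketch like the one below. The function names
are made up, and granule_rate is assumed to be granules per second. It does
not cover codecs that pack more than one field into the granule pos, as
Theora does.

```python
UNITS = 10_000_000  # directshow time units per second


def granule_to_reftime(granule_pos, granule_rate):
    """Convert a linear granule pos into 1/10,000,000 second units."""
    return granule_pos * UNITS // granule_rate


def reftime_to_granule(ref_time, granule_rate):
    """Convert a seek target in directshow units into a granule pos."""
    return ref_time * granule_rate // UNITS
```

The demux needs one granule_rate per logical stream, which is the piece of
information it can only get by understanding each codec's header.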

Now after we seek, we hit a page we want to start from (and maybe go back a
bit to ensure we get a keyframe etc)... so when we scan back we find a new
starting point. Directshow now considers the time point it asked to seek to
as time 0. It doesn't want to know about absolute times.
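
In other words, every sample time sent downstream after a seek gets rebased
against the seek target. A minimal sketch, with a made-up function name and
all times in 1/10,000,000 second units:

```python
def rebase_to_seek(sample_start, sample_end, seek_target):
    """Shift absolute stream times so the seek target becomes time 0."""
    return sample_start - seek_target, sample_end - seek_target
```

Samples picked up while scanning back for a keyframe sit before the target,
so under this scheme their rebased times come out negative.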

So we are at a point a few pages before we want to start, and we have to make
sure we hit one page of every logical stream in order to get a landmark
granule pos. Now that's kind of ok for dense codecs... but what about sparse
ones ? With the end time stamp scheme we have to find at least one page of
every stream before we get our desired one.

Using the start stamp scheme we can resync as we hit a page. As we get a
page we know what time this page starts at, and we then have a reference
point to determine start and end times of every subsequent sample in that
stream. This means less seeking back.
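
A sketch of the difference, assuming the duration of each sample in a page
can be worked out (which is itself codec-specific). The function names are
made up; times and durations are in whatever unit the caller uses.

```python
def times_from_start_stamp(page_start, durations):
    """Start and end times can be emitted one sample at a time."""
    samples = []
    start = page_start
    for duration in durations:
        samples.append((start, start + duration))
        start += duration
    return samples


def times_from_end_stamp(page_end, durations):
    """Every duration in the page is needed before the first start is known."""
    first_start = page_end - sum(durations)
    return times_from_start_stamp(first_start, durations)
```

With a start stamp the first sample can go downstream as soon as it is read.
With an end stamp the whole page has to be walked first, and a page that
lacks a usable stamp gives no reference point at all.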

My personal preference for the timestamp scheme would be start timestamps
for all codecs. Assuming you want both start and end times, finding the end
given the start is much easier and more efficient than finding the start
given the end.

As for stream duration, i see no problem with having an empty EOS page which
has the end time in it.

But from the sounds of it, this isn't the general consensus.
================================================

Anyway... i've said my bit for now ! :) This is long enough for now and i'm
tired !

Zen.
