[theora-dev] My issues with ogg and directshow...

illiminable ogg at illiminable.com
Mon May 10 17:40:53 PDT 2004



----- Original Message -----
From: "Ralph Giles" <giles at xiph.org>
To: <theora-dev at xiph.org>
Sent: Tuesday, May 11, 2004 12:21 AM
Subject: Re: [theora-dev] My issues with ogg and directshow...

> On Sun, May 09, 2004 at 03:14:37AM +0800, illiminable wrote:
>
> > Listening to the meeting on granule pos tonight/today it became clear
> > that the issues everyone is concerned with for the most part don't
> > affect my implementations and the issues i have pretty much don't
> > affect anyone else... and in the cases where they overlap, the
> > reasoning seems to be different. And since everyone else has had a lot
> > more time to consider all these issues and i'm pretty new to this,
> > it's a lot harder for me to make a cogent argument on the fly. So i
> > figure i'd spell out all the things i've come across in my
> > implementation, just to put them out there.
>
> Thanks for putting this together, Zen. It's really nice to have a solid
> introduction to the issues from someone experienced with the framework.
>
> > Allocator pools exist between the connection of any two pins. An
> > allocator pool is a fixed number of fixed size samples.
>
> I can see how this works for fixed-bitrate codecs (and most
> uncompressed media, of course). Does one just use 'really big buffers'
> for vbr data?
>

One makes a best guess and then adds some :)

As an example, the outputs of the demux each have 3 allocated samples of
size ~65000... this is sufficient for 99.99% of cases; in reality i would
probably be safe with 8000 or 16000, but 2^16 is a much safer bet. Though
none of the current codecs need it, if a particular decoder filter knows
its codec needs to pass samples bigger than this, as part of the
connection process it will request that the demux allocate larger
buffers. This is also another reason why being able to read the video
frame size easily from the header is beneficial.

Because in any AV graph, the largest samples are going to be the raw RGB
or YUV coming out of a video decoder into a video renderer. These are of
fairly predictable size if you know the output format and frame size.
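
Roughly, this negotiation is what DirectShow's DecideBufferSize hook is
for. A minimal sketch using the DirectShow base classes (the class name
and the 3 x 65536 defaults are just the figures above, nothing
normative):

#include <streams.h>  // DirectShow base classes (CBaseOutputPin etc.)

HRESULT COggDemuxOutputPin::DecideBufferSize(IMemAllocator* pAlloc,
                                             ALLOCATOR_PROPERTIES* pRequest)
{
    // Honour a larger request from the downstream decoder if it made
    // one, otherwise fall back to "best guess and then add some".
    if (pRequest->cBuffers < 3)     pRequest->cBuffers = 3;
    if (pRequest->cbBuffer < 65536) pRequest->cbBuffer = 65536;
    if (pRequest->cbAlign == 0)     pRequest->cbAlign = 1;

    ALLOCATOR_PROPERTIES actual;
    HRESULT hr = pAlloc->SetProperties(pRequest, &actual);
    if (FAILED(hr))
        return hr;

    // The allocator may give us less than we asked for; fail the
    // connection rather than risk samples that don't fit.
    if (actual.cBuffers < pRequest->cBuffers ||
        actual.cbBuffer < pRequest->cbBuffer)
        return E_FAIL;

    return S_OK;
}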

> > Directshow requires start and end times for all samples.
>
> And you've succeeded in calculating this for all our codecs?
>

Yes, kind of... but it's a pain! I have to keep track of the end time of
previous passing pages and then maintain a local count of frames or
samples processed since the last page granule pos. Which is no biggy for
a single stream... in fact it could (not ideally) ignore granule pos and
just use its own frame count.
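
For a fixed-rate video stream the bookkeeping looks something like this
(a sketch; all the names are illustrative, not my actual filter code):

#include <streams.h>  // IMediaSample, REFERENCE_TIME

void StampSample(IMediaSample* pSample,
                 LONGLONG lastPageEndFrame,     // frame count implied by
                                                // the last page granule pos
                 LONGLONG framesSinceLastPage,  // local count since then
                 LONGLONG fpsNum, LONGLONG fpsDen)
{
    const LONGLONG UNITS = 10000000;            // 100 ns DirectShow units
    REFERENCE_TIME frameDuration = (UNITS * fpsDen) / fpsNum;

    REFERENCE_TIME start =
        (lastPageEndFrame + framesSinceLastPage) * frameDuration;
    REFERENCE_TIME end = start + frameDuration;

    // DirectShow wants both stamps on every sample.
    pSample->SetTime(&start, &end);
}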

A big problem is that when you seek, you lose all that past page
information, so in order to successfully continue to do this, you have
to seek to a point such that you pass at least one page of every stream
to get that reference time again. At the moment in my implementation
(which doesn't work), when you seek in theora, vorbis just assumes that
its next sample starts when theora's first sample did... which is not
valid. If the times were start times, you would only have to satisfy the
key frame and overlap conditions and not have to see one page of every
logical stream to resync. Which would be a particular problem with a
sparse codec.

Take this example with three logical streams A, B and C:

AACBABABBAABBAABABCBB

Let's say we want to seek to the last A. Let's assume the simplest case
where no codec needs any preroll.

So we arrive at A and we have no idea what time the data we are about to
get starts at... so we seek back (incidentally the most inefficient of
all operations we can perform on the stream) until we find an A two
pages back.

Depending how we back seeked, we may or may not know that we back seeked
over a B page. So if we just did a jump back and scan forward we
probably don't, so for all we know a B and a C could lie between this A
and the A we want to start from... so we scan forward page by page back
up to where we started from and notice we also hit a B page... but not a
C page yet. So we jump back further and seek forward again.

Alternatively we scan back byte by byte so we can see each page as we
back seek over it. Basically just doing a linear search back through the
file.

Either way, if we try and play now, only having seen a previous A and B,
when we hit that last C we are stuffed. We have no reference to associate it
to the times in the other streams.

So before we play we have to keep back seeking until we find a C page as
a reference point. Effectively in this case we back seeked almost the
entire file to find that C page.
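
In code the termination condition is the killer (a sketch; OggFile and
OggPage are illustrative stand-ins, not a real API):

#include <set>

void ResyncAllStreams(OggFile& file, long long pos,
                      const std::set<long>& allSerials)
{
    std::set<long> seen;
    // With end-time stamps we must keep walking back until we have seen
    // a page from *every* logical stream; with start times, the first
    // page we land on would anchor its own stream immediately.
    while (seen.size() < allSerials.size() && pos > 0) {
        pos = file.SeekBackToPreviousPage(pos);  // the expensive operation
        seen.insert(file.ReadPageAt(pos).SerialNo());
    }
    // For the sparse stream C above, this loop walks back nearly the
    // whole file before it terminates.
}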

> > Ok, so given that the graph has to be built before data is passed
> > downstream, there is a problem. How can the demuxer know what filters
> > to connect to (ie what the streams are)? The demux needs to read
> > ahead enough to find the BOS pages. Now we know how many streams
> > there are. How does it know what kind of streams they are? It has to
> > be able to recognise the capture patterns of every possible codec. So
> > a "codec oblivious" demux is already out of the question.
> >
> > Lets look further downstream for the moment... we'll assume we have a
> > vorbis only stream. Now the directsound audio renderer won't connect
> > to any decoder unless it tells it the audio parameters, number of
> > channels, sample rate etc etc. Now if no data can flow in the graph
> > yet, how can the decoder have seen the header pages to know this? It
> > can't. This information is considered part of the setup data. Hence
> > the media parameters have to come from the demux when it connected to
> > the decoder, ie the media type the demux offers is (Audio/Vorbis 2
> > channel 44100) for example.
> >
> > So the demux has to be able to parse the BOS page headers to offer a
> > useful media type. So now the demux has to be able to not only
> > identify the streams but also know how to get at least the key
> > information out of them. ie The demux has to know how to parse the
> > header of every possible codec header format it will offer.
> >
> > Now, why isn't this an issue with every other codec i assume you are
> > thinking?
>
> To clarify here, it's my understanding that format parameter lookup is
> a feature of the AVI and ogm container formats (and asf, presumably)
> not of any of the specific codecs. Is this correct?

Yes... sorry, container format, not codec.
>
> That's why lookup of this information is always possible there, and not
> for ogg, even if we provide a convenience library that can do the header
> parse for all the codec embeddings it knows about, as I think derf was
> suggesting.
>

As i mentioned in the email reply a few minutes ago, this is all fine if
you accept that every time you have a new codec, you need a new helper
library. Doing the header parse itself is not a big deal; it's not
particularly difficult to write a helper library as a new codec is
added, but from a practical point of view this is not a good choice.
Ideally you do it once and it works for all others (which ogm and avi
do).
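
To be clear about the scale of the parse: for vorbis, the two fields the
audio renderer cares about sit at fixed offsets in the identification
header on the BOS page. A sketch (illustrative, not my actual filter
code):

#include <cstdint>
#include <cstring>

bool ParseVorbisIdHeader(const uint8_t* p, size_t len,
                         int* channels, long* sampleRate)
{
    // Packet type 0x01 then "vorbis", then a 32-bit version, an 8-bit
    // channel count and a 32-bit little-endian sample rate.
    if (len < 16 || p[0] != 0x01 || memcmp(p + 1, "vorbis", 6) != 0)
        return false;
    *channels   = p[11];
    *sampleRate = (long)p[12] | ((long)p[13] << 8) |
                  ((long)p[14] << 16) | ((long)p[15] << 24);
    return true;
}

The point isn't that this is hard to write; it's that the demux needs
one of these for every codec it will ever offer a media type for.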

It means that if you have an older version of the demux and it doesn't
recognise the header, you basically have nothing. You have no way to
know if it's a damaged/invalid stream or if you just don't know how to
parse it.

> Practically speaking, I think this can be dealt with. After all, being
> able to identify a codec by FOURCC doesn't help if you can't find an
> implementing dll. From the point of view of DirectShow, it's just a
> limitation of this particular container format.
>

That's true, if you can't find an implementing .dll then it's no
immediate help. But i don't see that this is a valid argument... in any
container format, if you don't have the codec you can't play it; this
holds true for all formats.

However if you have a GUID (globally unique identifier), you can
automagically download it, install it and then you can play it.

So by having a GUID you at least know which component it is you want and can
get it, without this the codec is nothing more than random data.
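
That lookup is mechanical once you have the GUID. A sketch of the COM
side (the GUID string here is a placeholder, not a real registration):

#include <dshow.h>

IBaseFilter* CreateDecoderByGuid()
{
    // The GUID names the component exactly; if it isn't installed, the
    // same GUID is the key you would use to fetch and register it.
    CLSID clsid;
    CLSIDFromString(L"{00000000-0000-0000-0000-000000000000}", &clsid);

    IBaseFilter* pFilter = 0;
    HRESULT hr = CoCreateInstance(clsid, 0, CLSCTX_INPROC_SERVER,
                                  IID_IBaseFilter, (void**)&pFilter);
    return SUCCEEDED(hr) ? pFilter : 0;
}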

> Not knowing anything about them, I'd guess that quicktime can
> optionally provide a table with this information, and that MPEG
> program streams, like ogg, don't provide much beyond the packet types.
> How does DirectShow handle those containers?
>
> > The related issue is that of identifying streams... the codec
> > identifier has no bounds, there is no way to say this is the end of
> > the identifier, and this is the rest of the header. In other words
> > \001vorbis is pretty much indistinguishable from \001vorbis2. How can
> > you tell if the 2 is part of the identifier or the rest of the
> > header?
>
> Yes. It's well defined in specific codec specs, but more flexible in
> general. Just looking file-magic style at some of the initial bytes
> should always work.
>

It *should* work but there are no guarantees; the only assurance
currently is that the number of defined headers is small. Currently the
shortest codec header i know of is flac (4 bytes), so in order to do any
successful mapping you can only use 4 bytes... any longer and flac will
never be uniquely identified. So effectively the number of significant
bytes of the codec header is 4, and if only four are significant, that
makes the rest insignificant and pretty much just sugar. What if someone
decides to implement a picture codec with a header "jpg"... then we are
down to 3. If you have an old version of the demux that thinks 4 is the
magic number, you will never be able to uniquely identify this length-3
header, as the 4th byte will likely be variable.
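
Concretely, the identification ends up as prefix matching against a
hard-coded list, with all the fragility that implies (a sketch; the
"vorbis2" entry is hypothetical):

#include <cstddef>
#include <cstring>

struct Magic { const char* bytes; size_t len; const char* codec; };

static const Magic kMagics[] = {
    { "\x01vorbis2", 8, "vorbis2 (hypothetical)" },  // must come first!
    { "\x01vorbis",  7, "vorbis" },
    { "\x80theora",  7, "theora" },
    { "fLaC",        4, "flac"   },  // the 4-byte floor described above
};

const char* IdentifyCodec(const unsigned char* bos, size_t len)
{
    // Longer patterns must be tried first, because nothing stops one
    // codec's identifier from being a prefix of another's.
    for (const Magic& m : kMagics)
        if (len >= m.len && memcmp(bos, m.bytes, m.len) == 0)
            return m.codec;
    return 0;  // unknown: a damaged stream, or just a codec we don't know?
}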

At least if the header was null terminated or fixed size, this is not an
issue, and i don't see how that restriction imposes any great problems.
Or in the reverse, i don't see what advantage you get by having
arbitrary identifiers. The whole purpose of the identifier is to
identify, ideally uniquely. So why not enforce it rather than rely on
people to *hopefully* create identifiers that don't cause conflicts.

> > Using the start stamp scheme we can resync as we hit a page. As we
> > get a page we know what time this page starts at, and we then have a
> > reference point to determine start and end times of every subsequent
> > sample in that stream. This means less seek back.
>
> This is another good example of problems with the end-time granule.
> Thanks.
>
> > As for stream duration, i see no problem with having an empty EOS
> > page which has the end time in it.
>
> The only problem here is that you can't rely on the page being there
> (the stream might be truncated, and in fact may explicitly be so in
> Ogg Vorbis). So it's sugar, not something that's 'built-in' to the
> format design.
>
> > But from the sounds of it, this isn't the general consensus.
>
> Dunno. Sounded like Aaron was on your side. :)
>
> Cheers,
>  -r

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'theora-dev-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.


