[vorbis] FREE at last

Jelle Foks jelle-foks at list.dedris.nl
Thu Oct 12 07:08:08 PDT 2000



Kenneth Arnold wrote:
> 
> On Wed, Oct 11, 2000 at 02:43:38PM +0200, Jelle Foks wrote:
> > Kenneth Arnold wrote:
> > >
> > > On Sun, Oct 08, 2000 at 11:05:38AM -0700, Ralph Giles wrote:
> > > > On Sun, 8 Oct 2000, Aleksandar Dovnikovic wrote:
> > > >
> > > > > > Now a question about VORBIS
> > > > > > I wonder when the next release of VORBIS is scheduled for?
> > > > > > Are there gonna be some more options to choose from (like LAME)?
> > > > > > Joint stereo perhaps?
> > > > >
> > > > > Yes, I would like to know that too.
> > > > > Monty, can you supply us with some answers?
> > > >
> > > > Last I spoke with monty, the features on the todo list for vorbis 1.0
> > > > are generally:
> > > >
> > > > Channel Coupling. Meaning joint stereo and hopefully also (joint)
> > > >       ambisonic surround.
> > > >
> > > > Cascading support. This is the promised 'bitrate peeling' feature for
> > > >       trivial transcoding to lower bitrates.
> > > >
> > > > Lower-bitrate Modes. Combines with the above.
> > > >
> > > > I don't know that there's any firm schedule for this, beyond the
> > > > aforementioned new year deadline from mp3 licensing. The above will
> > > > probably take at least two months together though, so I don't expect 1.0
> > > > before December at the earliest. Unless they punt. :)
> > > >
> > > > Makes sense to release another couple of betas in the meantime though. We
> > > > could actually do a beta3 real soon with just the bug fixes and code
> > > > re-org from the postbeta2 branch, but I'd wait until one of the above
> > > > features is added.
> > > >
> > > > <plug>
> > > > It would also be nice if we could get stream-description metadata in there
> > > > as well, if only to make it a less traumatic upgrade when tarkin happens.
> > >
> > > Tarkin? Where is that anyway?
> > >
> > > I've found some video codec stuff myself, and am seriously considering
> > > porting them over to the Ogg framework to ease playing around. Having
> > > not yet delved into code, I wonder about frame-sync issues -- how can
> > > I get a frame of video to match up with a position in the audio stream?
> > > Forgive me for asking if this is blatantly simple.
> >
> > Here's my 2ct worth:
> >
> > I don't think it's blatantly simple, at least for the decoder, because
> > that's just where you will find the synchronization problems. Not every
> > video frame will always have exactly the same amount of associated audio
> > samples, for the simple fact that the audio sampling clock may not be an
> > exact integer multiple of the video frame rate (especially when the
> > audio ADC and the video ADC each use their own crystals for the clock).
> > So, a frame of video can have a variable number of samples assigned to
> > it on the encoder side, plus a decoder may require a variable number of
> > samples.
> >
> > First an obvious but wrong way to do it, and then I'll suggest a correct
> > way to do it...
> >
> > What you may do is let the encoder assign samples to video frames at
> > will (it should be sufficient to indicate between which audio samples
> > the frame boundaries are, maybe numbering the frames so that recovery is
> > possible after packet loss). Then, the decoder can deal with it in
> > various ways, and I think it depends on the platform and application
> > which way should be used (there is no reason not to leave the option
> > open in the standard).
> >
> > For example, the decoder can make the video frame display dependent on
> > the time-base as given by the audio samples (N audio samples per second)
> > and display the video frames synchronized with the markers in the audio
> > stream. Or, the decoder can use the video frames as an absolute time
> > base (N video frames per second), and resample the audio samples so that
> > the audio stream stays synchronized.
> 
> Markers in the audio stream would be in this case the timestamped Vorbis
> audio frames. We talked about this earlier; audio should be the master,
> video &c slave, or everything slave to some master clock.

I'd go for the latter, since it's the least restrictive.  For the
encoder, it still leaves open the option to use either the audio or
video clock as a base for the master clock. Plus, it allows for simple
separation of the video and audio substreams while keeping their timing
(think mixing & editing).
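
Just to make that concrete, here's a rough sketch (names and layout are
purely hypothetical, this is not the actual Ogg page format) of what
per-packet timing against a shared master clock could look like:

  /* Hypothetical sketch only -- not the real Ogg framing. */
  #include <stddef.h>
  #include <stdint.h>

  typedef enum { STREAM_AUDIO, STREAM_VIDEO } stream_type;

  typedef struct {
      stream_type type;       /* which substream this packet belongs to   */
      uint64_t    master_ms;  /* capture time on the shared master clock,
                                 in milliseconds since stream start       */
      uint32_t    serial;     /* packet number, for recovery after loss   */
      size_t      length;     /* payload size in bytes                    */
      const unsigned char *payload;
  } av_packet;

Because every packet carries its own master-clock time, you can pull the
audio and video substreams apart, edit or remix them, and still put them
back in sync afterwards.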

> > However, if an encoder then makes a stream specified as 44.1 kHz audio
> > samples, and its crystal is off, resulting in 44.2 kHz audio samples
> > being transmitted, then the video stream will be structurally delayed by
> > roughly 0.2% of the elapsed time, which for a live video stream results
> > in a gradual increase of the delay, and the buffers at the receiver fill up.
> >
> > So, that's the wrong way... Here is what I think is the best way:
> 
> Which is basically exactly SMPTE timecode. Exact conformance with that
> standard should be not difficult and, as I have said before, possibly
> beneficial for integrating Ogg into a system that speaks SMPTE.

I don't know about SMPTE, but I agree with your reasoning.

> How to store this timecode data in the stream is a different question.
> We already have Vorbis, which has loose timestamping (as much as Ogg
> allows), but it can be difficult to seek to an exact time (i.e., to the
> exact 30th or 60th of a second where a video frame is), requiring more
> smarts for the decoder. Then the decoder must also keep track of the video,
> which may or may not (likely not) have a related timestamp system. Even if
> it does, there's more video information in a second than audio, and this
> can get hard to keep track of, especially when seeking arbitrarily. My
> suggestion, then, is an (optional?) metadata stream that maps audio, video,
> and other time-dependent data from whatever they are broken down into
> (frames for video, blocks for audio) onto a standard, high-resolution time
> format, doing what Jelle suggests earlier with frame-numbering but assigning
> times to numbers. Perhaps this stream could be removed when the stream is
> sent to the end viewer, and the decoder could reconstruct it if it needs
> it (because it's all just convenience information; the timestamps would do
> the job just as well with just some more decoder work).

I would suggest always integrating timestamps with both the video
stream and the audio stream, because that prevents loss of that data
when the stream is transmitted through a medium that has decided to
discard the optional metadata.

If each frame of video and block of audio has a time code, then that
should be everything necessary for transmission.  Your suggested
metadata stream is then fully redundant in information content, and since
its function is to allow faster random access, it's more like the index
file that sometimes accompanies a database. And since it's something the
decoder may or may not use, I think it's up to the decoder to define its
structure, create it, and use it when it finds that necessary. That's why I
think the relevance of including a standard definition for such a
metadata stream will be mostly as an example, or a suggestion, because
no incompatibilities arise when different decoders use different
formats.

Unless you are thinking about a medium where the cost of the additional
bits for the metadata stream is less than the benefit of instant, simple
random access for various decoder implementations?
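
To illustrate what I mean by an index: a decoder could build something
like this on its own while scanning the timecodes already in the stream
(layout purely hypothetical), and throw it away or rebuild it whenever
it likes:

  /* Hypothetical, decoder-private seek index.  It is fully redundant
     with the timecodes in the stream, so its exact layout does not
     need to be standardized. */
  #include <stdint.h>

  typedef struct {
      uint64_t time_ms;      /* timecode of a video keyframe            */
      uint64_t byte_offset;  /* where that frame starts in the file     */
  } seek_entry;

  /* binary-search for the last entry at or before target_ms */
  static long find_seek_point(const seek_entry *idx, long n, uint64_t target_ms)
  {
      long lo = 0, hi = n - 1, best = -1;
      while (lo <= hi) {
          long mid = lo + (hi - lo) / 2;
          if (idx[mid].time_ms <= target_ms) { best = mid; lo = mid + 1; }
          else                               { hi = mid - 1; }
      }
      return best;  /* -1 if target_ms is before the first entry */
  }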

> > Define an absolute time base in seconds (or milliseconds, whatever). For
> > example, the number of milliseconds elapsed since the beginning of the
> > stream. But I guess there is no real reason not to use the system time
> > of the computer for it either.
> >
> > Then, insert the absolute time codes of each video frame with each video
> > frame, and insert the absolute time codes of each block of audio samples
> > with each block of audio samples.
> 
> (the metadata stream)
> 
> > Then, the decoder can compare the absolute time codes of the video
> > frames and audio samples, and determine when to display the frames, and
> > whether or not to resample the audio (or skip or insert samples when
> > resampling is too computationally intensive).
> 
> Audio first, video second. In any case, it's a lot easier on both the
> decoder and the person watching for the decoder to just drop video frames.

Agreed, audience testing has shown that lip-sync errors, frame drops, etc.
are perceived as less disturbing than audio hiccups and audio delay, so
audio gets priority over video when low latency and zero loss can't be
achieved for both audio and video.

But even in a 'perfect' system, with an encoder and decoder without packet
loss and an algorithm without latency, there is jitter to take into
account, because the encoder and decoder are separate systems with no
common clock to which all ADC and DAC actions are synchronized. Some clocks
may be a bit faster than others, and there may be fluctuations.

Of course, a lot (most?) of the jitter will be temporary fluctuations,
so much of the effect can be dealt with by simply ignoring the jitter
when the time difference is below the threshold of perception (I believe
that's approximately 1-2 ms). But you can't simply make video a slave of
the audio in all applications: when the encoder's audio ADC clock is
structurally a bit faster than the audio DAC clock in the decoder, the
amount of 'time' stuck in the buffer grows, so it may still be necessary
to skip samples or resample, if the transmission duration is long enough.
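
A decoder could handle that structural drift with something as simple as
the sketch below (the ~2 ms threshold is just the perception figure from
above; the names are made up):

  #include <stdint.h>
  #include <stdlib.h>

  /* Sketch: decide what to do about drift between the encoder's
     timecodes and the decoder's local playback clock.  Differences
     below the ~2 ms perception threshold are ignored as jitter;
     larger, persistent drift is corrected by resampling, or by
     skipping/inserting samples where resampling is too expensive. */
  #define JITTER_THRESHOLD_MS 2

  typedef enum { SYNC_OK, SYNC_RESAMPLE, SYNC_SKIP_OR_PAD } sync_action;

  static sync_action check_drift(int64_t stream_time_ms,   /* timecode of the audio block  */
                                 int64_t playback_time_ms, /* local clock when it is played */
                                 int     can_resample)
  {
      int64_t drift = stream_time_ms - playback_time_ms;
      if (llabs((long long)drift) < JITTER_THRESHOLD_MS)
          return SYNC_OK;                     /* just jitter, ignore it    */
      return can_resample ? SYNC_RESAMPLE     /* stretch/shrink the audio  */
                          : SYNC_SKIP_OR_PAD; /* drop or duplicate samples */
  }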

> > In the event that the encoder clock is off, it's still not a huge
> > problem. For example if the clock at the encoder is 10% faster, then
> > after 10 seconds, according to the time stamps, the decoder will have
> > both 1 second worth of video data and 1 second worth of audio data in
> > its input buffers. It can then reduce the latency in case of a live
> > stream by simply discarding the buffer content and/or compensating for
> > the time base differences (by playing 1100ms worth of video and audio
> > data each second instead of 1000ms). Time base variations at the decoder
> > side are dealt with automagically too in that case.
> 
> A live input source should have some sort of synchronization that can be
> trusted to be reasonably accurate, e.g. a good NTSC source will always be
> (insert whatever value it is that is very close to 30 fps here),

30 * 1000/1001 fps for NTSC. Often implemented by running at 30 fps and
dropping one frame out of every 1001 (hence, there can be up to 33.3 ms of
jitter in the exact video frame times).
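
In code form (just the exact integer arithmetic behind those numbers):

  #include <stdint.h>
  #include <stdio.h>

  /* NTSC runs at 30 * 1000/1001 = 30000/1001 fps, so the exact capture
     time of frame n is n * 1001 / 30000 seconds.  Rounding that onto
     nominal 1/30 s slots is where the up-to-33.3 ms of jitter comes from. */
  static uint64_t ntsc_frame_time_ms(uint64_t n)
  {
      return n * 1001000ull / 30000ull;   /* truncated to whole milliseconds */
  }

  int main(void)
  {
      printf("frame 1:    %llu ms\n", (unsigned long long)ntsc_frame_time_ms(1));    /* 33 ms */
      printf("frame 1800: %llu ms\n", (unsigned long long)ntsc_frame_time_ms(1800)); /* 60060 ms,
                                         i.e. 60 ms behind a true 30 fps clock after one minute */
      return 0;
  }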

Maybe I'm being timing-paranoid, but why not use a timing method that
_allows_ ludicrously precise timing at negligible cost over sloppy timing?
(An encoder can always decide to be sloppy about which time codes it
adds to the stream.)
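
Back-of-the-envelope: a 64-bit timecode is 8 bytes per video frame, so at
30 fps that's 240 bytes/s, i.e. under 2 kbit/s, which is noise compared to
even a heavily compressed video stream.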

> by likely more accurate crystal than the computer, 

I wouldn't underestimate the effect of cost-reduction for the hardware
of cheap cameras. I'd like to see time stamps with the video frames as
well, especially if the source is one of those cheap webcams... Digital
cameras don't necessarily care about NTSC or PAL specs.

> and the audio is, of course, synced
> to the time it enters the computer in the first place (which, in a live
> recording, ought to be very close to the time it was generated by the
> source, or something's messed up with our concept of physics and quantum
> mechanics). 

Or the sound card used for capture has a buffer (many do), or it's a
recording of an outdoor event from a few hundred meters away (sound only
travels about 340 m/s, so that's already close to a second of delay).

Hmm, how to do distance estimation on sound sources ;-)))

> And the encoder should realize that the input buffer is growing,
> and compensate its clock much before it reaches 1 second of data. 

Unless the bit-rate variability demands 1 second's worth of bits in the
receiver buffer to guarantee continuous playback (read up on the VBV, the
video buffering verifier, in the MPEG documents for more info).

> In a live
> performance, you don't want 1 sec of buffer latency if it can be avoided.

If it can be avoided without side effects, it's always smart to do that,
not only from a user-perception point of view, but also from a decoder
buffer RAM requirements point of view.

Of course, if it's one-way and long distance, then most viewers won't
even notice 1 minute of delay in live streams, unless it's a sports
event and the neighbours have less delay and are very agitated and loud
;-)) It's bidirectional live communication where latency kills.

> > All that the Ogg standard needs is to define a place in the stream where
> > to add the time codes for the video and audio blocks/frames. The encoder
> > simply adds the time code when it grabs the samples or video frames, and
> > each decoder deals with the 'problems' by itself.
> 
> See my description above.
> 
> > Hope this helps,
> >
> > Cya,
> >
> > Jelle
> 
> You seem to know a lot. Why?

Why? I don't know why. Maybe I just talk a lot, or maybe there is actual
knowledge involved. Who is to tell? ;) And there's quite a lot that I
don't seem to know either. ;-)

Cya,

Jelle.

> Kenneth
> 
> > > > For my part, I'm just not sure Robert and I will have anything stable by
> > > > then. Some help with it would be greatly appreciated! :) We seem to
> > > > generally have consensus on the metadata elements, but not on how to
> > > > encode them, and I don't have a good handle on what we need to support in
> > > > the stream-description part.
> > > >
> > > > If we do it the way I want to, we need mng and xml substream support at
> > > > least, to span the required feature set, as it were. See my todo list at
> > > > http://snow.ashlu.bc.ca/ogg/todo.html for details.
> > > > </plug>
> > > >
> > > > Hope that helps,
> > > >  -ralph
> > > >
> > > > --
> > > > giles at ashlu.bc.ca
> > > > *crackle* The Director is a Humpback whale. Hold all calls. *crackle*
> > > >
> > > >
> > > >
> > > >
> > >
> >
> 
> --
> Kenneth Arnold <ken at arnoldnet.net> / kcarnold / Linux user #180115
> http://arnoldnet.net/~kcarnold/
> 

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.


