[vorbis-dev] More Ogg Video discussion

Jelle Foks jelle-foks at list.dedris.nl
Wed Oct 4 08:14:05 PDT 2000



Kenneth Arnold wrote:
> 
> Okay, _finally_ getting something back on this. Let's see if we can't get
> things moving again.
> 
> On Fri, Sep 15, 2000 at 01:02:24PM +0200, Jelle Foks wrote:
> > Kenneth Arnold wrote:
> > >
> > > On Thu, Sep 14, 2000 at 02:30:17PM +0200, Jelle Foks wrote:
> > > > See my in-line comments.
> > >
> > > Ditto.
> > >
> > > > Kenneth Arnold wrote:
> 
> [snip]
> 
> > > > > * Three levels: packet, frame, and field. Packet holds all the stuff that
> > > > >   should naturally go together and is otherwise worthless when split up.
> > > > >   (I'm thinking streaming here). Field is collection of packets that
> > > > >   describes part of a frame. It may pull information from a lot of sources,
> > > > >   e.g., raw image data, data from frames earlier / later (with an arbitrarily
> > > > >   adjustable window), "scratch" area, whatever. It should have the capability
> > > > >   to embody vector graphics, arbitrary transforms, effects, etc. even if the
> > > > >   encoder can't pick them out from a source video (if it could, that'd be
> > > > >   great, but that gets very compex). Maybe field == packet; I need to think
> > > > >   some more about that. But by "part of a frame", I mean a level of detail
> > > > >   as opposed to a region (although region might be useful also). Object
> > > > >   descriptions are hierarchical in importance by nature; the codec should
> > > > >   take advantage of this. Coding should be done residually, i.e., take as
> > > > >   much information about the frame as can be embodied relatively simply, and
> > > > >   repeat with what's left over. The amount of complexity per independent
> > > > >   block should be adjustable over a wide range. Each block iteration
> > > > >   (hierarchical level) could be assigned a priority, and when streaming, the
> > > > >   transport could choose to only send the blocks above priority x. Different
> > > > >   methods could be used to formulate these blocks, possibly even different
> > > > >   methods for different blocks describing the same area. This would allow
> > > > >   motion estimation to be used for entire objects, and e.g. wavelets for
> > > > >   details about the object. The definitions and implementations of the
> > > > >   residue and coding areas are left for later, to allow for more than
> > > > >   enough flexibility (I hope).
> > > > > * Every frame should be able to reference back to frames before it, i.e.,
> > > > >   no MPEG's I frames (except maybe at the beginning of the stream).
> > > >
> > > > If there are too many dependencies upon 'previous data', such as what
> > > > happens when you send/store I-type image data only very occasionally,
> > > > then you will have very slow or difficult seeking, channel zapping, etc.
> > >
> > > This is why such things should be adjustable through a wide range.
> >
> > True.
> >
> > > Depending on the quality of the algorithm and tuning that can be done,
> > > the default should behave much like conventional streaming. But leave
> > > in the design the ability to incorporate more diff data, as is suited
> > > for the specific application (e.g., with a DVD, you can read data off
> > > pretty fast and get back to a far-away I-type frame without much delay,
> > > but an NLE setup could reprocess the data (the format should be structured
> > > to make this easy) so that almost every frame is I-type).
> >
> > Getting back to the last I-type frame is not enough if you have
> > differential frames following it that can be reference frames themselves
> > (P-type). In that case, seeking to a random point in the video stream
> > requires seeking to the previous I-type frame plus decoding of all
> > P-type differential frames after that until the desired seek
> > destination.
> >
> > But of course, in some applications, such seeks may be acceptable. I
> > agree that for some applications, a couple of seconds delay when seeking
> > may not be a problem for the user.
> >
> > Even if you want maximum compression with best quality without caring
> > about seeking time or transmission errors, there is an optimal balance
> > for using I-frames, P-frames, and B-frames. Just sending as few as
> > possible I-frames (many bits) and as much as possible B-frames (few
> > bits) doesn't result in the best compression.
> 
> Must be able to optimize for cases like DVD, too -- no transmission errors,
> fast reads, decent computing power to the decoder == large seek requirement
> generally okay, little to no redundant data.
> 
> You bring up the "optimal balance" -- my vision near the start was that the
> data could be encapsulated in such a way that restructuring the same data
> to do, say, large redundancy vs. small redundancy would involve a minimal
> amount of recoding. I realize that this is logistically quite difficult, but
> given a proper framework (i.e., subdivided frames), this could conceivably
> be quite useful. Still thinking about the specifics, but if multiple schemes
> are used for compression of the same frame, it seems like each could be
> packaged more or less independently with not too much more work. Anybody?
> 
> > When B-type frames are used, the seeking problem disappears, but you get
> > stuck with more difference entropy with the reference frames because the
> > reference images were taken too long ago and don't resemble the current
> > images anymore. If you wait long enough, then each B-type frame contains
> > the entropy for two images, one to cancel out the reference image, and
> > one to encode the new image (in which case it's often better to use
> > I-type macroblocks in the B-type frame). That's why they're using P-type
> 
> So you detect scene switches (likely a simple loss in ability to find
> a redundancy for the two images) and stick in an I-type frame. 

When choosing I, P, and B-type image coding strategies, there are many
aspects to weigh: encoder latency, memory usage, and complexity, and
decoder latency, memory usage, and complexity. It's easy to overlook one
aspect when trying to optimize for another.

Scene switches are not the only cause of degraded B-type frame
efficiency. For example, a sequence of gradual scene variations that
slowly replaces the objects on the screen with new ones eventually
replaces the entire screen without a single scene switch, and still
results in a loss of efficiency for predicted images.

Of course, an optimal encoder that is not concerned with seek time or
encoding and decoding latency will use longer and longer sequences of
B-type frames until the 'profit' of a longer B-type sequence diminishes,
then insert a P-type frame, and occasionally an I-type frame to counter
cumulative errors.

The optimal length of a sequence of B-type frames depends strongly on
the video scenes being encoded.

However, it is _very_ intensive on the encoder side to _find_ that
optimum, so at best an encoder can hope to approach it. The reason it's
intensive is the bidirectional prediction of B-type frames: each B-type
frame is predicted from both temporally neighboring reference frames,
the one preceding it and the one following it. Hence, B-type frames can
only be encoded (and examined for size after compression) _after_
choosing which frame will be the next reference frame. So, for each
sequence of B-type frames, all B-type frames must be compressed once
for each evaluated position of the next reference (I or P-type) frame.
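
To make that search concrete, here is a rough brute-force sketch in
Python. The cost functions are toy stand-ins (nothing from a real
encoder), but the structure shows the point: every candidate position
of the next reference forces a re-compression of all the B-type frames
before it.

    # Rough sketch: pick the next reference position by exhaustively
    # re-encoding the intermediate B-type frames for every candidate.
    # encode_ref() and encode_b() are toy stand-ins for real
    # compression; here they just return a bit count.
    def choose_next_reference(frames, prev_ref, max_b_run, encode_ref, encode_b):
        best_run, best_cost = 1, float('inf')
        for run in range(1, min(max_b_run, len(frames)) + 1):
            next_ref = frames[run - 1]              # candidate next reference
            cost = encode_ref(next_ref)
            for b in frames[:run - 1]:              # re-compress every B in between
                cost += encode_b(b, prev_ref, next_ref)
            cost_per_frame = cost / run             # normalise per coded frame
            if cost_per_frame < best_cost:
                best_run, best_cost = run, cost_per_frame
        return best_run

    # Toy usage: frames are just indices, a reference costs 10x a B-frame,
    # and B-frames get more expensive the farther they sit from a reference.
    frames = list(range(30))
    best = choose_next_reference(
        frames, prev_ref=-1, max_b_run=16,
        encode_ref=lambda f: 100_000,
        encode_b=lambda f, p, n: 10_000 + 2_000 * min(f - p, n - f))
    print(best)   # -> 13 with these toy numbers

Note that already for one sequence this is O(max_b_run^2) B-frame
compressions, so a real encoder has to settle for heuristics.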

For live video, where the seek problem is not an issue, there is still
a limit on the length of a sequence of B-type images: since the
temporally later reference frame is needed by the decoder to
reconstruct the B-type images, it must be transmitted before the B-type
images that lie in between the reference images. Hence, the longer the
sequence of B-type frames, the longer the encoding latency.

> Possibly
> save the redundancy / motion comp / background data buffer and use it
> again if/when camera switches back to first scene. 

AFAIK MPEG4 allows this with its object structure. In order to transmit
a video stream, the encoder sends both object definitions and a 'scene
definition'. The scene definition defines which objects are to be
placed (or rendered) on the viewer's screen. I believe that with BIFS,
these objects can even be 3d models with mapped 2d textures, rendered
just like the graphics in Doom/Quake. The 'data buffer' you're
referring to is made up of the objects that are not currently visible.

> My earlier "progressive
> build" idea was based on being able to retranmit a buffer like this over
> many frames after the scene switch, so if the receiving end didn't pick up
> on the previous scene, it would be able to get it eventually and reconstruct
> this data (if saving the stream) or just do the best it could without the full
> data (if streaming), or somewhere between the two with variable buffering. 

So in fact, instead of sending each reference frame once, you'll be
sending each reference frame at least twice. Let's evaluate by example
whether that is feasible. Consider a system with B-type and I-type
images only, with sequences of B-type images separated by I-type
reference frames.

On average, a decoder that tunes in at a random point will have
'missed' the first half of the B-type sequence.

In this example, reference frames #0 and #1 are the two reference
frames for the current B-type sequence, and #2 is the temporally later
reference frame for the next B-type sequence. For reconstruction of the
current B-type sequence, both reference frames #0 and #1 are needed in
the decoder.

Normally, for a given sequence of B-type images, reference frame #0 is
already known because it was the temporally later reference during the
previous B-type sequence. That reference frame becomes the new
temporally preceding reference image. The new sequence is transmitted
by first sending reference image #1, followed by the B-type images that
are temporally in between references #0 and #1.

For any complete, successful reconstruction of the B-type sequence
_after_ the current one, the decoder will need to reconstruct reference
image #1 from the redundant data. If you start the second transmission
immediately with the first B-type image in a sequence, then a decoder
will only have a full reconstruction if it has been listening since
that first B-type image. When you start the redundant transmission
halfway through the B-type sequence, then full reconstruction happens
on average 50% of the time.

So, on average, this will result in reducing the seek delay by half the
average duration of a B-type sequence, and to do that we're sending each
reference frame completely twice.

Then why not simply double the reference frame rate? That achieves
roughly the same result, with much less complexity and less encoding
delay.
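
With some assumed, purely illustrative numbers, the trade looks like
this:

    # All numbers assumed for illustration only (25 fps, ~1 s B-type runs).
    ref_bits = 200_000      # bits for one reference frame
    b_bits   = 20_000       # bits for one B-type frame
    b_run    = 24           # B-type frames between two references

    normal    = ref_bits + b_run * b_bits            # reference sent once:   680,000
    redundant = 2 * ref_bits + b_run * b_bits        # reference sent twice:  880,000
    doubled   = 2 * ref_bits + (b_run - 1) * b_bits  # twice the ref rate:    860,000

    print(normal, redundant, doubled)

The redundant scheme and the doubled reference rate cost roughly the
same extra bits (this toy model ignores that B-frames close to a
reference compress a bit better), but doubling the reference rate also
halves the worst-case seek delay and needs no extra machinery in the
decoder.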

I think the medicine is as bad as the disease here. If it's possible to
reconstruct the data from the received stream, then there is no reason
for -any- decoder to keep it in a buffer in the first place.

If I understand correctly, you're suggesting to minimize the reference
data to allow maximum compression, and then re-add the reference data
to fight the adverse effects. I think you'll end up right back where
you started.

> But
> if the receiving end did get the previous stream, this could improve compression,
> or at least provide the means to keep bitrate more or less constant (for streaming)
> (i.e., useful data in queue waiting for filler if a frame compresses well). 

The only effective way to make a bitstream more constant is by reducing
the peaks. If you're not reducing the peaks, then you'll still need the
maximum bandwidth available if you don't want to drop frames. If you're
filling up the valleys, then you're just increasing the probability of
frame drops when transmission hiccups occur during the valleys. If
you're filling up the valleys with data that is absolutely required
during decompression, then you'll definitely be more vulnerable to
transmission hiccups (more data to lose). If you're filling up the
valleys with data that is not required for successful decompression,
then any decompressor can completely ignore that data, so it's better
to leave it out.

Filling in the valleys is always easy. The peaks occur at the
hard-to-compress scenes, and those are exactly the scenes where there
is no room left for redundant data. So you're filling up the valleys
without lowering the peaks, and hence the bit-rate necessary for
transmission without frame drops is not reduced.

I don't see any reason why a stream with a higher average bit-rate and
an equal maximum bit-rate is any better. It also makes combining
multiple streams on a single transmission channel less feasible.
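
A tiny sketch of that point, with made-up per-frame sizes:

    # Made-up frame sizes in bits, 25 frames/s assumed.
    frame_bits = [30_000, 32_000, 400_000, 35_000, 28_000, 390_000, 31_000]
    fps = 25

    peak_rate = max(frame_bits) * fps                 # what the channel must carry
    avg_rate  = sum(frame_bits) / len(frame_bits) * fps
    padded    = max(frame_bits) * fps                 # pad every frame up to the peak

    print(peak_rate, avg_rate, padded)   # 10 Mbit/s, ~3.4 Mbit/s, 10 Mbit/s

Padding the valleys gives a "constant" bit-rate, but the rate you have
to provision for (the peak) hasn't moved; only the average went up.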

> And
> if the file is streamed and saved at the same time (think TiVo), that the decoder
> could go back and fill in the earlier blanks with the later data. See IDCT problem,
> however.
> > frames. The problem of cumulative errors in P-type frames is handled by
> > requiring regular transmission of I-type image data (does not have to be
> > an entire frame, as long as each macroblock is transmitted as I-type at
> > least once every 230 or something frames). If supported by the decoder,
> 
> Right, you could get by with staggered Is and Ps -- this might be very useful
> for frames in which detail varies greatly by location in the frame.

Each P-type frame adds noise at the level of the DCT mismatch,
independent of the magnitude of the image differences, unless the DCT
mismatch happens to be lower at lower coefficient magnitudes (i.e.,
smaller image differences). AFAIK, in the current standards the maximum
allowed DCT mismatch is not a function of the coefficient magnitude.
Hence, the maximum allowed time between two I-type images (or image
parts) does not depend on the image content.

> > and with a well implemented encoder, the effect of cumulative errors can
> > also be removed (in the more recent standards) by having the the encoder
> > define which approximation it is using for the IDCT for the reference
> > frame reconstruction, so that the decoder can optionally use exactly the
> > same method. The cost of that is that the decoders must support every
> > possible IDCT approximation (there are a lot), and/or limiting the
> > encoders and decoders to one 'golden standard IDCT approximation', which
> > limits the degrees of freedom for cost-effective (hardware)
> > implementations.
> 
> ... assuming such a match is possible (maybe). Definately it is too much to
> expect every encoder to support every IDCT, but maybe two or three would be
> acceptable for quality? (e.g., a very accurate approximation, and a very fast
> approximation, then stuff in between)

Accuracy is not an issue at all for cumulative errors, because the
encoder uses reconstructed reference frames (after decompression). The
issue is the similarity between the IDCTs that the encoder and the
decoder use for reconstruction. They can both be very inaccurate, as
long as they are exactly the same inaccurate IDCT in both the encoder
and the decoder.

It's not the encoder that needs multiple implementations. Unless the
decoder can control which IDCT the encoder uses, the _decoder_ would
have to support every IDCT and match it with the one used by the
encoder, so that it can be sure to have the same reconstructed reference
frames as the encoder had.
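
Here is a deliberately crude 1-d demo of the drift mechanism. Nothing
in it resembles a real 8x8 DCT; the two "IDCTs" differ only in their
fixed-point rounding, and the residuals aren't even quantized. It only
shows that the mismatch, not the inaccuracy, is what accumulates:

    import math, random

    def idct_a(x): return math.floor(x)           # one rounding choice
    def idct_b(x): return math.floor(x + 0.499)   # a different, equally "accurate" one

    random.seed(1)
    pixels = [random.uniform(0.0, 255.0) for _ in range(50)]   # 50 one-pixel "P-frames"

    enc_ref = dec_ref = 0
    drift = []
    for p in pixels:
        residual = p - enc_ref                 # what gets transmitted
        enc_ref = idct_a(enc_ref + residual)   # encoder reconstructs with ITS OWN idct
        dec_ref = idct_b(dec_ref + residual)   # decoder reconstructs with a DIFFERENT one
        drift.append(abs(enc_ref - dec_ref))

    print(drift[0], drift[-1])   # the gap keeps growing, frame after frame

Replace idct_b with idct_a and the drift stays at zero forever, even
though floor() is a terrible approximation of anything.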

Note that fast and slow are implementation-dependent properties. What
may be fast or small on one chip, hardware or software, may be slow or
large on another.

> > > > >   Okay,
> > > > >   so maybe there should be I-frames, but use them more carefully.
> > > >
> > > > If 3d transforms are used, then there is not much need for something
> > > > like a I/P/B-type frame concept, because you're looking at muti-frame
> > > > data coefficients in the transformed domain. Here, the depth in time of
> > > > the 3d transform is similar to the 'I-frame frequency' in 2d-transform
> > > > coding.
> > >
> > > Yup. But unless I've missed some very important development, no algorithm
> > > is yet perfect. A 3D transform may (hypothetically here) not be able to
> > > capture some specific sort of data as well as an I/P (and maybe B).
> >
> > I could be wrong of course but I believe the largest reason why 3D
> > transforms haven't been used often yet is cost and coding delay. A 3d
> > transform requires both the encoder and decoder to have sufficient RAM
> > for all decompressed frames in the 3d transform block (maybe not in the
> > decoder if some sort of progressive reverse transform is possible, which
> > I expect to always be at the cost of additional computations), and
> > possibly buffering of the transform coefficients because of the need of
> > coefficient re-ordering for efficient entropy coding. All this buffered
> > data costs money in hardware, and results in delay in the transmission
> > path. If you have to process N frames before you can finish the N-frame
> > 3d transform, then I think you'll have to do magic to get your
> > encoder+decoder delay below (or even at) N frames (because the last
> > transmitted bit may contain data about the first frame in the block).
> 
> I guess I just need to look more at 3d transforms to understand what can
> be done about that. Could you find some links for these beasts?
> 
> > The decorrelating effect of a well chosen transform has already proven
> > to beat predictive coding in the two spatial dimensions, so why not in
> > time? The main question that remains is: Can we afford the RAM and can
> > we live with the latency or find a magic trick to reduce it?
> 
> ... or can we find a way to use a variable-width block so it's not a yes-
> or-no decision like that? 

Unless you accept that the display stutters during viewing
('buffering...'), the decompressor will always have to wait for the
longest possible transform delay initially, before it starts to
display.

> A codec that is applicable (but also good) at
> a wide variety of situations should be accepted most readily. Magic tricks
> are good, too :) (but they can be hard to find)
> 
> > > Perhaps
> > > the best thing to do would be to allow for those sorts of "legacy"-type
> > > codes, but optimize the encoder for whatever works better most of the time.
> > > Or better, loosen the I/P concept to not just raw video data, but perhaps
> > > parameters for the 3d transforms or the left-over data after doing the
> > > 3D transforms, or various other metadata, whatever can be thought of.
> >
> > Agreed, at the 3d-block boundaries, there may still be some correlation
> > between the 3d transform blocks that could be reduced with some form of
> > prediction. I guess that all depends on the chosen 3d block size, and
> > the duration of the scenes (how long do image object properties remain
> > intact/predicatble?).
> 
> Hmmm... ask MPEG. Is it necessary that all the 3d blocks be the same size?

See my note above. If you don't wait long enough initially, you'll have
to halt the display when you encounter a 3d block of longer duration.
So the worst-case delay is what counts if you care about playback
without interruptions.

Sure, you can maybe win some coding efficiency by making shorter 3d
blocks at scene switches, but you won't reduce the delay: you still
have the initial worst-case-delay wait, or interrupted playback when
you encounter the longer blocks.
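
In numbers (illustrative only):

    # Assumed 3d block lengths in frames; playback at 25 fps.
    block_lengths = [8, 8, 8, 32, 8, 16, 8]
    fps = 25

    startup_delay = max(block_lengths) / fps   # 1.28 s, set by the longest block,
    print(startup_delay)                       # even though most blocks are short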

> Is it absolutely necessary that they all be the same size within a frame?
> If not, there's an encoder-side optimization here (could be pretty hefty).
> 
> > > Intersperce a brief general note here: You all likely know a lot more than
> > > me about how video compression works. But if we're going to get a DVD on a
> > > ZIP disk (MPEG-4 has conveniently upped the ante ;), it is obviously necessary
> > > to try something new, perhaps seemingly stupid, or totally random.
> >
> > "Hey, people will never fly, Mr Wright, go tell that to your brother."
> > ;-))
> >
> > But there's still room for some sceptisicm and not unwise to look at
> > what is currently percieved as the limit of what is possible, right?
> 
> Of course.
> 
> > According to Shannon, "Mr Information", we only need to encode the
> > information (entropy), and how much real information does a feature film
> > or television program really have? I'd say 50,000 pages of text
> > (=100mbyte?) may actually be enough to describe two hours of video in
> > detail (how long is the movie script?). The question is how, and how
> > long does the decoder need to reconstruct the pixels, and how much are
> > the encoder and decoder going to cost?
> 
> But not only do we only need to encode the entropy, we only need to encode
> the entropy that the average human will notice. Time is going to be an
> increasing non-issue as CPUs keep getting faster, and a codec that wants
> to last for any decent length of time should be able to max out even a
> future CPU. (Hey, I'd wait a day to decompress The Matrix if somebody
> got it down to 100 MB; poor me is stuck with 33.6k...)
> 
> > > But
> > > whatever results should have as much flexibility and power as possible, such
> > > that you can just throw in an encoder on the default settings and it'll work
> > > and maybe adapt on its own, but that you can endlessly tweak it so it is
> > > optimized for the specific use. It would be cooler that way ;)
> >
> > Agreed.
> >
> > > Go ahead, argue with me. You're likely right, but that's something I'll
> > > have to deal with later.
> >
> > I'm not saying you're wrong, I'm just trying to shed some light on
> > issues you may be overlooking.
> 
> And I'm obviously overlooking a lot. Hopefully by the end of this I won't be :)
> 
> > > > the I-P-B frame types are a direct result of the current 2d transform
> > > > coding methods using predicive coding in the time domain. Back in the
> > > > old days, image compression method even did predictive coding in the
> > > > pixel domain, but when they stepped over towards transform coding, then
> > > > there was no need to keep doing that. The only place where prediction
> > > > remains is on the boundaries of the transforms: in 8x8 DCT coding (MPEG,
> > > > JPEG, H.26x), this is at the DC DCT coefficients plus in the time
> > > > domain.  In NxMxQ 3d-wavelet coding, prediction will only help at the
> > > > edges of the pixels and the group of frames that are transformed as a
> > > > whole.
> > > >
> > > > >   Possibly
> > > > >   a lossless compression could be made from them...
> > > >
> > > > Lossless compression can be made from any compression method where the
> > > > residual entropy is sufficiently low. Lossless compression doesn't
> > > > require I-frames.
> > >
> > > Wrong thought connection. That's why I normally don't write after 10:00 PM.
> > >
> > > >
> > > > >   but back to the main
> > > > >   issue here: a typical viewer will be watching the video for at least 100
> > > > >   megabits before [s]he even starts to worry about quality as opposed to
> > > > >   content.
> > > >
> > > > Unless the viewer is receiving the stream over a 56k POTS modem or
> > > > similar. Even on 512kbit ADSL that is still more than 3 minutes.
> > >
> > > Bad number. Maybe the point was (can't remember exactly) to allow
> > > adjustment to whatever situation the streamer encounters. Yes, 512kbit ADSL
> > > should be able to get whatever quality it can, but a 15-20 Mb DTV should
> > > also should be able to get what it needs.  In that case, 100 Mb is 5-7 sec.,
> > > still off, but until the viewer has made the decision to keep watching a
> > > program, the video doesn't need to look perfect. If [s]he takes a long time
> > > in deciding, the quality should be improving anyway.
> > >
> > > Clearer explanation: when viewer first starts receiving a stream, quality
> > > doesn't really matter; the video could just show rough, blocky object outlines
> > > for a couple of frames until it gets more data;
> >
> > As a sort of fade-in effect? sounds good.
> 
> ... but as part of the codec. Again, adjustable as much as practical.
> 
> > > as the viewer becomes more
> > > preceptive to the details of the video, the quality should be improving at
> > > about the same rate as more of the dependent I-type data is sent, but spread
> > > out over potentially up to even 2 sec. worth of data before it is noticed.
> > > Better than waiting for the I-frame (MPEG context) to display anything.
> >
> > I doubt whether that can be done efficiently. The redundant transmission
> > of I-type data may not be able to keep up with reference frame changes
> > when P-type frames are used. So, either refrain from using P-frames as
> > well, accepting the B-type frame problems of gradually increasing
> > difference entropy, or increase the bit-rate of the redundant data. I
> > wonder whether the breakeven point here is anywhere better than simply
> > increasing the I-frame frequency. The latter at least is a lot less
> > complex (I mean implementation complexity here).
> 
> Then have provisions for redundant data, but note that if you can stand a
> bit of delay before the video comes up to full quality, the decoder doesn't
> even need to acknowledge its existence if it doesn't need it. But there will
> likely be some decoders that could have the extra complexity to benefit from
> some redundancy.

Using layered bitstreams, where the base layer has low delay and the
enhancement layers have longer delays, will achieve the same effect.

> > > Try again: the player should be smart enough to display something useful with
> > > only the diff data, however this gets used. The codec system should be
> > > structured so that this is easy. I suppose a real purpose for this will become
> > > more evident with further development.
> >
> > Use scalable bitstreams where the base layer (of lower resolution, image
> > quality, and frame rate) has a high frequency of I-type frames and give
> > the enhancement layers the much lower frequency of I-type frames. That
> > seems to me the obvious way to do get that effect.
> 
> yup. Sort of what I was getting at. 

The MPEG standards and the H.26x standards all support layered
profiles.

> But more general; if we wind up not using
> an I/P/B-type system, that this framework could apply there too.

Agreed, but remember that it's a lot easier to solve if you look at it
as a layered bitstream: a low-quality, low bit-rate, low-latency base
layer plus a high-quality, high-latency, high bit-rate enhancement
layer. There is no need to consider the base layer 'redundant data';
when you're transmitting it anyway, you can just as well use it (in
MPEG and H.26x, the base layer reference frames are used as a predictor
for the enhancement layer reference frames).
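
A minimal sketch of that structure on 1-d "frames" (toy downsampling,
no transforms or entropy coding, all names made up):

    def downsample(frame):                 # keep every other sample
        return frame[::2]

    def upsample(small, n):                # nearest-neighbour back to length n
        return [small[min(i // 2, len(small) - 1)] for i in range(n)]

    def encode(frame):
        base = downsample(frame)                          # low-rate, low-delay layer
        predictor = upsample(base, len(frame))
        enhancement = [f - p for f, p in zip(frame, predictor)]   # residual only
        return base, enhancement

    def decode(base, enhancement=None):
        if enhancement is None:                           # base layer alone is usable
            return upsample(base, 2 * len(base))
        predictor = upsample(base, len(enhancement))
        return [p + e for p, e in zip(predictor, enhancement)]

    frame = [10, 12, 14, 13, 11, 9, 8, 8]
    base, enh = encode(frame)
    assert decode(base, enh) == frame      # both layers: full quality
    print(decode(base))                    # base only: rough but valid picture

The base layer is decoded and used in any case; the enhancement layer
only ever carries what the base-layer prediction missed.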

> > > > >   So I-frames can be very sparse.
> > > >
> > > > I'de hate to be able to seek only to 3-minute or more intervals or wait
> > > > up to three minutes after each seek because the decoder needs to
> > > > reconstruct sufficient 'history'. Also I'd hate to be able to zap
> > > > through channels at only one channel per three minutes.
> > >
> > > Maybe, but would you really complain that much if the video wasn't
> > > _perfect_ for 3 minutes, as an optional tradeoff for the ability to get
> > > better quality after that wait than without it?
> >
> > Sometimes I would, sometimes I wouldn't. When looking at a good movie, I
> > won't be seeking much anyway, but when fast-forwarding through boring
> > scenes of a recorder television program, I'd like to keep the high
> > quality. Then again, for fast-forwarding is easier to fix that problem
> > than for seeking. When channel surfing, I'll probably accept a lower
> > quality, as long as I quickly get sound and enough image to recognize
> > what they're broadcasting.
> 
> So if there is a hierarchical system, decode only the low-level frames
> when seeking.
> 
> > Yes this is of course subjective, and application dependent, so the best
> > way to deal with it is to keep it a flexible parameter, so that the
> > application or user can make the tradeoff.
> >
> > > > >   The tradeoff is more redundancy in the diff frames.
> > > >
> > > > Not completely. In your proposal all difference frames are changing the
> > > > reference frames, because each decompressed difference frame can be a
> > > > reference frame. In that case you have the problem of accumulated
> > > > errors. Especially in transform coding, where various implementations of
> > > > decoders may not be bit-true equal (due to various decoding
> > > > environments: processors, hardware (read: differences in rounding,
> > > > optimizations, efficiency and available data types)). After each
> > > > difference frame that is used as reference frame, the reference
> > > > available decoder deviates a bit more from the reference available in
> > > > the encoder, resulting in increased differences in the reconstructed
> > > > images.
> > >
> > > Consider my proposal to be missing in some major areas. As for errors, that's
> > > where both the redundancy and the adjustment come in. If the situation is
> > > more prone to errors, you first ask whether or not you really want to be
> > > working with video in that situation in the first place, and then you tell
> > > the encoder to increase the data redundancy, such that the errored data
> > > eventually gets replaced by correct data.
> >
> > I did not mean about transmission errors here, I meant that the IDCT (or
> > IWT) transform in the decoder can be implemented in different ways that
> > each give slightly different results. Two different implementations may
> > have the same precision (calculated as MSE with the full precision
> > (slow) IDCT), but different exact output. If you use fixed-point, which
> > is cheapest in hardware, then changing the order or structure of the
> > calculations, then you get different quantization (roundoff errors).
> > Also, when the algorithm calls for specific functions, such as the
> > logarithm, then it's very expensive to use full precision logarithms
> > (costs many computational iterations to calculate), so it's very nice to
> > be able to use an approximation function for the logarithm, of which
> > many variations are possible as well, each of the resulting in slightly
> > different results. Some hardware may also do fixed-point rounding a bit
> > different that other hardware (see for example "MAD", Rob Leslie's
> > fixed-point implementation of MPEG1/2 audio decoding for various
> > fixed-point platforms).
> 
> Decoders in hardware can be a pain to deal with, yes. But with CPUs shrinking
> in power consumption and cost and yet getting ever faster (Crusoe anyone?), how
> much longer is the pure-hardware, non-FPU model staying around?

Pretty long, especially for encoders that at least try to be efficient.
Remember that a general purpose processor (CPU) is also hardware, built
from the same transistors as special-purpose ASICs. The difference is
that the CPU is not optimized for one specific task, but for the
average task. General purpose processors are always more expensive than
ASSPs or dedicated hardware, unless the production quantity is low.
ASSPs and dedicated hardware always win on power consumption over
general purpose hardware when compared on the same ASIC technology.

The only thing that happens is that the minimal chip complexity (due to
'pad limitation' on the ASICs) or the minimal production quantity (due
to the increasing initial cost of production) for the break-even point
goes up, but it's still way out of reach for video.

> > > Perhaps the decoder could employ
> > > a trust system, whereby it would track how many lossy processing steps it
> > > has gone through and if unicast, ask the server to send the data again, or
> > > in multicast, wait for new data, which could itself need processing before
> > > it becomes useful, so compare trust values to know which to use and which
> > > to throw away.
> >
> > I think it's best up to the encoder to keep track of error accumulation.
> > In unicast applications you can use a negotiation sequency to see if the
> > encoder and decoder have matching implementations, or at least find the
> > implementations with the best match. Then, the encoder will know when
> > the cumulative errors become too large. In a broadcast situation, the
> > encoder has to assume the worst match within spec, or some viewer with
> > an full spec decoder may still be stuck with a bad video quality.
> >
> > In a multicast situation, it may be possible to use scalable bitstreams
> > here as well, where each receiver subscribes to the base layer, plus the
> > enhancement layer that best matches its decoder implementation plus
> > bandwidth budget. Of course, this means that the encoder needs to encode
> > the video with multiple implementation variants. In one-to-many
> > transmission situations, the encoder cost is less of an issue anyway.
> 
> The idea (as stated above and earlier) was that re-encoding should be easy.

That may be hard to achieve.

>
> Think MPEG2 to DivX (i.e., MPEG4) -- how long that takes? 5-10 hours for a DVD?
> That's not a small issue. Even inter-MPEG2 might be difficult; I've never tried.
> So inter-Ogg should be, like Vorbis, easy to just strip out the excess and reencode,
> except not only the bitrate can change here, but the rundundant data, I-frame
> frequency, etc.

Recoding in lossy compression is always very tricky and almost by
definition results in additional loss of image quality. The only really
feasible recoding is when you're deliberately recoding to a lower
quality or a higher bit-rate.

> > > > >   Each diff frame should transmit the diff, plus some
> > > > >   data that the viewer should know if it's been watching since the last
> > > > >   I-frame.
> > > > >   This would allow streaming to be able to take advantage of scene
> > > > >   similarity without worrying too much about the consequences of lost data.
> > > > >   Possibly the redundant data could have a temporal component attached also,
> > > > >   so when the video is saved to disk after streaming, it could be moved to
> > > > >   the proper place where it should have been first introduced and then
> > > > >   removed as much as possible to keep redundancy to a minimum on a fixed
> > > > >   medium (key point: the stream is not the compressed video. They work together
> > > > >   but both can be modified to hold the same or similar data in a more optimal
> > > > >   manner). Another key point: there's a lot you can tune here (amount of
> > > > >   redundant data transmitted, frequency of I-frames, etc.). More flexibilty.
> > > > > * VBR of course. But since streaming often works best when bitrate is constant
> > > > >   (TCP windows, if streaming over TCP), allow the redundant data to be filled
> > > > >   in whenever the data size is otherwise small.
> > > >
> > > > If the bitstream can occasionally have a higher bit-rate than the
> > > > transmission medium, this results in latency (due to buffering).
> > > >
> > > > Dropping frames is not a good solution here, because that is nothing
> > > > more than very bluntly reducing the VBR ceiling, which can better be
> > > > done inside the coding algorithm.
> > >
> > > In almost all cases, the streamer should have knowledge of the approximate
> > > bit-rate ceiling for the medium, at least on average. Even a multicast
> > > streamer could put out different streams for differently connected viewers
> >
> > That's where H.26x and MPEG use scalable bitstreams, with a base layer
> > of low bit-rate and one or more enhancement layers for better video
> > quality for receivers that can handle the bit-rate.
> 
> Yes, hierarchical coding. That's just for the DCT coeffients and motion vectors
> I presume? 

Nope: Image size, frame rate, DCT coefficient precision, everything.

> Our codec as I propose would have a lot more that could be hierarchically
> coded. It would be interesting if somehow the hierarchical video coding could
> integrate with the double hierarchy of COFDM or another _broadcast_ system so that
> the real benefits of hierearchy show (so in bad reception areas you at least get
> part of your stream, instead of the usual choppy audio and frozen video).

That's channel coding, which should not be confused with source coding
(the compression itself). If you mix the two, you're likely to end up
with a suboptimal, non-flexible system. Simply use better forward error
correction (Reed-Solomon, turbo codes) on the base layer to achieve
what you want.
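
A deliberately crude sketch of unequal error protection, with a 3x
repetition code standing in for real Reed-Solomon or turbo codes:

    def protect(data, copies=3):                 # heavy protection: repeat it
        return [bytearray(data) for _ in range(copies)]

    def recover(copies):                         # majority vote per byte position
        return bytes(max(set(col), key=col.count) for col in zip(*copies))

    base_layer  = b"base layer packet"           # gets the strong protection
    enhancement = b"enhancement packet"          # sent as-is; may simply be lost

    sent = protect(base_layer)
    sent[1][0:4] = b"XXXX"                       # simulate corruption of one copy
    assert recover(sent) == base_layer           # base layer still survives

With the base layer protected this way, a receiver in a bad-reception
area keeps at least the low-quality picture, without the compression
layer having to know anything about the channel.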

> > > (and the alteration should be easy because the data should already be
> > > split and prioritized, and varying output bitrate should involve little more
> > > than just cutting the low-priority (i.e., fine detail) data and adding the
> > > appropriate amount of redundant data, difference coding, etc. that the
> > > format should also make easy.
> > >
> > > The latency thus needs to be adjustable also. In some situations it matters,
> > > in many it doesn't. The codec shouldn't be encumbered by dealing with both
> > > the same way.
> > >
> > > For the dropping frames, that was almost _exactly_ why I had the frame-
> > > independence idea. That way, no matter how important the data in that frame
> > > was, future frames should be able to use it anyway. Also, instead of just
> > > dropping the frame, if the decoder has a buffer of sufficient size (recalling
> > > Real's prebuffering of even an audio stream over 33.6 connection... horrors!
> > > but it worked quite well...), the frame could be interpolated from the ones
> > > before and after, and if the redundant data comes in within the buffer window,
> > > it could use that to further reconstruct the frame.
> >
> > Interpolation is not a way to recover lost information, just a method to
> > occasionally reduce percieved distortion. For example, interpolation
> > does not work at all on scene change boundaries.
> 
> But in some cases it may just happen to work nicely (e.g. a steady pan).

If it works well, then the entropy in the image is low, and a good
encoder would already have encoded those frames in a few bits, so there
wouldn't have been a bit-rate peak. When you're interpolating frames
that were dropped because of a bit-rate peak, those frames are hardly
of low entropy.

> > > Hmmm... the idea just popped into my head of a variable _frame_ rate...
> > > seems interesting to me as an option; what do you think? What could be done
> > > with it? Removing constraints one at a time... ?
> >
> > I think both H.26x and MPEG allow for 'dropped frames' as an encoder
> > decision, which is a rough quantization of variable frame rate, because
> > the encoder can regulate the frame rate that way. I think it's mostly a
> > very rough measure to deal with instantaneous bit-rate increases, but it
> > may of course also be applied as very high compression where two
> > sequential frames have no difference at all.
> >
> > Maybe there is a solution in-between, where the encoder can tell the
> > decoder: "I'm skipping this frame, and if you want to reconstruct it I
> > suggest predicting the motion yourself if you can and extrapolating" or
> > "don't try to predict motion here, just repeat the last frame" or "just
> > extrapolate without motion estimation". Note that skipped-frame
> > _inter_polation in the decoder introduces another 40ms (1 frame) delay
> > in the decoder.
> 
> Agreed. Maybe instead of your yes-or-no reconstruct, the encoder could send
> this kind of suggestions on the macroblock level, e.g. say reconstruct this
> macroblock, but this other one is basically the same.

Be careful adding data at a low level: you could very well end up with a
high bit-rate increase. A lot of small suggestions encode to one hell of
a lot of bits.
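
A quick estimate with assumed numbers shows how fast that adds up:

    # Assumed: 720x576 frame, 16x16 macroblocks, 2 bits of 'suggestion'
    # per macroblock per frame, before any entropy coding.
    mb_per_frame = (720 // 16) * (576 // 16)     # 45 * 36 = 1620
    bits_per_mb  = 2
    fps          = 25

    overhead = mb_per_frame * bits_per_mb * fps
    print(overhead)                              # 81,000 bit/s for one small 'trick'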

Experiments will have to show, for each and every one of such low-level
'tricks', whether they work (heck, that's what the ITU guys making
H.263 version 3 have been doing for years now).

> > > > > * Scratch pad to save previous data. e.g. if scene is switching between two
> > > > >   talking heads, should save data associated with one when switching to other.
> > > >
> > > > AFAIK, MPEG4 solves that by separating object descriptions and image
> > > > structure. In other words: in MPEG4, not all known and/or previously
> > > > known objects must be displayed at all times. This allows an encoder to
> > > > 'keep' some objects across scene switches.
> > >
> > > But how does it deal with picking up the stream in the middle? Need to know
> > > because I/we need to do better :)
> >
> > AFAIK: with I-type descriptions of the objects.
> >
> > > >
> > > > >   Key point is that maybe viewer didn't catch that old data; maybe send it
> > > > >   before stream starts playing, or put it in the redundant frames. First
> > > > >   sounds nice if you're not multicasting; second is more suited for
> > > > >   broadcasting.
> > > >
> > > > It's all dependant on the application, many applications won't accept
> > > > the latency and other problems you get if you trade everything off
> > > > against maximum compression. The Ogg Video codec should be able to
> > > > produce the perfect stream for each application, but not every Ogg Video
> > > > stream has to be perfect for each application. Hence, keep all that
> > > > parametrizable, and keep as much of the details outside of the standard
> > > > and codec, let the application decide which parameters tickle it's
> > > > sweetspot. I think a video stream format is best kept simple: KISS (keep
> > > > it simple, stupid).
> > >
> > > On the dot; exactly what is first on my mind. But there are certain things
> > > needed in the codec to allow for maximum parameterization. So the KISS part
> > > becomes the framework under which you should be able to throw in just about
> > > anything imaginable that the viewer could sanely deal with.
> >
> > Agreed.
> >
> > > Among those
> > > things is the varying algorithms by region, residual-based codes,
> > > backreferences, and all that other sweet stuff.
> >
> > Yes, definitely.
> >
> > > But data should be very
> > > componentized; seperable as much as possible while not impeding the ability
> > > to remove redundancy. Maybe wavelet-lets? :)
> >
> > Just be careful that the flexibility may come at a cost: complex
> > standards take a long time before there are many applications for it,
> > are error-prone due to differences in interpretation of 'how it was
> > meant', and may create too much overhead in the bitstream, software code
> > size, or hardware controller size.
> >
> > Sometimes trying to change a working algorithm to remove it's
> > disadvantages may also remove it's advantages. I all Ogg Video people
> > (we, you, them) should just try it out and see what we can do.
> 
> I need to start coding _something_ to get a feel for working with video...
> 
> > > >
> > > > > * Assume viewer knows everything about the stream you sent, then either the
> > > > >   viewer could ask (unicast better again) or the streamer could just resend
> > > > >   anyway (multicast) the missing data.
> > > > >
> > > > > Spewing a lot to myself above, and I really didn't mean to spew that much,
> > > > > but chew on it and tell me what you think. That's the product of probably
> > > > > about 15 minutes of mostly continuous thought that is very likely disjointed
> > > > > and missing some key information still locked somewhere in my head, so don't
> > > > > take it as written in anything but sand sprinkled in tide pools. It's also
> > > > > 11:00 PM local time, so I may have gone insane and not known about it.
> > > > >
> > > > > The bit of judgement in me that hasn't gone to sleep yet is telling me that
> > > > > this is a good place to stop.
> > > > >
> > > > > Kenneth
> > > > >
> > > > > PS - I'm going to really like reading that when I'm more awake. It'll be fun.
> > >
> > > It was -- and I came up with a couple more ideas. Throw as much back as
> > > you can; discussion is good. And again, don't accept my ideas as solid,
> > > right, or wrong, but more as (hopefully) a source for inspiration,
> >
> > Ditto!
> >
> > > if any of
> > > that still exists in this world.
> >
> > I sure hope so!
> 
> me too.
> 
> >
> > > Kenneth
> >
> >
> > Cya,
> > Jelle
> 
> Kenneth
> 
> >
> 

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-dev-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.


