[theora-dev] Re: Decoder accessing uninitialized variable... Re: [theora-dev]Using Theora in games?

Thu Feb 26 10:03:00 PST 2004

> For macro block modes which do not have a motion vector, the values
> stored in FragMVect are never used. There's a list of which is which in
> the ModeUsesMC array. You can verify that in all the cases in which
> MVect is not initialized, the entry in this array is 0, and so the
> values are never actually used.

Ahh, I see... thanks for clearing this up.
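
To spell out the point for other readers, here is a minimal sketch of how a per-mode table can gate access to a stored motion vector. Only the name ModeUsesMC comes from the quoted explanation; the mode list, the `mv_t` type, and `effective_mv` are hypothetical and are not copied from the decoder source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical coding modes, for illustration only; the real decoder's
   mode list and ordering may differ. */
enum { MODE_INTER_NO_MV, MODE_INTRA, MODE_INTER_PLUS_MV, MODE_COUNT };

/* 1 if the mode reads a motion vector, 0 otherwise.
   Plays the role that ModeUsesMC plays in the quoted text. */
static const int mode_uses_mc[MODE_COUNT] = { 0, 0, 1 };

typedef struct { int x, y; } mv_t;

/* Returns the vector to apply for a fragment. The stored vector is only
   read when the table says the mode uses one, so an uninitialized entry
   is never touched for the other modes. */
static mv_t effective_mv(int mode, const mv_t *stored)
{
    mv_t zero = { 0, 0 };
    assert(mode >= 0 && mode < MODE_COUNT);
    return mode_uses_mc[mode] ? *stored : zero;
}
```

With a table like this, a stale or uninitialized vector slot is harmless as long as every read goes through the table check.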

> yet, however. In theory the decoder should work (but possibly produce
> garbled output) regardless of the bit content of the input. My guess is

May I add that if the output is garbled, it is preferable that it is always
garbled the same way? I noticed that splayer shows the current contents of the
hardware overlay buffer for one frame before actually decoding anything, so you
can see the last frame from the previous run of the program. Behavior like
that in the decoder would make it very hard to debug. It is better to
have all buffers (like MVect here, but also the larger temporary and
decoding buffers) cleared to 0 on initialization than to guess whether
something like this will happen or not. Is there some reason (performance,
maybe?) that this is not done?
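
What I have in mind is roughly the following. This is only a sketch: `alloc_zeroed` is a hypothetical helper, not a function in the decoder, and I am assuming the buffers are allocated once at initialization so the cost of zeroing is paid only once.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a decoder buffer with every byte zeroed,
   so a corrupt stream produces the same garbage on every run. */
static unsigned char *alloc_zeroed(size_t nbytes)
{
    assert(nbytes > 0);
    return calloc(nbytes, 1);   /* calloc zero-fills; may return NULL */
}
```

The caller still has to check for NULL and free() the buffer when the decoder is torn down.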

> Current tools sort pages by the end time of the last packet, which does
> not quite give optimal buffering. However, in practice the discrepancy
> won't be more than a few video frames worth of audio.

Ok, we'll test this to see whether it buffers too deeply.

> The big numbers are 9*(w+32)*(h+32)/2 bytes for the 3 reference frame
> buffers required and 24*63*80 bytes (~118 K) used for the Huffman codes.

I assume that the 3 frames include what you actually get back from
theora_decode_YUVout(). I.e. the theora_decode_YUVout() function just
rearranges the internal pointers and strides and returns that as a pretty
structure, right?
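
For reference, the two quoted formulas can be evaluated like this. The function names are mine and purely illustrative; only the arithmetic comes from the quoted text.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for the 3 reference frame buffers, 9*(w+32)*(h+32)/2, as quoted. */
static size_t ref_frame_bytes(size_t w, size_t h)
{
    assert(w > 0 && h > 0);
    return 9 * (w + 32) * (h + 32) / 2;
}

/* Bytes quoted for the Huffman code tables, 24*63*80. */
static size_t huffman_bytes(void)
{
    return (size_t)24 * 63 * 80;
}
```

For our 640x480 stream, ref_frame_bytes(640, 480) gives 1548288 bytes (about 1.48 MB), and huffman_bytes() gives 120960 bytes, which matches the ~118 K figure.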

> However, the current decoder also uses a ton of extra storage space for
> decoding the DCT tokens (more than 6*w*h/2 bytes) due to the ordering of
> these coefficients in the bit stream. I think it's possible to avoid
> this by scanning each packet twice, which may even result in a speed win
> for large frame sizes because of better cache coherency. No one's done
> this, or tested the results, however.

I'd volunteer to test this, if it is not beyond my comprehension of the
source. But I'd need you to point me to it.

> There are a number of other places
> the decoder could be significantly sped up, too, such as in decoding the
> Huffman codes themselves.

I took the liberty of running VTune over it; here are the stats for the top ten
cycle-spenders:

13.37% video_write
10.39% ReconInterHalfPixel2
 8.99% oggpackB_read
 8.27% ReconRefFrames
 7.78% ReconInter
 6.41% ClearDownQFragData
 6.04% FilterHoriz
 5.19% UnPackVideo
 4.54% FilterVert
 4.22% CopyBlock

Those are self-times (not including callees), from a 640x480 29.97fps
video-only stream. I can also run call-graph time analyses if the above is
not informative enough. If I read it right, Huffman decoding (UnPackVideo?) is
not the main problem. Also, perhaps some more aggressive inlining might help here?
Note that I'm not familiar with the code yet, so maybe I'm talking rubbish.
If so, please do correct me. If someone can point me to a small and simple
function that could significantly benefit from peephole asm optimization, or
from being rewritten in MMX-specific code, or similar, I could assign one or
two of our asm programmers to it sometime in the near future. (Some of the
Recon* functions look like potential MMX candidates.)
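
To illustrate the kind of loop I mean, here is a scalar sketch of a half-pixel reconstruction kernel: average two reference blocks, add the residual, clamp to 8 bits. This is my assumption about what a function like ReconInterHalfPixel2 does, not the decoder's actual code; the function name, signature, and the 8x8 block size are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit sample. */
static uint8_t clamp255(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Hypothetical scalar kernel for one 8x8 block. stride is the row pitch
   of dst, ref1 and ref2; residual is stored as 8 packed rows of 8. */
static void recon_half_pixel(uint8_t *dst, const uint8_t *ref1,
                             const uint8_t *ref2, const int16_t *residual,
                             int stride)
{
    assert(stride >= 8);
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int pred = (ref1[x] + ref2[x]) >> 1;   /* truncating average */
            dst[x] = clamp255(pred + residual[x]);
        }
        dst += stride;
        ref1 += stride;
        ref2 += stride;
        residual += 8;
    }
}
```

Each row is eight independent byte operations with saturation, which is the pattern that packed SIMD instructions handle well.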

Thanks,
Alen
