[theora-dev] Re: Decoder accessing uninitialized variable... Re: [theora-dev]Using Theora in games?

Thu Feb 26 09:51:17 PST 2004

> May I add that if the output is garbled, it is preferred that it is always
> garbled the same way? I noticed that splayer shows current contents of

I think that's an issue with the way SDL is being used. There's probably 
a way to clear the window before displaying it. But the actual codec 
doesn't return garbage in the first frame.

> hardware overlay buffer for one frame before actually decoding anything. You
> can see last frame from the previous run of the program. Having something
> like that in the decoder would make it very hard to debug. It is better to
> have all buffers (like MVect here, but also including larger temporary and
> decoding buffers) cleared to 0 on initialization, than to guess whether
> something like this will happen or not. Is there some reason (performance
> maybe?) that it is not done so?

It is in some places. Like ClearDownQFragData. You can see below what 
kind of impact that has.

> I assume that the 3 frames include what you actually get back from
> theora_decode_YUVout(). I.e. the theora_decode_YUVout() function just
> rearranges the internal pointers and strides and returns that as a pretty
> structure, right?

Yes and no. If post processing is disabled, then pointers into these 
reference frames can be returned directly. Otherwise, a separate buffer 
needs to be used to store the post-processed video, because Theora does 
not have filtering in the loop. I'd forgotten about that. Add another 
3*w*h/2 bytes (Theora still returns pointers into its own internal 
buffers in either case).
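
To make the 3*w*h/2 figure concrete: it is one full-size luma plane 
plus two quarter-size chroma planes. A minimal sketch, assuming 4:2:0 
subsampling, 8-bit samples and even dimensions; yuv420_frame_bytes is a 
hypothetical helper for illustration, not part of the Theora API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes in one 4:2:0 frame with 8-bit samples. */
static size_t yuv420_frame_bytes(size_t w, size_t h) {
    assert(w % 2 == 0 && h % 2 == 0);   /* assumption: even dimensions */
    size_t luma = w * h;                /* one Y plane */
    size_t chroma = (w / 2) * (h / 2);  /* each of the Cb and Cr planes */
    return luma + 2 * chroma;           /* equals 3*w*h/2 */
}
```

For a 640x480 stream this comes to 460800 bytes per frame, not counting 
any padding around the picture.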

There was some call for an API which lets the application specify the 
buffer to decode into. This allows decoding directly into video memory, 
possibly writing only the updated blocks. The difficulty with this is 
Theora needs to keep its own copy of each decoded frame with 16 pixels 
of padding on each side (to handle unrestricted motion vectors), from 
which lots of reads are performed during motion compensation (reads from 
video memory being slow). It might be possible to write post-processing 
output directly to an application-specified buffer, but I haven't really 
investigated the PP code yet (which I need to do; there are parameters 
in there that are quantizer-dependent, and thus should be in the stream 
headers but currently are not).
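
As a rough illustration of why the private copy is needed, here is a 
sketch of a single plane with a 16-pixel border, laid out so that a 
motion vector pointing slightly outside the picture still lands in 
valid memory. The padded_plane type and its functions are hypothetical 
names, not Theora's actual internal structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define REF_PAD 16  /* border width, as described above */

typedef struct {
    unsigned char *base;    /* start of the allocation, border included */
    unsigned char *origin;  /* pixel (0,0) of the visible picture */
    size_t stride;          /* bytes per padded row */
    size_t width, height;   /* visible picture size */
} padded_plane;

/* Allocate a zeroed plane with REF_PAD pixels of border on every side. */
static int padded_plane_init(padded_plane *p, size_t w, size_t h) {
    assert(p != NULL);
    p->stride = w + 2 * REF_PAD;
    p->base = calloc(p->stride * (h + 2 * REF_PAD), 1);
    if (p->base == NULL) return -1;
    p->origin = p->base + REF_PAD * p->stride + REF_PAD;
    p->width = w;
    p->height = h;
    return 0;
}

/* Read a pixel; x and y may reach up to REF_PAD pixels outside the picture. */
static unsigned char padded_plane_at(const padded_plane *p,
                                     ptrdiff_t x, ptrdiff_t y) {
    assert(x >= -REF_PAD && x < (ptrdiff_t)p->width + REF_PAD);
    assert(y >= -REF_PAD && y < (ptrdiff_t)p->height + REF_PAD);
    return p->origin[y * (ptrdiff_t)p->stride + x];
}

static void padded_plane_free(padded_plane *p) {
    free(p->base);
    p->base = p->origin = NULL;
}
```

An application-supplied buffer in video memory would normally have no 
such border, which is one reason the decoder cannot simply use it as a 
reference frame.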

> I'd volunteer to test this, if it is not beyond my comprehension of the
> source. But I'd need you to point me to it.

At this point I'm wary of investing more time in the current source, and 
am instead looking to adapt the encoder I wrote from scratch to do 
decoding, too. That encoder already does something extremely similar to 
this pre-scan in order to accumulate per-fragment bitrate statistics, 
though, so that would not be too hard to adapt. We can talk about it 
more off-list, if you like.

> I took the liberty of running VTune over it, and here are the stats for
> the top ten cycle-spenders:
> 
> 13.37% video_write

This is essentially copying Theora's internal buffer into video memory, 
and nothing more. The goal of decoding directly into video memory is to 
eliminate this copy.

> 10.39% ReconInterHalfPixel2
>  8.99% oggpackB_read

This is the time spent reading bits from the bit stream. The problem is 
that the current Huffman decoder reads one bit at a time, which is slow. 
The way to optimize it is to read more than one bit at a time wherever 
the codes form a small, complete sub-tree (which is often the case). 
Code to do this for the new decoder I mentioned has been written but is 
not yet tested, as the rest of the decoder hasn't been written.
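
To show the general idea of consuming several bits per step, here is a 
sketch of the simplest variant: a flat lookup table indexed by the next 
3 bits of a toy prefix code (A=0, B=10, C=110, D=111). This is not the 
sub-tree code from the new decoder, and it does not use the oggpackB 
API; the bitreader type, br_peek and huff_decode are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader over a byte buffer. */
typedef struct {
    const uint8_t *buf;
    size_t len;   /* buffer length in bytes */
    size_t pos;   /* index of the next unread bit */
} bitreader;

/* Peek at the next n bits (1..16) without consuming them.
   Bits past the end of the buffer read as zero. */
static unsigned br_peek(const bitreader *br, unsigned n) {
    assert(n >= 1 && n <= 16);
    size_t byte = br->pos >> 3;
    uint32_t window = 0;
    for (size_t i = 0; i < 3; i++) {   /* gather 24 bits */
        window <<= 8;
        if (byte + i < br->len) window |= br->buf[byte + i];
    }
    unsigned shift = 24u - (unsigned)(br->pos & 7) - n;
    return (unsigned)(window >> shift) & ((1u << n) - 1u);
}

typedef struct { uint8_t sym; uint8_t len; } huff_entry;

/* Every 3-bit pattern maps to the symbol whose code is its prefix. */
static const huff_entry huff_table[8] = {
    {'A', 1}, {'A', 1}, {'A', 1}, {'A', 1},  /* 0xx */
    {'B', 2}, {'B', 2},                      /* 10x */
    {'C', 3},                                /* 110 */
    {'D', 3}                                 /* 111 */
};

/* Decode one symbol with a single table lookup; -1 if out of data. */
static int huff_decode(bitreader *br) {
    if (br->pos >= br->len * 8) return -1;
    huff_entry e = huff_table[br_peek(br, 3)];
    if (br->pos + e.len > br->len * 8) return -1;  /* truncated code */
    br->pos += e.len;
    return e.sym;
}
```

The trade-off is table size: a flat table needs 2^n entries for a 
maximum code length of n, which is why collapsing only small complete 
sub-trees is attractive for longer codes.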

>  8.27% ReconRefFrames
>  7.78% ReconInter
>  6.41% ClearDownQFragData

A little use of memset in here would probably speed things up a bit.
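
Something along these lines, as a sketch only: I am assuming 64 16-bit 
coefficients per fragment and a list of coded fragment indices, and the 
names (frag_coeffs, clear_coded_frags) are made up for illustration 
rather than taken from the Theora source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COEFFS_PER_FRAG 64  /* assumption: one 8x8 block per fragment */

typedef int16_t frag_coeffs[COEFFS_PER_FRAG];

/* Zero the coefficients of every coded fragment with one memset each,
   instead of an element-by-element loop. coded[] holds fragment indices. */
static void clear_coded_frags(frag_coeffs *frags, const int *coded,
                              int ncoded) {
    assert(ncoded >= 0);
    for (int i = 0; i < ncoded; i++) {
        memset(frags[coded[i]], 0, sizeof(frag_coeffs));
    }
}
```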

>  6.04% FilterHoriz
>  5.19% UnPackVideo

This 5% is all overhead due to the coefficient ordering. No actual 
unpacking occurs in this function, just indexing and accounting. This 
overhead increases as frame size grows and cache coherency falls.

>  4.54% FilterVert
>  4.22% CopyBlock

The interesting things missing from this list are the dequantization and 
iDCT functions (which were assembly optimized in the VP3 source, but are 
pure C in Theora for portability).

> Those are self-times (not including callees), from a 640x480 29.97fps
> video-only stream. I can also run call-graph time analyses, if this above is
> too moot. If I can read it right, Huffman decoding (UnPackVideo?) is not the
> main problem. Also, perhaps some more aggressive inlining might help here?
> Note that I'm not familiar with the code yet, so maybe I'm talking rubbish.
> If so, please do correct me. If someone can point me to a small and simple
> function that could significantly benefit from peephole asm optimization, or
> rewriting in MMX specific code, or similar, I could assign one or two of our
> asm programmers to it sometime in the near future. (Some of the Recon*
> functions look like potential MMX candidates.)

Those functions already have MMX versions in the VP3 source, as do the 
PP functions. This is available as the vp32 module in Xiph's CVS.

They were taken out of the theora module because the main goal of a 
reference decoder is clarity, not platform-specific optimizations. Clean 
and efficient design is a good thing for a reference decoder; 
improvements in this area will be accepted gladly. Hand-coded assembly 
will not. I'm not discouraging you from pursuing the latter, as clearly 
your goals are different from ours. I think someone maintains a patch to 
add MMX optimizations to the Vorbis reference decoder, for example. A 
similar patch here would not be out of line, but I doubt it would ever 
be included in the main CVS module.