[theora-dev] Re: Decoder accessing uninitialized variable... Re: [theora-dev]Using Theora in games?
Timothy B. Terriberry
tterribe at vt.edu
Thu Feb 26 09:51:17 PST 2004
> May I add that if the output is garbled, it is preferred that it is always
> garbled the same way? I noticed that splayer shows current contents of
I think that's an issue with the way SDL is being used. There's probably
a way to clear the window before displaying it. But the actual codec
doesn't return garbage in the first frame.
> hardware overlay buffer for one frame before actually decoding anything. You
> can see last frame from the previous run of the program. Having something
> like that in the decoder would make it very hard to debug. It is better to
> have all buffers (like MVect here, but also including larger temporary and
> decoding buffers) cleared to 0 on initialization, than guess whether
> something like this will happen or not. Is there some reason (performance
> maybe?) that it is not done so?
It is in some places. Like ClearDownQFragData. You can see below what
kind of impact that has.
> I assume that the 3 frames include what you actually get back from
> theora_decode_YUVout(). I.e. the theora_decode_YUVout() function just
> rearranges the internal pointers and strides and returns that as a pretty
> structure, right?
Yes and no. If post processing is disabled, then pointers into these
reference frames can be returned directly. Otherwise, a separate buffer
needs to be used to store the post-processed video, because Theora does
not have filtering in the loop. I'd forgotten about that. Add another
3*w*h/2 bytes (Theora still returns pointers into its own internal
buffers in either case).
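
To make the arithmetic above concrete, here is a rough sketch of the frame memory estimate. It assumes 4:2:0 subsampling (which is where 3*w*h/2 comes from) and ignores the padding around the reference frames. The function names are made up for illustration and are not part of the Theora API.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one 4:2:0 frame: a full-size luma plane plus two
   quarter-size chroma planes, i.e. 3*w*h/2.
   Assumption: w and h are even; border padding is ignored. */
static size_t frame_bytes_420(size_t w, size_t h)
{
    assert(w % 2 == 0 && h % 2 == 0);
    return w * h + 2 * ((w / 2) * (h / 2));
}

/* Hypothetical helper: total for n_ref reference frames, plus one
   extra frame-sized buffer when post-processing is enabled. */
static size_t decoder_frame_bytes(size_t w, size_t h, int n_ref, int pp)
{
    return frame_bytes_420(w, h) * (size_t)(n_ref + (pp ? 1 : 0));
}
```

For a 640x480 stream this gives 460800 bytes per frame, so three reference frames plus a post-processing buffer come to 1843200 bytes before padding.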
There was some call for an API which lets the application specify the
buffer to decode into. This allows decoding directly into video memory,
possibly writing only the updated blocks. The difficulty with this is
Theora needs to keep its own copy of each decoded frame with 16 pixels
of padding on each side (to handle unrestricted motion vectors), from
which lots of reads are performed during motion compensation (reads from
video memory being slow). It might be possible to write post-processing
output directly to an application-specified buffer, but I haven't really
investigated the PP code yet (which I need to do; there are parameters
in there that are quantizer-dependent, and thus should be in the stream
headers but currently are not).
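
As a minimal sketch of why the decoder wants its own copy, here is one way a plane with a 16-pixel border on each side could be laid out. The struct and function names are hypothetical and do not come from the Theora source; the point is only that motion compensation can then read up to 16 pixels outside the visible area without bounds checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAD 16  /* border width in pixels, as described above */

/* Hypothetical padded plane; not a structure from the Theora source. */
typedef struct {
    unsigned char *base;   /* start of the allocation, border included */
    unsigned char *pixels; /* top-left visible pixel */
    size_t stride;         /* bytes per padded row */
    size_t w, h;           /* visible size in pixels */
} padded_plane;

/* Allocate a zeroed plane with PAD pixels of border on every side.
   Returns 0 on success, -1 if the allocation fails. */
static int padded_plane_init(padded_plane *p, size_t w, size_t h)
{
    assert(w > 0 && h > 0);
    p->stride = w + 2 * PAD;
    p->base = calloc(p->stride * (h + 2 * PAD), 1);
    if (p->base == NULL) return -1;
    /* Skip PAD rows of border, then PAD pixels of left border. */
    p->pixels = p->base + PAD * p->stride + PAD;
    p->w = w;
    p->h = h;
    return 0;
}

static void padded_plane_free(padded_plane *p)
{
    free(p->base);
    p->base = p->pixels = NULL;
}
```

A buffer supplied by the application would normally have neither the border nor fast read access, which is the difficulty described above.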
> I'd volunteer to test this, if it is not beyond my comprehension of the
> source. But I'd need you to point me to it.
At this point I'm wary of investing more time in the current source, and
am instead looking to adapt the encoder I wrote from scratch to do
decoding, too. That encoder already does something extremely similar to
this pre-scan in order to accumulate per-fragment bitrate statistics,
though, so that would not be too hard to adapt. We can talk about it
more off-list, if you like.
> I took a liberty to run VTune over it and here is the stats on top-ten
> cycle-spenders:
>
> 13.37% video_write
This is essentially copying Theora's internal buffer into video memory,
and nothing more. The goal of decoding directly into video memory is to
eliminate this copy.
> 10.39% ReconInterHalfPixel2
> 8.99% oggpackB_read
This is the time spent reading bits from the bit stream. The problem is
that the current Huffman decoder reads one bit at a time, which is slow.
The way to optimize it is to read several bits at once whenever the codes
form a small, complete sub-tree, which is often the case. Code to do this
for the new decoder I mentioned has been written, but is not yet tested,
as the rest of the decoder hasn't been written.
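
For illustration only, here is a sketch of the general idea of consuming several bits per step. It uses a plain fixed-width lookup table, which is a common generic technique and not the sub-tree scheme or the untested code mentioned above. All names are hypothetical, and it assumes no code is longer than LOOKUP_BITS and that bits are packed MSB first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOOKUP_BITS 4  /* assumed maximum code length for this sketch */

typedef struct { uint8_t symbol; uint8_t length; } lut_entry;

typedef struct {
    const uint8_t *data;
    size_t size;    /* buffer size in bytes */
    size_t bitpos;  /* number of bits consumed so far */
} bit_reader;

/* Return the next n bits (1 <= n <= 16), MSB first, without consuming
   them. Bits past the end of the buffer read as zero. */
static unsigned peek_bits(const bit_reader *br, int n)
{
    size_t byte = br->bitpos / 8;
    unsigned shift = (unsigned)(br->bitpos % 8);
    uint32_t window = 0;
    for (size_t i = 0; i < 3; i++) {
        window <<= 8;
        if (byte + i < br->size) window |= br->data[byte + i];
    }
    /* window holds 24 bits: skip `shift` leading bits, keep the next n. */
    return (window >> (24u - shift - (unsigned)n)) & ((1u << n) - 1u);
}

/* Register one code: every LOOKUP_BITS-wide index that starts with
   `code` maps to the same symbol and code length. */
static void lut_add(lut_entry *lut, unsigned code, int length, uint8_t symbol)
{
    assert(length >= 1 && length <= LOOKUP_BITS);
    unsigned first = code << (LOOKUP_BITS - length);
    unsigned count = 1u << (LOOKUP_BITS - length);
    for (unsigned i = 0; i < count; i++) {
        lut[first + i].symbol = symbol;
        lut[first + i].length = (uint8_t)length;
    }
}

/* Decode one symbol with a single table lookup. Returns -1 on an
   unassigned code or when the code would run past the buffer. */
static int decode_symbol(const lut_entry *lut, bit_reader *br)
{
    lut_entry e = lut[peek_bits(br, LOOKUP_BITS)];
    if (e.length == 0 || br->bitpos + e.length > br->size * 8) return -1;
    br->bitpos += e.length;
    return e.symbol;
}
```

The table must start zeroed so that unassigned entries have length 0. Each symbol then costs one peek and one lookup regardless of its code length, at the price of a table with 2^LOOKUP_BITS entries.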
> 8.27% ReconRefFrames
> 7.78% ReconInter
> 6.41% ClearDownQFragData
A little use of memset in here would probably speed things up a bit.
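
A minimal sketch of what that could look like, assuming the data being cleared is a contiguous array of 64 coefficients per fragment. The type and function names here are invented for the example; the real layout in ClearDownQFragData may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: 64 dequantized coefficients per 8x8 fragment. */
typedef int16_t frag_coeffs[64];

/* Element-by-element clear, one store per coefficient. */
static void clear_frags_loop(frag_coeffs *frags, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < 64; j++)
            frags[i][j] = 0;
}

/* Same result with a single memset over the contiguous block.
   All-bits-zero is the value 0 for integer types, so this is safe here. */
static void clear_frags_memset(frag_coeffs *frags, size_t n)
{
    assert(frags != NULL || n == 0);
    if (n > 0)
        memset(frags, 0, n * sizeof *frags);
}
```

This only helps when the fragments to clear are adjacent in memory; clearing a scattered subset would still need one call per fragment.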
> 6.04% FilterHoriz
> 5.19% UnPackVideo
This 5% is all overhead due to the coefficient ordering. No actual
unpacking occurs in this function, just indexing and accounting. This
overhead increases as frame size grows and cache coherency falls.
> 4.54% FilterVert
> 4.22% CopyBlock
The interesting things missing from this list are the dequantization and
iDCT functions (which were assembly optimized in the VP3 source, but are
pure C in Theora for portability).
> Those are self-times (not including callees), from a 640x480 29.97fps
> video-only stream. I can also run call-graph time analyses, if this above is
> too moot. If I can read it right, Huffman decoding (UnPackVideo?) is not the
> main problem. Also, perhaps some more aggressive inlining might help here?
> Note that I'm not familiar with the code yet, so maybe I'm talking rubbish.
> If so, please do correct me. If someone can point me to a small and simple
> function that could significantly benefit from peephole asm optimization, or
> rewriting in MMX specific code, or similar, I could assign one or two of our
> asm programmers to it sometime in the near future. (Some of Recon* functions
> look like potential MMX candidates.)
Those functions already have MMX versions in the VP3 source, as do the
PP functions. This is available as the vp32 module in Xiph's CVS.
They were taken out of the theora module because the main goal of a
reference decoder is clarity, not platform-specific optimizations. Clean
and efficient design is a good thing for a reference decoder;
improvements in this area will be accepted gladly. Hand-coded assembly
is not. I'm not discouraging you from pursuing the latter, as clearly
your goals are different than ours. I think someone maintains a patch to
add MMX optimizations to the Vorbis reference decoder, for example. A
similar thing here is not out of line, but I doubt they'd ever be
included in the main CVS module.