[Theora-dev] Theora, MMX and optimisation

denpo eyecore at gmail.com
Mon Apr 11 16:48:06 PDT 2005

Hi everyone,
I just landed on planet Theora. As a game programmer, I searched
for a free video format/codec, and Theora was the obvious choice.
However, I experienced rather bad performance (at least from a game
programming point of view).
After a bit of profiling, I discovered, as previously discussed in a
post found via Google, that the bottleneck is in the ogg library. An
insane share of the CPU time was consumed in a single function :
This was even more visible because I'm compiling the MMX branch.

So, starting from this MMX branch, I rewrote the GCC
assembly code as Visual C++ inline assembly. Along the way I tweaked it to
maximize instruction pairing.
Then I rewrote oggpackB_read to reduce the number of tests, put
some assembly in it, and then made specialized versions of it: just like
the existing oggpackB_read1, I now have 8-, 16-, 24- and 32-bit versions.
Then I replaced all the Theora calls that use a fixed number of bytes to
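
To make the idea of a specialized reader concrete, here is a minimal sketch in portable C (no assembly). It is not the code described above and not libogg's implementation: the bitreader struct and the br_read / br_read8 names are hypothetical, invented for illustration only. It assumes bits are packed most-significant-bit first, as the oggpackB_* family does. The generic reader loops bit by bit, while the 8-bit version needs a single bounds check and at most two byte loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader state; NOT libogg's oggpack_buffer. */
typedef struct {
    const uint8_t *data;   /* start of the buffer */
    size_t         size;   /* buffer length in bytes */
    size_t         bitpos; /* read position, in bits from the start */
} bitreader;

/* Generic MSB-first read of n bits (0..32), one bit per iteration.
 * Returns 0 on success, -1 if the read would run past the buffer. */
static int br_read(bitreader *br, int n, uint32_t *out)
{
    uint32_t v = 0;
    int i;

    assert(n >= 0 && n <= 32);
    if (br->bitpos + (size_t)n > br->size * 8)
        return -1;
    for (i = 0; i < n; i++) {
        size_t p = br->bitpos++;
        v = (v << 1) | ((uint32_t)(br->data[p >> 3] >> (7 - (p & 7))) & 1u);
    }
    *out = v;
    return 0;
}

/* Specialized 8-bit read: one bounds check, no loop.
 * Returns 0 on success, -1 if fewer than 8 bits remain. */
static int br_read8(bitreader *br, uint32_t *out)
{
    size_t   byte  = br->bitpos >> 3;
    unsigned shift = (unsigned)(br->bitpos & 7);

    if (br->bitpos + 8 > br->size * 8)
        return -1;
    if (shift == 0) {
        /* Byte aligned: the value is exactly one byte. */
        *out = br->data[byte];
    } else {
        /* Unaligned: the value straddles two bytes. The bounds check
         * above guarantees that data[byte + 1] exists in this case. */
        *out = (((uint32_t)br->data[byte] << shift) |
                ((uint32_t)br->data[byte + 1] >> (8 - shift))) & 0xFFu;
    }
    br->bitpos += 8;
    return 0;
}
```

The 16-, 24- and 32-bit variants would follow the same pattern with more bytes loaded; whether that beats the generic routine in practice has to be measured.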

I haven't yet had time to write a detailed report on the speed gain, but
I'd say that it is noticeable.

I now have a couple of questions to submit to the community. Here they are:

-My hack must be crappy. I saw a couple of byte-related functions in
ogg... Any hints?

-The biggest problem with these functions is that you can't assume
the data to be read is byte aligned. However, I noticed that some
data is always byte aligned. Is that normal, or is it just my
sample video? Being able to assume so would provide a big boost.

-What should I do with my version, beyond shipping the sources with
our released games? I'd love to share my work on Theora, but my
version so far is somewhat broken: I didn't write the Big Endian
counterparts of my functions, and I replaced the assembly language (I read
somewhere this is plain wrong under the Theora guidelines). What would be
the best way to make my PC/Visual Studio/MMX/tweaked theora/ogg version
available to others?
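
As an illustration of the alignment and endianness points in the questions above, here is a small sketch of a fast path that is only taken when the read position is byte aligned. All names (is_byte_aligned, load_be32, try_read32_aligned) are hypothetical and not part of libogg or Theora. Assembling the value from individual bytes gives the same result on little-endian and big-endian hosts, so no separate Big Endian counterpart is needed for this particular routine. When the fast path refuses, the caller is expected to fall back to its generic bit-by-bit reader.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 1 if the bit position sits on a byte boundary. */
static int is_byte_aligned(size_t bitpos)
{
    return (bitpos & 7u) == 0;
}

/* Build a 32-bit value from four bytes, most significant byte first.
 * Uses only shifts, so the result does not depend on host endianness. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Fast path for a 32-bit read. Returns 0 and advances *bitpos by 32
 * when the position is byte aligned and four bytes are available.
 * Returns -1 without touching *bitpos or *out otherwise, so the caller
 * can fall back to a generic reader. */
static int try_read32_aligned(const uint8_t *data, size_t size,
                              size_t *bitpos, uint32_t *out)
{
    size_t byte = *bitpos >> 3;

    if (!is_byte_aligned(*bitpos) || size < 4 || byte > size - 4)
        return -1;
    *out = load_be32(data + byte);
    *bitpos += 32;
    return 0;
}
```

Counting how often the fast path succeeds across several different streams would be one way to check whether the alignment seen in a single sample video holds in general.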

Long live Theora!

Note: the CPU-consuming IDCT functions seem, by their structure, perfect
candidates for SSE therapy. Has that never been done before?
