[Theora-dev] Theora, MMX and optimisation

Mon Apr 11 19:08:16 PDT 2005

denpo wrote:
> After a couple of profiling runs, I discovered, as previously discussed
> in a post found via Google, that the bottleneck is in the ogg library. An

Switching to libogg2 could also help with this, as it does not need to
copy any of the packet data around like libogg1 does. The reference
implementation contains compiler switches to use libogg2 (though I think
using them would necessitate using Tremor instead of libvorbis for
Vorbis decoding, since I don't think libvorbis has been ported to
libogg2 yet).

> -My hack must be crappy, I saw a couple of byte-related functions in
> ogg... Any hints?
The only code that currently takes any advantage of byte-alignment is the
writecopy functions, and they're not even used in the reference library.

> -The biggest problem with these functions is that you can't assume
> the data to be read is byte aligned. However, I noticed that some
> data is always byte aligned. Is that normal, or is it just my
> sample video? Being able to assume so would provide a big boost.
The values in the header packets are byte-aligned for the most part, but
those need be read only once. The values in the data packets are not
normally byte-aligned to any significant degree.

> -What should I do with my version, beyond putting the sources with
> our released games? I'd love to share my work on theora, but my
> version so far is somewhat broken: I didn't write the Big Endian
> counterpart of my functions, I replaced the assembly language (I read
> somewhere this is plain wrong with theora guidelines). What would be
> the best way to make my PC/Visual Studio/MMX/tweaked theora/ogg version
> available to others?

oggpack_readB and friends _are_ the big endian versions... the others
are little endian.
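
To make the distinction concrete, here is a minimal sketch of the two bit
orders. It is not libogg's code: the bitreader struct and both function
names are made up for illustration, and reads past the end of the buffer
simply return zero bits. The LSB-first reader corresponds roughly to the
convention of oggpack_read, and the MSB-first reader to that of
oggpack_readB, which is the one Theora uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal bit reader; NOT libogg's oggpack_buffer. */
typedef struct {
    const uint8_t *data;   /* packet bytes */
    size_t         nbytes; /* number of bytes in data */
    size_t         bitpos; /* bits consumed so far */
} bitreader;

/* LSB-first ("little endian") packing: the first bit read is bit 0 of
   byte 0, and it becomes the least significant bit of the result. */
static uint32_t read_lsb_first(bitreader *br, int nbits) {
    uint32_t v = 0;
    assert(nbits >= 0 && nbits <= 32);
    for (int i = 0; i < nbits; i++) {
        size_t   byte = br->bitpos >> 3;
        int      bit  = (int)(br->bitpos & 7);
        uint32_t b    = byte < br->nbytes ? (br->data[byte] >> bit) & 1u : 0u;
        v |= b << i;
        br->bitpos++;
    }
    return v;
}

/* MSB-first ("big endian") packing: the first bit read is bit 7 of
   byte 0, and it becomes the most significant bit of the result. */
static uint32_t read_msb_first(bitreader *br, int nbits) {
    uint32_t v = 0;
    assert(nbits >= 0 && nbits <= 32);
    for (int i = 0; i < nbits; i++) {
        size_t   byte = br->bitpos >> 3;
        int      bit  = (int)(br->bitpos & 7);
        uint32_t b    = byte < br->nbytes ? (br->data[byte] >> (7 - bit)) & 1u : 0u;
        v = (v << 1) | b;
        br->bitpos++;
    }
    return v;
}
```

Reading 4 bits from the byte 0xA5 gives 0x5 with the LSB-first reader and
0xA with the MSB-first one, so the two families are not interchangeable
on the same bitstream.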

I haven't been following the MMX branch of the reference implementation
in svn, but I would guess that at a minimum we'd want:
1) Confirmation that the output with your "tweaks" is bit-identical to
the unpatched reference decoder,
2) Backports of the tweaks to GCC's AT&T-style assembly (as an aside to
others, is maintaining two versions of all the optimized functions, one
for each compiler, really a good idea? Would porting to a stand-alone
assembler like nasm be worth the effort and extra (though optional)
dependency?)
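
For point 1, one possible way to check is to decode the same stream with
both builds and compare every plane of every frame byte for byte. The
helper below is only a sketch under that assumption: the function name
and its parameters are hypothetical, and it compares row by row because
the two decoders may use different strides (row padding) for the same
visible width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: compare one image plane from two decoders.
   Returns -1 if all rows match, otherwise the index of the first row
   that differs. Only the first `width` bytes of each row are compared,
   so padding bytes beyond the visible width are ignored. */
static long first_mismatched_row(const uint8_t *a, size_t a_stride,
                                 const uint8_t *b, size_t b_stride,
                                 size_t width, size_t height) {
    assert(a_stride >= width && b_stride >= width);
    for (size_t y = 0; y < height; y++) {
        if (memcmp(a + y * a_stride, b + y * b_stride, width) != 0) {
            return (long)y;
        }
    }
    return -1;
}
```

Calling this on the Y, Cb and Cr planes of each decoded frame, and
stopping at the first result other than -1, would pin down the first
frame and row where the tweaked decoder diverges.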

Given those, I'd suggest posting patches to the mailing list. Ideally,
there'd be separate patches for the VC++ ports of the existing asm and
for your libogg modifications, as the latter have a much smaller chance
of actually being integrated into a release.

> Note: the cpu-consuming IDCT functions seem by their structure perfect
> candidates for SSE therapy. Never done before?

I believe the vp32 sources contain an SSE2 implementation, but this has
not been forward-ported to Theora, to my knowledge. VP3HoSwiYo posted a
forward-port of vp32's MMX implementation to this mailing list (you can
search for it with Google). I don't believe it was ever officially
incorporated into the theora-mmx branch.

You also might want to consider looking at the experimental decoder
(http://svn.xiph.org/experimental/derf/theora-exp/). This is where I've
been trying to focus future optimization efforts. It now sports some
(gcc-only) MMX optimizations thanks to Rudolf Marek, though notably not
for the iDCT or loop filter yet. But, it also has many algorithmic
optimizations, including a significant reduction in the number of calls
to oggpack_readB (by reading more than one bit at a time when possible).
In addition it supports a striped decode mode, which allows you to blit
decoded data to the display (and do color conversion or what have you)
as soon as it is available, while it's still in cache. It hasn't yet
been ported to libogg2 as the libogg2 API is not quite ready for a
public release yet, but I don't believe such a port would be difficult.
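
To illustrate the "more than one bit at a time" idea, here is a rough
sketch of an MSB-first reader that keeps unread bits in a 64-bit window
and hands back up to 32 of them per call. This is not the code in the
experimental decoder: the bitwindow struct and the bw_* names are
invented for this example, and reads past the end of the packet are
assumed to return zero bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical windowed bit reader; not taken from any Xiph library. */
typedef struct {
    const uint8_t *ptr;    /* next unread byte */
    const uint8_t *end;    /* one past the last byte */
    uint64_t       window; /* unread bits, left-aligned (bit 63 is next) */
    int            avail;  /* number of valid bits in window */
} bitwindow;

static void bw_init(bitwindow *bw, const uint8_t *data, size_t nbytes) {
    bw->ptr    = data;
    bw->end    = data + nbytes;
    bw->window = 0;
    bw->avail  = 0;
}

/* Top the window up a byte at a time; past the end we feed in zeros. */
static void bw_refill(bitwindow *bw) {
    while (bw->avail <= 56) {
        uint64_t byte = bw->ptr < bw->end ? *bw->ptr++ : 0;
        bw->window |= byte << (56 - bw->avail);
        bw->avail  += 8;
    }
}

/* Read nbits (1..32) in one call instead of nbits one-bit calls. */
static uint32_t bw_read(bitwindow *bw, int nbits) {
    uint32_t v;
    assert(nbits >= 1 && nbits <= 32);
    if (bw->avail < nbits) {
        bw_refill(bw);
    }
    v = (uint32_t)(bw->window >> (64 - nbits));
    bw->window <<= nbits;
    bw->avail   -= nbits;
    return v;
}
```

A 12-bit field then costs one call and a couple of shifts rather than
twelve calls, and the per-byte bookkeeping happens only in the refill.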

I think that covers all your options for further optimization. Obviously
which direction you go depends on your own schedule constraints and
project requirements.