[Theora-dev] Theora, MMX and optimisation

Mon Apr 11 21:41:34 PDT 2005

> > What should be
> > the best way to make my PC/Visual Studio/MMX/tweaked theora/ogg version
> > available to others?
> I haven't been following the MMX branch of the reference implementation
> in svn, but I would guess that at a minimum we'd want:
> 1) Confirmation that the output with your "tweaks" is bit-identical to
> the unpatched reference decoder,
Hey, I saw the _V_SELFTEST define, ran the test, and it passes
(after a couple of fail/retry cycles :p )
> 2) Backports of the tweaks to GCC's AT&T-style assembly (as an aside to
> others, is maintaining two versions of all the optimized functions, one
> for each compiler, really a good idea? Would porting to a stand-alone
> assembler like nasm be worth the effort and extra (though optional)
> dependency?)
Nasm would be cool. I don't see my studio switching to GCC anytime
soon, at least not in the next decade. And correct me if I'm wrong, but
there is no way to compile GCC-style inline assembly from within VC?
All studios that make PC games run Visual Studio, and theora is
potentially a bargain for a studio, given Bink or other codec licence
fees.
Apart from that, I had to make my best guess at how to convert the GCC
assembly: the operand order is reversed, and it uses macros and other
weird features you don't get in Visual Studio. Porting my changes back
won't be easy for me. If there is any pro-nasm poll, count me in.
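
On point 1), a direct way to check bit-identical output would be to
decode the same stream with both builds and compare each plane row by
row. A minimal sketch, assuming 8-bit planes with possibly different
strides; first_diff_row is a hypothetical helper of mine, not something
in the theora source:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libtheora.
   Compares two decoded 8-bit planes row by row, looking only at the
   'width' visible bytes of each row so stride padding is ignored.
   Returns the index of the first differing row, or -1 if identical. */
static int first_diff_row(const unsigned char *a, ptrdiff_t stride_a,
                          const unsigned char *b, ptrdiff_t stride_b,
                          int width, int height)
{
    for (int y = 0; y < height; y++) {
        if (memcmp(a + y * stride_a, b + y * stride_b, (size_t)width) != 0)
            return y;
    }
    return -1;
}
```

Run over the Y, Cb and Cr planes of every frame, a return value of -1
everywhere would mean the two decoders agree on that stream.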

> 
> Given those, I'd suggest posting patches to the mailing list. Ideally,
> there'd be separate patches for the VC++ ports of the existing asm and
> for your libogg modifications, as the latter have a much smaller chance
> of actually being integrated into a release.
> 
> > Note: the cpu-consuming IDCT functions seem by their structure perfect
> > candidate for a SSE therapy. Never done before?
> 
> VP3HoSwiYo posted a
> forward-port of vp32's MMX implementation to this mailing list (you can
> search for it with Google). I don't believe it was ever officially
> incorporated into the theora-mmx branch.
Yup, found it (
http://lists.xiph.org/pipermail/theora-dev/2004-August/002242.html ),
unfortunately the link is dead. Since I didn't know the list was so
responsive, I figured this was old news and had to find another way to
reach my goals. So I guess someone still has this patch somewhere.
I'd be pleased to get a copy.

> 
> You also might want to consider looking at the experimental decoder
> (http://svn.xiph.org/experimental/derf/theora-exp/). This is where I've
> been trying to focus future optimization efforts. It now sports some
> (gcc-only) MMX optimizations thanks to Rudolf Marek, though notably not
> for the iDCT or loop filter yet. But, it also has many algorithmic
> optimizations, including a significant reduction in the number of calls
> to oggpack_readB (by reading more than one bit at a time when possible).
> In addition it supports a striped decode mode, which allows you to blit
> decoded data to the display (and do color conversion or what have you)
> as soon as it is available, while it's still in cache. It hasn't yet
> been ported to libogg2 as the libogg2 API is not quite ready for a
> public release yet, but I don't believe such a port would be difficult.
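
To make sure I understand the "more than one bit at a time" idea, here
is how I picture it. This is only a sketch under my own assumptions: a
self-contained MSB-first reader with hypothetical names (bitreader,
br_read1, br_readn), not the libogg API and not the code in theora-exp.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal MSB-first bit reader; not the libogg API. */
typedef struct {
    const uint8_t *buf;    /* input bytes */
    size_t         len;    /* number of bytes in buf */
    size_t         bitpos; /* index of the next unread bit */
} bitreader;

/* One call per bit: the slow pattern. Returns 0 past the end. */
static unsigned br_read1(bitreader *br)
{
    size_t byte = br->bitpos >> 3;
    unsigned bit = 0;
    if (byte < br->len)
        bit = (br->buf[byte] >> (7 - (br->bitpos & 7))) & 1u;
    br->bitpos++;
    return bit;
}

/* One call for n bits (0 <= n <= 24): load a 32-bit window starting at
   the current byte, drop the bits already consumed, keep the top n.
   Bytes past the end of the buffer read as zero. */
static uint32_t br_readn(bitreader *br, unsigned n)
{
    size_t byte = br->bitpos >> 3;
    unsigned shift = (unsigned)(br->bitpos & 7);
    uint32_t window = 0;
    for (unsigned i = 0; i < 4; i++) {
        window <<= 8;
        if (byte + i < br->len)
            window |= br->buf[byte + i];
    }
    br->bitpos += n;
    if (n == 0)
        return 0; /* avoid an undefined 32-bit shift below */
    return (window << shift) >> (32 - n);
}
```

Reading a 5-bit field then costs one br_readn call instead of five
br_read1 calls, each with its own bounds check and position update.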
Right now I don't see the point (for me at least) of getting data
while it's still in cache; I planned to use big buffers to prevent
any lag.
But I feel this experimental version is what I should have started
from. Assembly optimisation can't beat algorithmic optimisation.
I'm having a look at it, and the sample player code I used for my
class is now complaining about a missing theora_state definition. I
had this one before; it seems this struct appeared in the latest
version. Or is it just oc_theora_state renamed?
The lib compiles fine, but I'm going to have to figure out how to
re-inject the asm code, since it obviously doesn't show up in the VC
project.
> 
> I think that covers all your options for further optimization. Obviously
> which direction you go depends on your own schedule constraints and
> project requirements.
> 
Right, I don't know of any schedules that aren't tight. Right now I'm
going to try to dig deeper into this experimental version.
I think I should benchmark this version against mine to see which is
better. What's the point of contributing an inferior solution?

Thanks for this lengthy and very informative answer.

Denpo