[theora-dev] MMX and extended-MMX acceleration patch for encoding

Thu May 8 16:33:57 PDT 2003

Hello,

attached is a gzipped patch file to the lib/mcomp.c source file of theora
(as of AnonCVS current version) that implements MMX and extended-MMX
optimizations in the most frequently used functions of the encoder (as
shown by gprof).

This is more a proof of concept than a real request for inclusion into the
source tree. My personal motivation was more to look deeper into the MMX
instruction set and/or GCC and/or Theora than any real need for performance
improvements. :-) Plus the fact that, apparently, I still have
difficulties with the mathematics of video compression and could not
do more than grunt work on this kind of code... :-))
 Of course, some of you may find this interesting, so I just want to share
the results.

I have introduced 4-5 new (inline) functions that correspond to the core
compute-intensive operations of the encoder (as found experimentally with
gprof). These are in fact wrappers that allow switching between
implementation variants, and I have implemented several variants: C, MMX
assembly and MMXEXT assembly (something like SSE maybe, recent extensions
apparently, found in PIII and Athlon).
 The preprocessor macros HAVE_MMX and HAVE_MMXEXT select at compile time
which code actually gets used. So, use CFLAGS="-DHAVE_MMX" to
get the MMX implementation, and CFLAGS="-DHAVE_MMX -DHAVE_MMXEXT" to get
the MMXEXT implementation (which uses both MMX and extMMX instructions).
 The wrappers also allow back-to-back testing of the C and assembly
implementations (very useful for validation). Use -DTEST_MMX to compile
this checking code in (in this case, both variants are called each time, so
do not expect performance improvements when doing double work... :-).
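
To illustrate the wrapper idea, here is a minimal sketch, not the code from
the patch. The names sad8_c, sad8_alt and sad8 are hypothetical, and the
"assembly" variant is a plain C stand-in so that the sketch builds on any
machine; in the patch that slot would hold an __asm__() block. Only the
HAVE_MMX and TEST_MMX macros come from the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names: sad8_c, sad8_alt and sad8 are illustrative only. */

/* Reference C variant: sum of absolute differences over 8 pixels. */
static unsigned sad8_c(const unsigned char *a, const unsigned char *b)
{
    unsigned sum = 0;
    int i;
    for (i = 0; i < 8; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

#if defined(HAVE_MMX) || defined(TEST_MMX)
/* Stand-in for the assembly variant (plain C here, by assumption). */
static unsigned sad8_alt(const unsigned char *a, const unsigned char *b)
{
    unsigned sum = 0;
    int i;
    for (i = 0; i < 8; i++)
        sum += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                             : (unsigned)(b[i] - a[i]);
    return sum;
}
#endif

/* Wrapper: the only entry point the rest of the encoder would call. */
static inline unsigned sad8(const unsigned char *a, const unsigned char *b)
{
#if defined(TEST_MMX)
    /* Back-to-back mode: run both variants and compare the results. */
    unsigned ref = sad8_c(a, b);
    unsigned alt = sad8_alt(a, b);
    assert(ref == alt);
    return ref;
#elif defined(HAVE_MMX)
    return sad8_alt(a, b);
#else
    return sad8_c(a, b);
#endif
}
```

Because the selection happens in the preprocessor, callers never change:
building with no flag, with -DHAVE_MMX or with -DTEST_MMX only changes what
sad8() expands to.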

I have observed between 10% and 30% improvements in encoding speed using
these assembly implementations. The MMXEXT implementations offer the most
impressive improvements (on a PIII or Athlon CPU some functions, like the
sum of absolute differences, can be done via a single extMMX instruction),
but plain MMX shows improvements too.
 Globally, I'd say that one could expect a 15% improvement, but this should
be assessed with longer testing and different testsets. For testing, I
have only used the test files published on the theora.org web site some
time ago. I include at the end of this mail some time measurements on
various computers, using this testset.
 I do not know the impact of these modifications on the player.

Note too that these implementations should probably also be validated with
respect to accuracy. I have tried very hard not to introduce any
arithmetic error when using assembly but, in some cases, C-based and
MMX-based results differ (e.g. integer average of two values via the MMX
instruction set adds a 1 to the intermediate result before division by 2),
so encoding the same data via C or MMX functions does not produce the same
ogg file. But both files seem correct visually, at least from what I saw,
and they do not differ significantly in size.
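
The averaging difference can be shown in a few lines of C. This is only a
sketch: avg_trunc and avg_round are hypothetical names, and it assumes that
the C code path truncates while the packed-average instruction rounds up,
as described above.

```c
/* Truncating average, as plain C integer arithmetic computes it. */
static unsigned char avg_trunc(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b) >> 1);
}

/* Rounding average: 1 is added to the intermediate sum before the
   division by 2 (assumed to mirror the assembly behaviour). */
static unsigned char avg_round(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b + 1) >> 1);
}
```

The two functions agree whenever a + b is even and differ by exactly 1 when
a + b is odd, e.g. for the pair (3, 4) the first gives 3 and the second 4.
That would be enough to change the encoded bitstream without visibly
changing the picture.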

Maybe others on the list would like to run these optimizations on bigger
testsets and compare the C-based and assembly-based variants with more
quantitative techniques, to finally assess the performance improvement and
validate my code.
 All in all, it seems to me that it was worth the effort (and also that
one should not make this kind of effort too often), but feel free to
disagree.

Final note: I used __asm__() GCC assembly directives, so the code should
compile easily with many versions of GCC (I used 2.95 I think, the default
one on Debian 3.0 in fact). [Btw, note that you need to use some level of
optimization for GCC (I used the default ones) so that the inline functions
I added actually get inlined and you do not get a penalty...] Recently GCC
3.2 introduced new compiler builtins for MMX and vector operations. I have
not used them because GCC 3.2 is very recent (and I do not have it
installed yet), but I looked at them and I think my implementation should
be easy to translate to use the builtins instead of inline assembly, in 6
or 12 months (and then, maybe GCC will give us better loop unrolling,
register allocation and additional perf. or simpler code. Maybe...).
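
As a rough idea of what such a translation could look like, here is a
sketch of an 8-pixel sum of absolute differences written with the intrinsics
from GCC's <xmmintrin.h> header instead of inline assembly. It is not taken
from the patch: sad8_intrin is a hypothetical name, and the portable
fallback is only there so that the sketch also builds on non-x86 targets.

```c
#include <string.h>

#if defined(__SSE__) && defined(__MMX__)
#include <xmmintrin.h>  /* _mm_sad_pu8; also pulls in the MMX intrinsics */

/* Hypothetical helper: SAD over 8 pixels via a single packed instruction. */
static unsigned sad8_intrin(const unsigned char *a, const unsigned char *b)
{
    __m64 va, vb, vs;
    unsigned sum;

    memcpy(&va, a, sizeof va);   /* unaligned-safe 8-byte loads */
    memcpy(&vb, b, sizeof vb);
    vs = _mm_sad_pu8(va, vb);    /* sum lands in the low 16 bits */
    sum = (unsigned)_mm_cvtsi64_si32(vs);
    _mm_empty();                 /* leave MMX state so x87 code can run */
    return sum;
}
#else
/* Portable fallback with the same result, used when SSE/MMX is absent. */
static unsigned sad8_intrin(const unsigned char *a, const unsigned char *b)
{
    unsigned sum = 0;
    int i;
    for (i = 0; i < 8; i++)
        sum += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                             : (unsigned)(b[i] - a[i]);
    return sum;
}
#endif
```

With this style the compiler, not the programmer, picks the registers and
can schedule the loads, which is the kind of benefit hoped for above.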

Do not hesitate to react and give impressions, see you,

Rodolphe

Some results (for the test file published on Theora site):
============
* Normal quality
Athlon XP 2200+: MMX-ext optimization
real    0m2.483s
user    0m2.450s
sys     0m0.030s

Athlon XP 2200+: MMX optimization
real    0m3.075s
user    0m3.020s
sys     0m0.050s

Athlon XP 2200+: No optimization
real    0m3.524s
user    0m3.490s
sys     0m0.040s

* High quality (-v 9)
Athlon XP 2200+: MMX-ext optimization
real    0m3.155s
user    0m3.090s
sys     0m0.070s

Athlon XP 2200+: MMX optimization
real    0m4.316s
user    0m4.260s
sys     0m0.050s

Athlon XP 2200+: No optimization
real    0m5.131s
user    0m5.080s
sys     0m0.060s

=======================================
* Normal quality (no opt)
K6-3 450MHz: MMX optimization
real    0m13.880s
user    0m13.590s
sys     0m0.210s

K6-3 450MHz: No opt
real    0m17.418s
user    0m16.850s
sys     0m0.240s

* High quality (-v 9)
K6-3 450MHz: MMX optimization
real    0m17.810s
user    0m17.360s
sys     0m0.270s

K6-3 450MHz: No opt
real    0m23.945s
user    0m23.510s
sys     0m0.240s

* Highest quality (-v 10)
K6-3 450MHz: MMX optimization
real    0m18.100s
user    0m17.850s
sys     0m0.130s

K6-3 450MHz: No opt
real    0m24.082s
user    0m23.590s
sys     0m0.230s
=================================
* Normal quality (no opt)
PIII 800MHz: MMX-ext optimization
real    0m7.741s
user    0m6.410s
sys     0m0.120s

PIII 800MHz: MMX optimization
real    0m8.618s
user    0m7.230s
sys     0m0.100s

PIII 800MHz: No opt
real    0m9.645s
user    0m8.020s
sys     0m0.150s

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mmx-mcomp.patch.gz
Type: application/octet-stream
Size: 5111 bytes
Desc: GZIPed patch file
Url : http://lists.xiph.org/pipermail/theora-dev/attachments/20030509/01908450/mmx-mcomp.patch.obj