[theora-dev] Proposal for replacing asm code with intrinsics
n.pipenbrinck at cubic.org
Tue Oct 13 11:52:06 PDT 2009
Sukhomlinov, Vadim wrote:
> I'm new to Theora and would like to propose several performance optimization using advanced instructions in x86 CPUs (SSE2-SSE4.2).
> There are several source files in \x86 and \x86_vc which developed using inline assembler. However this cause several maintenance problems:
> 1) Need to sync gcc & msvc versions
> 2) Only 32bit environment is supported
> 3) No support for newer than MMX instruction sets
I ran tests with VS.net and GCC half a year ago, when we encountered a
strange code-generation bug in the assembler code for the win32
release builds of Firefox (does anyone remember?).
As far as I remember, I used GCC 4.2.something for testing.
With intrinsics, the performance drop was about 10 to 15%.
Imho the wins for maintainability alone are worth it. If the code gets
rewritten for SSE I'd expect no performance loss and with a bit of luck
even a tiny performance win due to the wider registers.
Btw, the reasons why the intrinsics were slower than the
hand-written code are:
* The assembler code is hand-scheduled and the loops have been (mostly)
written with modulo scheduling in mind (something GCC can
unfortunately only do in theory).
* For some reason the intrinsics generate sub-optimal code. I've seen
plenty of useless register moves and spills to memory.
* Also it seems like GCC has no idea how to schedule any intrinsics. It
looks like the input and output registers are ignored and GCC just
converts the SSA tree to raw code without moving instructions around.
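For readers who have not seen intrinsics code, here is a minimal sketch of what such a kernel could look like. It is not taken from libtheora: the function names `sad_scalar` and `sad_simd` are made up for illustration, and a sum of absolute differences was picked only because it is a typical codec kernel. The same source should build with both GCC and MSVC, in 32-bit and 64-bit mode, and falls back to plain C when SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>

/* Enable the SSE2 path when the compiler targets it (GCC/Clang or MSVC). */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Plain C reference: sum of absolute differences over n bytes. */
unsigned sad_scalar(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] > b[i] ? (unsigned)(a[i] - b[i]) : (unsigned)(b[i] - a[i]);
    return sum;
}

/* Hypothetical intrinsics version: 16 bytes per iteration, scalar tail. */
unsigned sad_simd(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    unsigned sum = 0;
#ifdef HAVE_SSE2
    __m128i acc = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* psadbw: one partial sum per 64-bit half of the register. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    /* Add the low and the high partial sums together. */
    sum = (unsigned)_mm_cvtsi128_si32(acc)
        + (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#endif
    /* Remaining bytes (or everything, if SSE2 is not available). */
    for (; i < n; i++)
        sum += a[i] > b[i] ? (unsigned)(a[i] - b[i]) : (unsigned)(b[i] - a[i]);
    return sum;
}
```

Register allocation and instruction order are left entirely to the compiler here, which is exactly where the useless moves, spills and missing scheduling described above come from.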
Back when I wrote the assembler code, moving the processing of data
as far away as possible from the memory accesses made the biggest
difference because it masked the cache misses.
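The idea of keeping loads far away from the code that uses them can be written down in intrinsics as a manually software-pipelined loop. The following is only a sketch under assumptions: `avg_pipelined` is a hypothetical name, a rounded byte average stands in for the real work, and `dst` is assumed not to overlap `a` or `b`. The compiler is still free to reorder the statements, so whether the generated code keeps this order has to be checked in the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* dst[i] = (a[i] + b[i] + 1) >> 1; dst must not overlap a or b. */
void avg_pipelined(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE2
    if (n >= 16) {
        /* Prologue: load the first block before entering the loop. */
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        for (; i + 32 <= n; i += 16) {
            /* Issue the loads for the NEXT block first ... */
            __m128i na = _mm_loadu_si128((const __m128i *)(a + i + 16));
            __m128i nb = _mm_loadu_si128((const __m128i *)(b + i + 16));
            /* ... then process the block loaded one iteration earlier. */
            _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
            va = na;
            vb = nb;
        }
        /* Epilogue: process the last block that was loaded. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
        i += 16;
    }
#endif
    /* Scalar tail (or the whole array without SSE2). */
    for (; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

In hand-written assembler the distance between load and use is fixed by the author; here it is only a hint that the compiler may or may not honour.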
I would still prefer SSE intrinsics. That would not only make
maintenance easier but also allow much easier porting to ARM NEON,
PPC AltiVec and MIPS MDMX.
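One way the porting argument could play out is a thin layer of wrappers around the per-platform intrinsics, with the kernel itself written once. The sketch below is an assumption about how that might be organised, not an existing libtheora interface: `vec_u8x16`, `vec_load`, `vec_store`, `vec_adds` and `add_saturate_u8` are hypothetical names, and only SSE2 and NEON are covered.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
/* x86 SSE2 backend */
typedef __m128i vec_u8x16;
static inline vec_u8x16 vec_load(const uint8_t *p) { return _mm_loadu_si128((const __m128i *)p); }
static inline void vec_store(uint8_t *p, vec_u8x16 v) { _mm_storeu_si128((__m128i *)p, v); }
static inline vec_u8x16 vec_adds(vec_u8x16 a, vec_u8x16 b) { return _mm_adds_epu8(a, b); }
#define HAVE_VEC 1
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* ARM NEON backend */
typedef uint8x16_t vec_u8x16;
static inline vec_u8x16 vec_load(const uint8_t *p) { return vld1q_u8(p); }
static inline void vec_store(uint8_t *p, vec_u8x16 v) { vst1q_u8(p, v); }
static inline vec_u8x16 vec_adds(vec_u8x16 a, vec_u8x16 b) { return vqaddq_u8(a, b); }
#define HAVE_VEC 1
#endif

/* Platform-independent kernel: dst[i] = min(a[i] + b[i], 255). */
void add_saturate_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#ifdef HAVE_VEC
    for (; i + 16 <= n; i += 16)
        vec_store(dst + i, vec_adds(vec_load(a + i), vec_load(b + i)));
#endif
    /* Scalar tail, and the fallback for platforms without a backend. */
    for (; i < n; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}
```

Adding another instruction set would then mean adding one more `#elif` branch with three small wrappers, while the loop stays untouched. Operations that have no direct counterpart on the other platform would need more than a one-line wrapper.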