[theora-dev] Proposal for replacing asm code with intrinsics

j at v2v.cc
Tue Oct 13 06:59:26 PDT 2009

Sukhomlinov, Vadim wrote:
> Hi, 
> I'm new to Theora and would like to propose several performance optimizations using advanced instructions in x86 CPUs (SSE2-SSE4.2).
> There are several source files in \x86 and \x86_vc that were developed using inline assembler. However, this causes several maintenance problems:
> 1) Need to sync gcc & msvc versions
> 2) Only 32bit environment is supported
> 3) No support for newer than MMX instruction sets
> My proposal is to replace all functions in assembly with compiler intrinsics, each of which compiles into 1-2 assembly instructions and which are much easier to maintain.
> For example:
> _mm_sad_epu8(__m128i, __m128i) will be compiled into a PSADBW instruction with compiler-allocated registers.
> And code like:
>     psadbw mm4,mm5
>     paddw mm0,mm4
> Can be rewritten as:
> __m64 mm0, mm4, mm5, mm6, mm7; //of course using meaningful names
> mm0 = _mm_add_pi16(mm0, _mm_sad_pu8(mm4, mm5));
> The compiler will replace the variables with actual registers, ensuring better allocation and scheduling of them.
> So, benefits are:
> 1) Easier to read & understand code, which can use the same variable names as the generic version in C
> 2) Single source code for gcc & msvc & the Intel compiler (all of them support the same syntax)
> 3) Easier migration to SSE2 (which can handle 128 bits vs. 64 with MMX) through replacement of __m64 with __m128i
> 4) 64-bit code generation support
> 5) The compiler can reschedule instructions based on the target CPU to deliver better performance w/o manual tuning. In the past I did several tests with high-quality, manually optimized assembly and then replaced it with intrinsics, which resulted in 3-5% better performance when using the Intel compiler. Anyway, I don't expect any performance issues with it.
> It will require some changes in the project structure and makefiles, and I'm not sure if this is ok - at least I don't know how to coordinate work on Theora with other developers. Could you please help me here?
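
For reference, the PSADBW example in the proposal could look roughly like
this when written with SSE2 intrinsics. This is only a sketch, not code from
libtheora: the function names (sad16_c, sad16_sse2, check_sad16), the
16-pixel block width and the unaligned loads are all assumptions made for
illustration, and it assumes an x86 target where <emmintrin.h> is available.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>
#include <stdlib.h>

/* Plain C reference: sum of absolute differences over a block that is
 * 16 pixels wide and `rows` pixels high. (Hypothetical helper.) */
unsigned sad16_c(const uint8_t *src, const uint8_t *ref,
                 int stride, int rows)
{
    unsigned sum = 0;
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < 16; x++)
            sum += (unsigned)abs(src[x] - ref[x]);
        src += stride;
        ref += stride;
    }
    return sum;
}

/* SSE2 version: one PSADBW per 16-pixel row. (Hypothetical helper.) */
unsigned sad16_sse2(const uint8_t *src, const uint8_t *ref,
                    int stride, int rows)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < rows; y++) {
        /* Unaligned loads, since nothing is assumed about alignment. */
        __m128i a = _mm_loadu_si128((const __m128i *)src);
        __m128i b = _mm_loadu_si128((const __m128i *)ref);
        /* PSADBW yields two partial sums, one per 64-bit half. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
        src += stride;
        ref += stride;
    }
    /* Fold the high half onto the low half and extract the total. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (unsigned)_mm_cvtsi128_si32(acc);
}

/* Checks the SSE2 path against the C reference on a 16x16 test block. */
void check_sad16(void)
{
    uint8_t src[16 * 16], ref[16 * 16];
    for (int i = 0; i < 16 * 16; i++) {
        src[i] = (uint8_t)(i * 7);
        ref[i] = (uint8_t)(255 - i * 3);
    }
    assert(sad16_sse2(src, ref, 16, 16) == sad16_c(src, ref, 16, 16));
}
```

Whether a compiler schedules this as well as the hand-written asm is exactly
the open question; the sketch only shows that the same source can serve
gcc, msvc and the Intel compiler.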

Just some notes: the current code works on 64-bit, at least the gcc
version; I am not sure about msvc. There was an attempt to use intrinsics
some time ago, but it was slower than the asm version (with gcc). Do you
think your intrinsic version will be the same speed or faster, or do you
expect it to be slower? gcc would be the important compiler here; if it is
only faster with Intel compilers, that is a regression for the most common
uses. For Linux distributions it would still be required to detect the CPU
at runtime and use the fastest implementation for the current CPU;
compile-time optimization alone is not enough.
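
To illustrate the runtime detection point, here is a minimal sketch of
picking an implementation through a function pointer. It is not how
libtheora does its CPU detection: it assumes a reasonably recent gcc or
clang that provides __builtin_cpu_supports() and the target("sse2")
function attribute, and all function and macro names in it are made up for
the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: the SSE2 path is built only with gcc/clang on x86. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define HAVE_X86_SSE2_PATH 1
#endif

typedef unsigned (*sad16_row_fn)(const uint8_t *src, const uint8_t *ref);

/* Generic C fallback: SAD over one 16-pixel row. */
unsigned sad16_row_c(const uint8_t *src, const uint8_t *ref)
{
    unsigned sum = 0;
    for (int x = 0; x < 16; x++)
        sum += (unsigned)abs(src[x] - ref[x]);
    return sum;
}

#ifdef HAVE_X86_SSE2_PATH
/* SSE2 path; the target attribute lets it be compiled even when the
 * rest of the file is built without -msse2. */
__attribute__((target("sse2")))
unsigned sad16_row_sse2(const uint8_t *src, const uint8_t *ref)
{
    __m128i a = _mm_loadu_si128((const __m128i *)src);
    __m128i b = _mm_loadu_si128((const __m128i *)ref);
    __m128i s = _mm_sad_epu8(a, b);
    /* Add the two 64-bit partial sums together. */
    s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
    return (unsigned)_mm_cvtsi128_si32(s);
}
#endif

/* Returns the fastest implementation the running CPU supports. */
sad16_row_fn select_sad16_row(void)
{
#ifdef HAVE_X86_SSE2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return sad16_row_sse2;
#endif
    return sad16_row_c;
}

/* Checks that whatever was selected agrees with the C fallback. */
void check_dispatch(void)
{
    uint8_t src[16], ref[16];
    for (int i = 0; i < 16; i++) {
        src[i] = (uint8_t)(10 * i);
        ref[i] = (uint8_t)(200 - i);
    }
    assert(select_sad16_row()(src, ref) == sad16_row_c(src, ref));
}
```

A distribution binary built this way carries both versions and chooses once
at startup, so the build flags do not have to match the CPU it ends up
running on.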

