[theora-dev] Proposal for replacing asm code with intrinsics
vadim.sukhomlinov at intel.com
Tue Oct 13 07:53:54 PDT 2009
Yes, for gcc it's easier to compile the same assembly on x86_64 and x86 because the syntax is similar, and there is even 64-bit-specific code in sse2fdct.c. For MSVC, 64-bit support is missing and the C versions are used instead. Since sse2fdct doesn't exist for 32-bit, I assume some performance analysis was done earlier and the 32-bit version showed no benefit from SSE2 over MMX, possibly because there are too few registers and temporary data had to be stored in memory.
Regarding the performance of gcc-generated code from intrinsics vs. inline assembly: I'm not sure what the root cause could be. Which gcc versions were used for these tests? And which gcc versions matter most for performance? I know there were lots of improvements going from gcc 4.1 to 4.3 to 4.4. I'll try to run test cases to check that. Btw, do you have a reference performance benchmark for Theora that I can use in experiments?
With intrinsics it's still possible to select the implementation at runtime.
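As a rough sketch of what runtime selection could look like with intrinsics: each variant is an ordinary C function, and a function pointer is chosen once after a CPU check. The names sad16_c, sad16_sse2 and select_sad16 are made up for illustration and are not taken from the Theora sources. The CPU check assumes the GCC/Clang builtin __builtin_cpu_supports and the target("sse2") function attribute, both of which need a fairly recent GCC or Clang; MSVC would need a different check (for example its __cpuid intrinsic).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC or Clang on x86. Other compilers fall back to plain C. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h> /* SSE2 intrinsics */
#define SKETCH_X86_GCC 1
#endif

/* Hypothetical kernel signature: SAD over 16 bytes. */
typedef unsigned (*sad16_fn)(const uint8_t *a, const uint8_t *b);

/* Generic C version, always available. */
unsigned sad16_c(const uint8_t *a, const uint8_t *b) {
    unsigned sum = 0;
    for (size_t i = 0; i < 16; i++) {
        int d = a[i] - b[i];
        sum += (unsigned)(d < 0 ? -d : d);
    }
    return sum;
}

#ifdef SKETCH_X86_GCC
/* SSE2 version; the attribute lets it build even without -msse2. */
__attribute__((target("sse2")))
unsigned sad16_sse2(const uint8_t *a, const uint8_t *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PSADBW leaves one partial sum in each 64-bit half. */
    __m128i s = _mm_sad_epu8(va, vb);
    return (unsigned)(_mm_cvtsi128_si32(s) + _mm_extract_epi16(s, 4));
}
#endif

/* Pick the fastest variant the current CPU supports. */
sad16_fn select_sad16(void) {
#ifdef SKETCH_X86_GCC
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return sad16_sse2;
#endif
    return sad16_c;
}
```

A caller would run select_sad16() once at startup, store the result, and call through the pointer afterwards, so a single binary can carry several variants.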
From: j at v2v.cc [mailto:j at v2v.cc]
Sent: Tuesday, October 13, 2009 5:59 PM
To: Sukhomlinov, Vadim
Cc: theora-dev at xiph.org
Subject: Re: [theora-dev] Proposal for replacing asm code with intrinsics
Sukhomlinov, Vadim wrote:
> I'm new to Theora and would like to propose several performance optimizations using advanced instructions in x86 CPUs (SSE2-SSE4.2).
> There are several source files in \x86 and \x86_vc which were developed using inline assembler. However, this causes several maintenance problems:
> 1) Need to sync gcc & msvc versions
> 2) Only 32bit environment is supported
> 3) No support for instruction sets newer than MMX
> My proposal is to replace all assembly functions with compiler intrinsics, each of which compiles into 1-2 assembly instructions and which are much easier to maintain.
> For example:
> _mm_sad_epu8(__m128i, __m128i) will be compiled into a PSADBW instruction with compiler-allocated registers.
> And code like:
> psadbw mm4,mm5
> paddw mm0,mm4
> can be rewritten as
> __m64 mm0, mm4, mm5, mm6, mm7; //of course using meaningful names
> mm0 = _mm_add_pi16(mm0, _mm_sad_pu8(mm4, mm5));
> The compiler will replace the variables with actual registers, ensuring better allocation and scheduling of them.
> So, benefits are:
> 1) Code that is easier to read & understand, and that can use the same variable names as the generic C version
> 2) Single source code for gcc & msvc & intel compiler (all of them support the same syntax)
> 3) Easier migration to SSE2 (which can handle 128 bits vs. 64 with MMX) by replacing __m64 with __m128i
> 4) 64-bit code generation support
> 5) The compiler can reschedule instructions based on the target CPU to deliver better performance w/o manual tuning. In the past I ran several tests where high-quality, manually optimized assembly was replaced with intrinsics, which resulted in 3-5% better performance when using the Intel compiler. Anyway, I don't expect any performance issues with it.
> It will require some changes in the project structure and makefiles, and I'm not sure if this is OK - at least I don't know how to coordinate work on Theora with other developers. Could you please help me here?
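To make the PSADBW example from the proposal concrete, here is a minimal sketch of an 8x8 sum of absolute differences written once in plain C and once with SSE2 intrinsics. The function names and the stride parameter are assumptions for illustration, not Theora's actual interface, and the SSE2 path is only built when the compiler reports SSE2 support at compile time.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2_INTRINSICS 1
#endif

/* Plain C reference: SAD over an 8x8 block of bytes. */
unsigned sad8x8_c(const uint8_t *src, const uint8_t *ref, ptrdiff_t stride) {
    unsigned sum = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int d = src[x] - ref[x];
            sum += (unsigned)(d < 0 ? -d : d);
        }
        src += stride;
        ref += stride;
    }
    return sum;
}

#ifdef HAVE_SSE2_INTRINSICS
/* Same computation with intrinsics; registers are chosen by the compiler. */
unsigned sad8x8_sse2(const uint8_t *src, const uint8_t *ref, ptrdiff_t stride) {
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 8; y += 2) {
        /* Pack two 8-byte rows into one 128-bit value. */
        __m128i s = _mm_unpacklo_epi64(
            _mm_loadl_epi64((const __m128i *)src),
            _mm_loadl_epi64((const __m128i *)(src + stride)));
        __m128i r = _mm_unpacklo_epi64(
            _mm_loadl_epi64((const __m128i *)ref),
            _mm_loadl_epi64((const __m128i *)(ref + stride)));
        /* PSADBW: one partial sum per 64-bit half, accumulated per half. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(s, r));
        src += 2 * stride;
        ref += 2 * stride;
    }
    /* Fold the upper half onto the lower half and extract the total. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    return (unsigned)_mm_cvtsi128_si32(acc);
}
#endif

/* Compile-time choice only; runtime CPU detection is a separate step. */
unsigned sad8x8(const uint8_t *src, const uint8_t *ref, ptrdiff_t stride) {
#ifdef HAVE_SSE2_INTRINSICS
    return sad8x8_sse2(src, ref, stride);
#else
    return sad8x8_c(src, ref, stride);
#endif
}
```

Keeping the C version next to the intrinsic one gives a simple way to check the two against each other on the same input, and the same source should build for both 32-bit and 64-bit targets.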
just some notes: the current code works on 64bit, at least the gcc version,
not sure about msvc. there was an attempt to use intrinsics some time
ago but it was slower compared to the asm version (with gcc). do you
think your intrinsic version will be the same speed or faster, or do you
expect it to be slower? gcc would be the important compiler here; if it's
only faster with Intel compilers, that is a regression for most common uses.
for linux distributions it would still be required to detect the cpu at
runtime and use the fastest implementation for the current cpu; compile
time optimization alone is not enough.