[Vorbis-dev] Low level optimization

Thu Feb 10 13:16:26 PST 2005

Tuomo Latto wrote:

> Maybe they should complain to Microsoft (or Borland or ...) for not making
> compilers that would optimize this.  And to Intel too, for adding instructions
> that can't be used more easily (and for asking more money for it).
> They might as well complain to their retailer while they're at it, for
> advertising the benefits of said instructions, yet not telling people that
> getting the performance benefits requires extra effort from developers
> (=coding stuff in asm)...

i disagree.

gcc, microsoft, intel and pathscale compilers will generate SSE/SSE2 fp
code if you ask them to -- directly from C float/doubles.  for example
compiling with gcc -mfpmath=sse -msse2 is sufficient to get scalar
(unvectorized) sse/sse2 code -- which is frequently faster than x87
code.  the intel compiler
will vectorize when it can (and i think the same is true for pathscale).
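
as a rough sketch of what that means (the function name scale_add is made
up for illustration, not taken from any codec): plain C float code like
this needs no source changes at all.  building it with something like
"gcc -O2 -msse2 -mfpmath=sse" should give scalar sse arithmetic for the
loop body instead of x87 code; whether the loop also gets vectorized
depends on the compiler and its version.

```c
#include <stddef.h>

/* hypothetical example: y[i] = a * x[i] + y[i], written in plain C.
 * nothing here is sse-specific; the compiler flags decide whether the
 * arithmetic is done with x87 or with scalar sse instructions. */
void scale_add(float *y, const float *x, float a, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

comparing the output of "gcc -S" with and without the flags is an easy way
to see which instructions were actually generated.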

furthermore, if you study the IA32 ISA
<http://www.intel.com/design/pentium4/manuals/index_new.htm> you'll
see that for every mmx/sse/2/3 instruction intel has defined various
"intrinsics".  these intrinsics are C function calls which access the
specified instruction.  you'll be happy to know that this same set of
intrinsics is supported across most x86 compilers.  that is to say:  you
don't have to write assembly, you need only write to the x86 intrinsics
and it should port across gcc, microsoft, intel and pathscale compilers.

for example:

	__m128 _mm_add_ps(__m128 a, __m128 b);

becomes an addps (packed singles).
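
a minimal sketch of using it (the helper name add4 is made up for
illustration; this assumes an x86 compiler that ships xmmintrin.h and has
sse enabled, e.g. -msse on 32-bit gcc):

```c
#include <xmmintrin.h>  /* sse intrinsics: __m128, _mm_add_ps, ... */

/* hypothetical helper: add two arrays of exactly four floats.
 * the unaligned load/store variants are used so the caller does not
 * have to guarantee 16-byte alignment. */
void add4(float *dst, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);       /* load 4 packed singles */
    __m128 vb = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);   /* four adds in one operation */

    _mm_storeu_ps(dst, sum);           /* write the 4 results back */
}
```

called as add4(out, a, b) with three 4-float arrays, out[i] ends up as
a[i] + b[i].  if the data is known to be 16-byte aligned, _mm_load_ps and
_mm_store_ps can be used instead.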

the support for the intrinsics has had some bugs in earlier revs of gcc, 
but gcc-3.4 seems to do pretty well for me.

one hassle i've found is that i need to conditionalize the #include file 
to get all of the intrinsics:

#ifdef __INTEL_COMPILER
#include <emmintrin.h>
#else
#include <xmmintrin.h>
#endif

gcc includes an emmintrin.h but it's only the mmx intrinsics, you need 
xmmintrin.h to get the sse/2/3 intrinsics.  they're all in one header file 
for icc.  i don't know the msft situation.

(xmmintrin.h is part of the gcc-specific include directory, try "gcc
-v" and it'll show you where its specs file is -- such as
/usr/lib/gcc-lib/i486-linux/3.3.5/specs ... look in
/usr/lib/gcc-lib/i486-linux/3.3.5/include/xmmintrin.h to see the
intrinsics.)

> > Seriously though, using asm would probably reduce portability.
> > GCC (=cygwin, mingw, ..?) uses AT&T syntax.

the x86 assembler of choice for portable assembly is nasm.  it is
open source, well maintained, cross-platform, and generates object
files compatible with microsoft and gcc compilers.  you'll find it
used in various packages requiring portable windows/unix assembly
(e.g. mjpegtools).  there are other methods available as well -- such
as the perl wrapped assembly in openssl.

On Thu, 10 Feb 2005, Aleksey wrote:

> You probably right. Thank you and thanks to all for attention.

no, please don't let them deter you from optimizing the codecs.

while it might be a tiny difference to a high end 4GHz power hungry
processor that's going to burn 100W no matter what you do with it, the
optimizations will pay off on low power devices.  even on the high end
processors the optimizations will pay off when someone needs better than
real-time encoding, such as when importing lots of new music.

i'd suggest giving the intrinsics a try -- my guess is that you might
find some gcc bugs, you'll definitely find gcc inefficiencies, but
in the process of doing this you can submit the examples to the gcc
folks and improve two open source projects at the same time.

if you want an example of the use of intrinsics, see my sse2 sha1/256
code <http://arctic.org/~dean/crypto/sha-sse2-20041218.txt>.

-dean