[speex-dev] [PATCH] Make SSE Run Time option.

Ian Ollmann iano at cco.caltech.edu
Thu Jan 15 16:28:15 PST 2004



On Thu, 15 Jan 2004, Ian Ollmann wrote:

> On Thu, 15 Jan 2004, Jean-Marc Valin wrote:
>
> > > Personally, I don't think much of PNI. The complex arithmetic stuff they
> > > added sets you up for a lot of permute overhead that is inefficient --
> > > especially on a processor that is already weak on permute. In my opinion,
> >
> > Actually, the new instructions make it possible to do complex multiplies
> > without the need to permute and separate the add and subtract. The
> > really useful instruction here is the "addsubps".
>
> Would you like to prove it with a code sample?

I suppose that if I make such a demand, it is only sporting that I
provide what I believe to be the more efficient competing method, which
uses only SSE/SSE2.  Double precision is shown. For single precision,
simply replace all "pd" with "ps" and "__m128d" with "__m128".

        #include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_mul_pd, ...

        //For C[] = A[] * B[]
        //The real and imaginary parts of A, B and C are stored in
        //different arrays, not interleaved
        static inline void ComplexMultiply( __m128d *Cr, __m128d *Ci,
                                            __m128d Ar, __m128d Ai,
                                            __m128d Br, __m128d Bi )
        {
                // http://mathworld.wolfram.com/ComplexMultiplication.html
                // Cr = Ar * Br - Ai * Bi
                // Ci = Ai * Br + Ar * Bi

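                __m128d real = _mm_mul_pd( Ar, Br );    // Ar * Br
                __m128d imag = _mm_mul_pd( Ai, Br );    // Ai * Br

                // Ai and Ar are passed by value, so reuse them as temporaries
                Ai = _mm_mul_pd( Ai, Bi );              // Ai * Bi
                Ar = _mm_mul_pd( Ar, Bi );              // Ar * Bi

                real = _mm_sub_pd( real, Ai );          // Ar * Br - Ai * Bi
                imag = _mm_add_pd( imag, Ar );          // Ai * Br + Ar * Bi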

                *Cr = real;
                *Ci = imag;
        }

No permute is required. The key thing to note is that I do two (double
precision) or four (single precision) complex multiplies at a time in
proper SIMD fashion, unlike PNI-based methods.  Thus, throughput is 3
vector ALU instructions per complex element, even though I do 6 ALU
instructions.  (1.5 insns/element for single precision.) The stores at
the end are merely a formality required by C to return more than one
result, and will presumably be removed when the function is inlined.
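
For illustration, here is a minimal sketch of the single-precision variant
described above, plus a loop that applies it to split real/imaginary float
arrays. The names complex_multiply_ps and complex_multiply_arrays are
hypothetical and not from the original post. The sketch assumes the element
count is a multiple of 4 and uses unaligned loads and stores for simplicity.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */

/* Four complex multiplies at once, split (non-interleaved) layout.
 * Hypothetical name; same arithmetic as the double-precision routine. */
static inline void complex_multiply_ps(__m128 *Cr, __m128 *Ci,
                                       __m128 Ar, __m128 Ai,
                                       __m128 Br, __m128 Bi)
{
    __m128 real = _mm_mul_ps(Ar, Br);             /* Ar * Br */
    __m128 imag = _mm_mul_ps(Ai, Br);             /* Ai * Br */

    real = _mm_sub_ps(real, _mm_mul_ps(Ai, Bi));  /* Ar*Br - Ai*Bi */
    imag = _mm_add_ps(imag, _mm_mul_ps(Ar, Bi));  /* Ai*Br + Ar*Bi */

    *Cr = real;
    *Ci = imag;
}

/* c[i] = a[i] * b[i] for n complex values stored as separate real and
 * imaginary arrays. Assumption: n is a multiple of 4; any remainder is
 * left untouched by this sketch. */
static void complex_multiply_arrays(float *cr, float *ci,
                                    const float *ar, const float *ai,
                                    const float *br, const float *bi,
                                    size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 r, im;
        complex_multiply_ps(&r, &im,
                            _mm_loadu_ps(ar + i), _mm_loadu_ps(ai + i),
                            _mm_loadu_ps(br + i), _mm_loadu_ps(bi + i));
        _mm_storeu_ps(cr + i, r);
        _mm_storeu_ps(ci + i, im);
    }
}
```

Each loop iteration issues the same 6 vector ALU instructions as above but
produces four complex products, which is where the 1.5 insns/element
figure for single precision comes from.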

Ian

---------------------------------------------------
   Ian Ollmann, Ph.D.       iano at cco.caltech.edu
---------------------------------------------------
