[speex-dev] [PATCH] Make SSE Run Time option.

Thu Jan 15 18:37:00 PST 2004

Actually, I'm not denying you can do pretty fast complex multiplies by
separating real from imaginary. What I'm saying is that with addsubps,
you can do a better job when the complex numbers are packed than
you can with SSE1 only. I still think AMD did it better with its
pfpnacc instruction and Intel should have gone much further.
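
For illustration, here is a minimal sketch of what a packed complex
multiply built around addsubps might look like. It is not code from this
thread: the function name complex_mul_packed and the interleaved layout
{ re0, im0, re1, im1 } are assumptions, and the target attribute assumes
GCC or Clang on x86. Note that this version still duplicates the real and
imaginary parts of B (movsldup/movshdup) and swaps the halves of A with
one shuffle before the addsubps.

```c
#include <pmmintrin.h>  /* SSE3: _mm_addsub_ps, _mm_moveldup_ps, _mm_movehdup_ps */

/* Hypothetical helper (not from the thread): multiplies two pairs of
 * interleaved single-precision complex numbers, c[k] = a[k] * b[k].
 * Layout of each array: { re0, im0, re1, im1 }. */
#if defined(__GNUC__) && !defined(__SSE3__)
__attribute__((target("sse3")))  /* lets this compile without -msse3 */
#endif
static void complex_mul_packed(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);        /* ar0 ai0 ar1 ai1 */
    __m128 vb = _mm_loadu_ps(b);        /* br0 bi0 br1 bi1 */

    __m128 b_re = _mm_moveldup_ps(vb);  /* br0 br0 br1 br1 */
    __m128 b_im = _mm_movehdup_ps(vb);  /* bi0 bi0 bi1 bi1 */

    /* swap real and imaginary within each complex number of A */
    __m128 a_sw = _mm_shuffle_ps(va, va, _MM_SHUFFLE(2, 3, 0, 1)); /* ai0 ar0 ai1 ar1 */

    __m128 t1 = _mm_mul_ps(va, b_re);   /* ar*br  ai*br  ... */
    __m128 t2 = _mm_mul_ps(a_sw, b_im); /* ai*bi  ar*bi  ... */

    /* addsubps subtracts in even lanes and adds in odd lanes:
     * { ar*br - ai*bi, ai*br + ar*bi, ... } */
    _mm_storeu_ps(c, _mm_addsub_ps(t1, t2));
}
```

Not counting loads and stores, this sketch spends three data-movement
instructions, two multiplies and one addsubps on two complex products.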

On Thu 15/01/2004 at 19:28, Ian Ollmann wrote:
> On Thu, 15 Jan 2004, Ian Ollmann wrote:
> 
> > On Thu, 15 Jan 2004, Jean-Marc Valin wrote:
> >
> > > > Personally, I don't think much of PNI. The complex arithmetic stuff they
> > > > added sets you up for a lot of permute overhead that is inefficient --
> > > > especially on a processor that is already weak on permute. In my opinion,
> > >
> > > Actually, the new instructions make it possible to do complex multiplies
> > > without the need to permute and separate the add and subtract. The
> > > really useful instruction here is the "addsubps".
> >
> > Would you like to prove it with a code sample?
> 
> I suppose if I make such a demand, it would only be sporting for me to
> provide what I believe to be the more efficient competing method that uses
> only SSE/SSE2.  Double precision is shown. For single precision, simply
> replace all "pd" with "ps" and "__m128d" with "__m128".
> 
> 	#include <emmintrin.h>	// SSE2 intrinsics (__m128d, _mm_*_pd)
> 
> 	// For C[] = A[] * B[]
> 	// The real and imaginary parts of A, B and C are stored in
> 	// separate arrays, not interleaved
> 	inline void ComplexMultiply( 	__m128d *Cr, __m128d *Ci,
> 					__m128d Ar, __m128d Ai,
> 					__m128d Br, __m128d Bi )
> 	{
> 		// http://mathworld.wolfram.com/ComplexMultiplication.html
> 		// Cr = Ar * Br - Ai * Bi
> 		// Ci = Ai * Br + Ar * Bi
> 
> 		__m128d real = _mm_mul_pd( Ar, Br );
> 		__m128d imag = _mm_mul_pd( Ai, Br );
> 
> 		Ai = _mm_mul_pd( Ai, Bi );
> 		Ar = _mm_mul_pd( Ar, Bi );
> 
> 		real = _mm_sub_pd( real, Ai );
> 		imag = _mm_add_pd( imag, Ar );
> 
> 		*Cr = real;
> 		*Ci = imag;
> 	}
> 
> No permute is required. The key thing to note is that I do two/four
> complex multiplies at a time in proper SIMD fashion, unlike PNI based
> methods.  Thus, throughput is 3 vector ALU instructions per element, even
> though I do 6 ALU instructions.  (1.5 insns/element for single precision.)
> Stores at the end are merely a formality required by C language
> architectures to return more than one result and will be presumably
> removed when the function is inlined.
> 
> Ian
> 
> ---------------------------------------------------
>    Ian Ollmann, Ph.D.       iano at cco.caltech.edu
> ---------------------------------------------------
> 

-- 
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
