[speex-dev] [PATCH] Make SSE Run Time option.
Jean-Marc Valin
Jean-Marc.Valin at USherbrooke.ca
Thu Jan 15 18:37:00 PST 2004
Actually, I'm not denying you can do pretty fast complex multiplies by
separating real from imaginary. What I'm saying is that with addsubps,
you can do a better job when you have the complex numbers packed than
you can do with SSE1 only. I still think AMD got it better with its
pfpnacc instruction and Intel should have gone much further.
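
To make the packed case concrete, here is a rough plain-C model of how
an addsubps-based multiply on interleaved data is usually arranged. It
is only a sketch, not code from this thread: the vec4 type and the
helper names (dup_even, dup_odd, swap_pairs, complex_mul_packed) are
made up for illustration and stand in for movsldup, movshdup, a
pair-swapping shufps, and mulps. Note that this arrangement still
assumes one shuffle of the first operand.

```c
/* Scalar model of one 4-float SSE register (hypothetical type). */
typedef struct { float lane[4]; } vec4;

/* Models mulps: lane-wise multiply. */
static vec4 mul(vec4 x, vec4 y) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = x.lane[i] * y.lane[i];
    return r;
}

/* Models addsubps: subtract in even lanes, add in odd lanes. */
static vec4 addsub(vec4 x, vec4 y) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (i % 2 == 0) ? x.lane[i] - y.lane[i]
                                 : x.lane[i] + y.lane[i];
    return r;
}

/* Models movsldup: [x0, x0, x2, x2]. */
static vec4 dup_even(vec4 x) {
    vec4 r = {{ x.lane[0], x.lane[0], x.lane[2], x.lane[2] }};
    return r;
}

/* Models movshdup: [x1, x1, x3, x3]. */
static vec4 dup_odd(vec4 x) {
    vec4 r = {{ x.lane[1], x.lane[1], x.lane[3], x.lane[3] }};
    return r;
}

/* Models a shufps that swaps re/im within each pair: [x1, x0, x3, x2]. */
static vec4 swap_pairs(vec4 x) {
    vec4 r = {{ x.lane[1], x.lane[0], x.lane[3], x.lane[2] }};
    return r;
}

/* a and b each hold two complex numbers as [re0, im0, re1, im1]. */
static vec4 complex_mul_packed(vec4 a, vec4 b) {
    vec4 t1 = mul(a, dup_even(b));             /* [ar*br, ai*br, ...] */
    vec4 t2 = mul(swap_pairs(a), dup_odd(b));  /* [ai*bi, ar*bi, ...] */
    return addsub(t1, t2);  /* [ar*br - ai*bi, ai*br + ar*bi, ...] */
}
```

With a = [1, 2, 3, 4] and b = [5, 6, 7, 8], i.e. (1+2i)(5+6i) and
(3+4i)(7+8i), complex_mul_packed returns [-7, 16, -11, 52].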
On Thu, 15/01/2004 at 19:28, Ian Ollmann wrote:
> On Thu, 15 Jan 2004, Ian Ollmann wrote:
>
> > On Thu, 15 Jan 2004, Jean-Marc Valin wrote:
> >
> > > > Personally, I don't think much of PNI. The complex arithmetic stuff they
> > > > added sets you up for a lot of permute overhead that is inefficient --
> > > > especially on a processor that is already weak on permute. In my opinion,
> > >
> > > Actually, the new instructions make it possible to do complex multiplies
> > > without the need to permute and separate the add and subtract. The
> > > really useful instruction here is the "addsubps".
> >
> > Would you like to prove it with a code sample?
>
> I suppose if I make such a demand that it would only be sporting if I
> provide what I believe to be the more efficient competing method that uses
> only SSE/SSE2. Double precision is shown. For Single precision simply
> replace all "pd" with "ps" and "__m128d" with "__m128".
>
> //For C[] = A[] * B[]
> //The real and imaginary parts of A, B and C are stored in
> //different arrays, not interleaved
> inline void ComplexMultiply( __m128d *Cr, __m128d *Ci,
> __m128d Ar, __m128d Ai,
> __m128d Br, __m128d Bi )
> {
> // http://mathworld.wolfram.com/ComplexMultiplication.html
> // Cr = Ar * Br - Ai * Bi
> // Ci = Ai * Br + Ar * Bi
>
> __m128d real = _mm_mul_pd( Ar, Br );
> __m128d imag = _mm_mul_pd( Ai, Br );
>
> Ai = _mm_mul_pd( Ai, Bi );
> Ar = _mm_mul_pd( Ar, Bi );
>
> real = _mm_sub_pd( real, Ai );
> imag = _mm_add_pd( imag, Ar );
>
> *Cr = real;
> *Ci = imag;
> }
>
> No permute is required. The key thing to note is that I do two/four
> complex multiplies at a time in proper SIMD fashion, unlike PNI based
> methods. Thus, throughput is 3 vector ALU instructions per element, even
> though I do 6 ALU instructions. (1.5 insns/element for single precision.)
> Stores at the end are merely a formality required by C language
> architectures to return more than one result and will be presumably
> removed when the function is inlined.
>
> Ian
>
> ---------------------------------------------------
> Ian Ollmann, Ph.D. iano at cco.caltech.edu
> ---------------------------------------------------
>
--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada