[speex-dev] [PATCH] Make SSE Run Time option.

Thu Jan 15 01:09:55 PST 2004

> We agree on not supporting the non-FP version, however the run time flags 
> need to be settable with a non FP SSE mode so that exceptions are avoided.

I think we should keep the more "official" naming and not AMD's, which
is more confusing. SSE means SSE1: all the SSE instructions (including
the ones using xmm registers). What AMD calls SSE is not SSE at all.
Basically, it's a bunch of "extra instructions" borrowed from SSE that
are part of the extended 3DNow!.

> I thus propose a set of defines like this instead of the ones in our 
> initial patch:
> 
> #define CPU_MODE_NONE     0
> #define CPU_MODE_MMX      1   // Base Intel MMX x86
> #define CPU_MODE_3DNOW    2 // Base AMD 3Dnow extensions
> #define CPU_MODE_SSE      4 // Intel Integer SSE instructions
> #define CPU_MODE_3DNOWEXT 8 // AMD 3Dnow extended instructions
> #define CPU_MODE_SSEFP 16 // SSE FP modes, mainly support for xmm registers
> #define CPU_MODE_SSE2     32 // Intel SSE2 instructions
> #define CPU_MODE_ALTIVEC  64 // PowerPC Altivec support.

If you really want to define stuff like that, you could simply have
NONE
MMX
3DNOW
3DNOWEXT
SSE1
SSE2
ALTIVEC

Even then, MMX is completely useless for Speex IMO and I doubt it's
worth writing 3DNow non-ext code (or even 3DNow! at all). Same for SSE2:
Speex simply doesn't use doubles at all. That's why I think only
defining NONE, SSE and ALTIVEC (maybe 3DNow?) would be enough.
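
As a rough sketch of what that reduced set could look like with a
run-time switch (the names cpu_mode, set_cpu_mode, use_sse and
cpu_mode_name are hypothetical, made up for illustration, and are not
taken from the patch or from Speex):

```c
/* Hypothetical reduced flag set: NONE, SSE (meaning SSE1, including the
 * xmm-register instructions) and ALTIVEC. Values are bit flags so they
 * can be combined or masked at run time. */
enum {
    CPU_MODE_NONE    = 0,
    CPU_MODE_SSE     = 1 << 0,
    CPU_MODE_ALTIVEC = 1 << 1
};

/* Run-time setting; defaults to the plain C code path. */
static int cpu_mode = CPU_MODE_NONE;

/* Set the active mode, e.g. after the application has checked the CPU. */
static void set_cpu_mode(int flags)
{
    cpu_mode = flags;
}

/* Non-zero when the SSE code path should be used. */
static int use_sse(void)
{
    return (cpu_mode & CPU_MODE_SSE) != 0;
}

/* Human-readable name of the highest-priority flag that is set. */
static const char *cpu_mode_name(int flags)
{
    if (flags & CPU_MODE_SSE)
        return "SSE";
    if (flags & CPU_MODE_ALTIVEC)
        return "ALTIVEC";
    return "NONE";
}
```

A caller would do something like set_cpu_mode(CPU_MODE_SSE) once at
start-up and then branch on use_sse() inside the hot functions.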

> We already have it implemented for the inner_prod function. After it is 
> stable and fully tested, we will send you a patch. If you have never done 
> Altivec coding it is quite simple since it is all C Macro's / functions. 
> Not nearly as nasty as inline asm code, although the 16 byte alignment 
> issues can be quite a pain. Our current working code is below:

You can do the same with SSE intrinsics. I just got used to writing
assembly before they were available for gcc. I had a quick look at your
inner_prod implementation. I think that if you really want to make that
fast (there's a big possible gain there), you need to consider the
optimization at a higher level: from open_loop_nbest_pitch. The function
calls inner_prod for a continuous range of offsets. With that in mind,
it would probably be simpler to just take 4 copies (with different
offsets) of one of the vectors and then compute everything with simple,
aligned loads.
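
To make the idea concrete, here is a minimal portable C sketch of the
indexing, under several assumptions: the names MAX_LEN,
make_shifted_copies, corr_range_shifted and inner_prod_ref are
hypothetical, offsets run forward from 0 (the real
open_loop_nbest_pitch indexing may differ), len is a multiple of 4,
and the inner 4-lane loop only stands in for one aligned SIMD
load/multiply/add. Real code would also have to allocate the copies
with 16-byte alignment, which this sketch does not enforce.

```c
#define MAX_LEN 64  /* hypothetical upper bound on the buffer size */

/* Plain reference: one inner product of x with y over len samples. */
static float inner_prod_ref(const float *x, const float *y, int len)
{
    float sum = 0.0f;
    for (int i = 0; i < len; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Build 4 copies of y, copy j shifted by j samples:
 * copies[j][i] == y[i + j]. n is the number of samples in y. */
static void make_shifted_copies(const float *y, int n,
                                float copies[4][MAX_LEN])
{
    for (int j = 0; j < 4; j++)
        for (int i = 0; i + j < n; i++)
            copies[j][i] = y[i + j];
}

/* corr[k] = sum_i x[i] * y[i + k] for k = 0 .. nb_offsets-1.
 * For offset k, copy (k & 3) is read starting at index (k & ~3), which
 * is a multiple of 4 floats from the copy's base, so every 4-float
 * group would be an aligned load if the copies themselves are aligned.
 * y must hold at least len + nb_offsets - 1 samples. */
static void corr_range_shifted(const float *x,
                               float copies[4][MAX_LEN],
                               int len, int nb_offsets, float *corr)
{
    for (int k = 0; k < nb_offsets; k++) {
        const float *p = copies[k & 3] + (k & ~3);
        float acc[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
        for (int i = 0; i < len; i += 4)       /* one "vector" per step */
            for (int l = 0; l < 4; l++)
                acc[l] += x[i + l] * p[i + l];
        corr[k] = acc[0] + acc[1] + acc[2] + acc[3];
    }
}
```

The copies cost four times the memory for one vector, but they are
built once per call and then reused for every offset in the range.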

        Jean-Marc

-- 
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
