[speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code

Jean-Marc Valin Jean-Marc.Valin at USherbrooke.ca
Tue Jan 13 23:44:59 PST 2004


> In the Athlon XP 2400+ that we have in our QA lab (Win2000), if you run
> that code it generates an Illegal Instruction Error. In addition, an AMD
> Duron (Windows ME) does the same thing. There are two possible reasons:
> either those processors do not support XMM registers, or the Operating
> System does not support XMM registers. In the morning we will check the
> code on Windows XP. This may be a Windows-specific thing; either way you
> still need to support non-FP versions of the SSE set.

Most likely, you have an OS problem. I have yet to find code that runs
on a Pentium III and doesn't run on an Athlon XP.

> If you read through AMD's processor detection guide (PDF)
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/20734.pdf
> 
> and go to the section that shows the sample code for dealing with CPUID
> support (starts about page 37), it talks about the FEATURE_SSEFP support
> which you have to query for. On the Athlon XP 2400+ that we have here,
> that code does not detect the presence of that when run under Windows. The
> same code on a Pentium 4 detects it just fine.

OK, I have gone through the doc and I think I understand. What they call
plain "SSE" (no FP) is actually the (very) incomplete SSE implementation
they had in the Classic and T-Bird Athlon. What they call SSEFP is
actually what Intel calls "SSE" or "SSE1". Only the Athlon XP (and
newer) CPU implements all of SSE1.
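
To illustrate the distinction, here is a minimal sketch of how the relevant
CPUID feature bits could be decoded. The helper names are hypothetical (they
come from neither Speex nor the AMD guide), and the functions take the EDX
values as arguments rather than executing CPUID themselves, so how you obtain
those values is left to your compiler. The bit positions are the documented
ones: standard function 1 EDX bit 25 for SSE, and AMD extended function
0x80000001 EDX bit 22 for the MMX extensions and bit 31 for 3DNow!.

```c
#include <stdint.h>

/* EDX feature bits. "STD" is CPUID function 1; "EXT" is AMD's
   extended function 0x80000001. */
#define CPUID_STD_EDX_SSE    (1u << 25)  /* full SSE ("SSEFP" in AMD's doc) */
#define CPUID_EXT_EDX_MMXEXT (1u << 22)  /* AMD MMX extensions (integer SSE subset) */
#define CPUID_EXT_EDX_3DNOW  (1u << 31)  /* 3DNow! */

/* Hypothetical helper: nonzero if the CPU reports the complete SSE set.
   This says nothing about OS support for saving the XMM registers. */
int cpu_has_full_sse(uint32_t std_edx)
{
    return (std_edx & CPUID_STD_EDX_SSE) != 0;
}

/* Hypothetical helper: nonzero if at least the partial (non-FP) subset
   is present, as on the pre-XP Athlons. */
int cpu_has_partial_sse(uint32_t std_edx, uint32_t ext_edx)
{
    return cpu_has_full_sse(std_edx) || (ext_edx & CPUID_EXT_EDX_MMXEXT) != 0;
}

/* Hypothetical helper: nonzero if 3DNow! is reported. */
int cpu_has_3dnow(uint32_t ext_edx)
{
    return (ext_edx & CPUID_EXT_EDX_3DNOW) != 0;
}
```

Even when cpu_has_full_sse() returns nonzero, the OS still has to enable
saving of the XMM registers before SSE floating-point code can run.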

Now about supporting what you call the "non FP version" (AMD's
incomplete implementation), I say it's not worth it. There's no gain
because all this provides is prefetch functions which are going to be
useless for Speex because everything fits in the L1 anyway. Now if you
really want to do something about AMD processors (mainly pre-XP
Athlons), a 3DNow! implementation would give you a great speedup
(probably even better than SSE).

> Here is an article which describes the K8 (Opteron and Athlon 64) as
> including the XMM registers:
> http://sysopt.earthweb.com/articles/k8/index2.html . All the stuff I could
> google seems to indicate that XMM register support is not included in the
> current Athlon XP series or below.

Believe me, it is. I can even tell you that the floating point SSE
implementation in the Athlon XP is faster than that of the Pentium III. 

> With any machine you are not guaranteed to get support for the XMM 
> registers (the 128 bit wide ones), since the OS has to support it as well.

True. With Linux, you need at least 2.4. With NT you need a service
pack; I don't know about Win2k and XP.

> Have you or anybody else successfully run the current SSE code on an Athlon
> XP system?

I have, many times.

> Agreed, although the inner_prod isn't that big a deal since you can do
> clever vector swaps in Altivec to reduce the amount of shuffling needed. In
> our current Altivec version we have four blocks, dealing with when certain
> things are aligned and certain things aren't. It's ugly to read, but works
> quite nicely.

Do you already have that implemented? I know it's possible, but the code
will likely be really ugly.

> For the alignment part, my feeling is that the compiler-generated way is
> better than a run-time cast. The compiler-native code, while not
> cross-platform, should generate much faster code since you don't have to
> perform the cast at run-time, which is what your ALIGN macros appear to be
> doing in stack-alloc.h.

It's not really a "run-time cast" (at least not like C++ casts). The
compiler will just generate an "add" and an "and" and that's all.
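
As a rough sketch of what that looks like (a hypothetical helper written for
illustration, not the actual ALIGN macro from stack-alloc.h), rounding an
address up to a power-of-two boundary takes one addition and one bitwise AND.
The name align_up and the use of uintptr_t are assumptions made here.

```c
#include <stdint.h>

/* Hypothetical helper: round addr up to the next multiple of align.
   align must be a power of two, so (align - 1) is a mask of the low bits. */
uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    /* One "add" to step past the boundary, one "and" to clear the low bits. */
    return (addr + (align - 1)) & ~(align - 1);
}
```

For 16-byte SSE alignment one would call align_up(addr, 16); an address that
is already aligned comes back unchanged.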

> One other thing we noticed is that you tend to do a lot of for-loop-based
> copies:
...
> Do you not like to use memcpy or memset? Or am I missing something like 
> overlapping memory spaces?

I just felt it wasn't worth it. I've been trying to minimize all
dependencies, including on libc. You'll see that the only file that uses
libc is misc.c so it's pretty easy for someone to even remove that
dependency. Using memcpy/memset in this context would create more
trouble than it would solve. Believe me, the copies change nothing in
terms of CPU time anyway.
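
For illustration only, this is the kind of libc-free copy being discussed. The
function name and signature are hypothetical and not taken from the Speex
sources; like memcpy, it assumes the two buffers do not overlap.

```c
/* Hypothetical helper: copy len floats from src to dst without calling
   into libc. The buffers are assumed not to overlap. */
void copy_floats(float *dst, const float *src, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = src[i];
}
```

For the short vectors involved here, a loop like this avoids the header
dependency at negligible cost.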

        Jean-Marc


-- 
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada

