[opus] ASM runtime detection and optimizations

Timothy B. Terriberry tterribe at xiph.org
Thu May 23 11:34:49 PDT 2013


Aurélien Zanelli wrote:
> I wrote a proof of concept regarding the cpu capabilities runtime
> detection and choice of optimized function. I follow design which had
> been discussed on IRC.

This is a good start. Review comments inline marked with [TBT].

> Also, i notice a little drawback: we must propagate the arch index
> through functions which don't have codec state as argument.

Yeah, that's pretty much unavoidable. I don't think the overhead will be 
high, though.



-            celt_fir(exc+MAX_PERIOD-exc_length, lpc+c*LPC_ORDER,
+            celt_fir[st->arch&OPUS_ARCHMASK](exc+MAX_PERIOD-exc_length, 
lpc+c*LPC_ORDER,
                    exc+MAX_PERIOD-exc_length, exc_length, LPC_ORDER, 
lpc_mem);

[TBT] I think this should be hidden in a macro, e.g.,

In celt_lpc.h:

/*Do we have run-time CPU detection?*/
#if OPUS_HAVE_RTCD

extern void (*const CELT_FIR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
  const opus_val16 *,opus_val16 *,int, int, opus_val16 *);

# define celt_fir(x, num, y, N, ord, mem, arch) \
  ((*CELT_FIR_IMPL[(arch)&OPUS_ARCHMASK])(x, num, y, N, ord, mem))

#else

# define celt_fir(x, num, y, N, ord, mem, arch) \
  (celt_fir_c(x, num, y, N, ord, mem))

#endif

Then, we can call celt_fir() like a normal function with an additional 
arch parameter.

That lets us hide the NEON-specific stuff in celt_lpc_neon.h, and 
override the define to hard-code use of the NEON version if we know 
we're compiling for a specific CPU, etc.


--- a/celt/celt_lpc.c
+++ b/celt/celt_lpc.c
@@ -32,9 +32,21 @@
  #include "celt_lpc.h"
  #include "stack_alloc.h"
  #include "mathops.h"
+#include "cpu_support.h"

-#ifdef ARM_HAVE_NEON
+#ifdef ARM_ASM
  #include "celt_lpc_neon.h"

[TBT] I think this include should happen in celt_lpc.h.

[TBT] I think the definition of this array should be in, e.g., a 
celt_lpc_arm.c under celt/arm.

+void (* const celt_fir[OPUS_ARCHMASK+1])(const opus_val16 *, const 
opus_val16 *,
+		opus_val16 *, int, int, opus_val16 *) = {
+  celt_fir_c,    //C

[TBT] We're assuming at least ARMv4 if we have any ARM asm enabled at 
all, so you don't need a C version. That gets us down to 4 variants on 
ARM, so there's no wasted array entries.

+  celt_fir_c,    //ARMV4
+  celt_fir_c,    //ARMv5E
+  celt_fir_c,    //ARMv6
+  celt_fir_neon  //NEON
+};
+#else

+void (* const celt_fir[OPUS_ARCHMASK+1])(const opus_val16 *, const 
opus_val16 *,
+		opus_val16 *, int, int, opus_val16 *) = {celt_fir_c};

[TBT] If we don't have ARM asm, we shouldn't be using RTCD, so (using 
the #ifdefs I proposed above), we wouldn't need this.

  #endif

  void _celt_lpc(
@@ -91,8 +103,7 @@ int          p
  #endif
  }

-#ifndef OVERRIDE_CELT_FIR
-void celt_fir(const opus_val16 *x,
+void celt_fir_c(const opus_val16 *x,
           const opus_val16 *num,
           opus_val16 *y,
           int N,
@@ -116,7 +127,6 @@ void celt_fir(const opus_val16 *x,
        y[i] = ROUND16(sum, SIG_SHIFT);
     }
  }
-#endif

[TBT] BTW, I think it would be better to rebase these patches instead of 
undoing a bunch of things you did in patch 02.


[TBT] celt_lpc_neon.* should go under celt/arm now.

diff --git a/celt/celt_lpc_neon.h b/celt/celt_lpc_neon.h
index e9f76c6..029ae7b 100644
--- a/celt/celt_lpc_neon.h
+++ b/celt/celt_lpc_neon.h


[TBT] See above.

-   celt_fir(x_lp, lpc, x_lp, len>>1, 4, mem);
+   celt_fir[arch&OPUS_ARCHMASK](x_lp, lpc, x_lp, len>>1, 4, mem);

     mem[0]=0;
     lpc[0]=QCONST16(.8f,12);
-   celt_fir(x_lp, lpc, x_lp, len>>1, 1, mem);
+   celt_fir[arch&OPUS_ARCHMASK](x_lp, lpc, x_lp, len>>1, 1, mem);

  }


diff --git a/configure.ac b/configure.ac
index ee6df9a..6b9612f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -167,6 +167,7 @@ if test "x${ac_enable_asm}" = xyes ; then
          AS_GCC_INLINE_ASSEMBLY([asm_optimization="ARM"],
              [asm_optimization="disabled"])
          if test "x${asm_optimization}" = "xARM" ; then
+            AC_DEFINE([ARM_ASM], 1, [Use generic ARM asm optimizations])

[TBT] I don't we should distinguish between "generic ARM" and ARMv4, 
because we don't have any plans to support earlier ARM versions. If you 
want to rename ARMv4_ASM to ARM_ASM, that's okay, but my opinion was 
that calling it ARMv4_ASM and not having an ARM_ASM made that assumption 
unambiguous. Earlier devices (if any) can just disable ARM asm entirely.

              AC_DEFINE([ARMv4_ASM], 1, [Use generic ARMv4 asm 
optimizations])
              AS_ASM_ARM_EDSP([ARMv5E_ASM=1],[ARMv5E_ASM=0])
              if test "x${ARMv5E_ASM}" = "x1" ; then


diff --git a/src/armcpu.c b/src/armcpu.c
new file mode 100644
index 0000000..10a2905
--- /dev/null
+++ b/src/armcpu.c
@@ -0,0 +1,160 @@
+/* Copyright (c) 2010 Xiph.Org Foundation
+ * Copyright (c) 2013 Parrot */
+/*
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions
+   are met:
+
+   - Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+   - Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 
OWNER
+   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+/* Original code from libtheora modified to suit to Opus */
+
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif
+
+#include "armcpu.h"
+
+#if !defined(ARM_ASM) || \
+    !defined(ARMv5E_ASM) && !defined(ARMv6_ASM) && \
+    !defined(ARM_HAVE_NEON)

[TBT] Please don't abbreviate "capabilities".

+opus_uint32 opus_cpu_capa(void)
+{
+  return 0;
+}
+#elif defined(_MSC_VER)
+/*For GetExceptionCode() and EXCEPTION_ILLEGAL_INSTRUCTION.*/
+# define WIN32_LEAN_AND_MEAN
+# define WIN32_EXTRA_LEAN
+# include <windows.h>
+
+opus_uint32 opus_cpu_capa(void){
+  opus_uint32 flags;
+  flags=0;
+  /*MSVC has no inline __asm support for ARM, but it does let you __emit
+   *      instructions via their assembled hex code.
+   *          All of these instructions should be essentially nops.*/
+# if defined(ARMv5E_ASM)
+  __try{
+    /*PLD [r13]*/
+    __emit(0xF5DDF000);
+    flags|=OPUS_CPU_ARM_EDSP;
+  }
+  __except(GetExceptionCode()==EXCEPTION_ILLEGAL_INSTRUCTION){
+    /*Ignore exception.*/
+  }
+#  if defined(ARMv6E_ASM)
+  __try{
+    /*SHADD8 r3,r3,r3*/
+    __emit(0xE6333F93);
+    flags|=OPUS_CPU_ARM_MEDIA;
+  }
+  __except(GetExceptionCode()==EXCEPTION_ILLEGAL_INSTRUCTION){
+    /*Ignore exception.*/
+  }
+#   if defined(ARM_HAVE_NEON)
+  __try{
+    /*VORR q0,q0,q0*/
+    __emit(0xF2200150);
+    flags|=OPUS_CPU_ARM_NEON;
+  }
+  __except(GetExceptionCode()==EXCEPTION_ILLEGAL_INSTRUCTION){
+    /*Ignore exception.*/
+  }
+#   endif
+#  endif
+# endif
+  return flags;
+}
+
+#elif defined(__linux__)
+/* Linux based */
+opus_uint32 opus_cpu_capa(void)
+{
+  opus_uint32 flags = 0;
+  FILE *cpuinfo;
+
+  /* Reading /proc/self/auxv would be easier, but that doesn't work 
reliably on
+   * Android */
+  cpuinfo = fopen("/proc/cpuinfo", "r");
+
+  if(cpuinfo != NULL)
+  {
+    /* 512 should be enough for anybody (it's even enough for all the 
flags that
+     * x86 has accumulated... so far). */
+    char buf[512];
+
+    while(fgets(buf, 512, cpuinfo) != NULL)
+    {
+      /* Search for edsp and neon flag */
+      if(memcmp(buf, "Features", 8) == 0)
+      {
+        char *p;
+        p = strstr(buf, " edsp");
+        if(p != NULL && (p[5] == ' ' || p[5] == '\n'))
+          flags |= OPUS_CPU_ARM_EDSP;
+
+        p = strstr(buf, " neon");
+        if(p != NULL && (p[5] == ' ' || p[5] == '\n'))
+          flags |= OPUS_CPU_ARM_NEON;
+      }
+
+      /* Search for media capabilities (>= ARMv6) */
+      if(memcmp(buf, "CPU architecture:", 17) == 0)
+      {
+        int version;
+        version = atoi(buf+17);
+
+        if(version == 4)
+          flags |= OPUS_CPU_ARM_V4;

[TBT] The Windows code isn't detecting this, and I don't think we 
should, either. We should just assume it's true.

+
+        if(version >= 6)
+          flags |= OPUS_CPU_ARM_MEDIA;
+      }
+    }
+
+    fclose(cpuinfo);
+  }
+  return flags;
+}
+#else
+/* The feature registers which can tell us what the processor supports are
+ * accessible in priveleged modes only, so we can't have a general 
user-space
+ * detection method like on x86.*/
+# error "Configured to use ARM asm but no CPU detection method 
available for " \
+   "your platform.  Reconfigure with --disable-asm (or send patches)."
+#endif
+

[TBT] This flags -> arch conversion should work in the reverse order. 
I.e., we should check the EDSP, MEDIA, and NEON flags in that order, and 
the first one we don't have, we should stop and return the prior one as 
our arch index.

+int opus_select_arch(void)
+{
+  opus_uint32 flags = opus_cpu_capa();
+
+  if(flags & OPUS_CPU_ARM_NEON)
+    return 4;
+  else if(flags & OPUS_CPU_ARM_MEDIA)
+    return 3;
+  else if(flags & OPUS_CPU_ARM_EDSP)
+    return 2;
+  else if(flags & OPUS_CPU_ARM_V4)
+    return 1;

[TBT] Don't need to test for ARMv4.

+  else
+    return 0;
+}


More information about the opus mailing list