[opus] opus Digest, Vol 70, Issue 3

Fri Nov 7 02:18:06 PST 2014

Hi All,

The Cortex-M4 is a single-issue CPU whereas the A8 is dual-issue, so this is
the main reason you are seeing a slow-down; I would say the use of NEON is
secondary, certainly for CELT.  We (ESPICO) have done optimisation work on
Opus v1.1 and have ARM implementations about twice the speed of the 'off the
shelf' version.  Please contact me directly if you want to discuss further.

Cliff

-----Original Message----- 
From: opus-request at xiph.org
Sent: Thursday, November 06, 2014 8:00 PM
To: opus at xiph.org
Subject: opus Digest, Vol 70, Issue 3

Send opus mailing list submissions to
opus at xiph.org

To subscribe or unsubscribe via the World Wide Web, visit
http://lists.xiph.org/mailman/listinfo/opus
or, via email, send a message with subject or body 'help' to
opus-request at xiph.org

You can reach the person managing the list at
opus-owner at xiph.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of opus digest..."


Today's Topics:

   1. Re: opus Digest, Vol 70, Issue 1 (Heng Lou)
   2. [PATCH] float_cast: Fix MSVC ARM build (Hugo Beauzée-Luyssen)
   3. Re: [PATCH] float_cast: Fix MSVC ARM build (Martin Storsjö)


----------------------------------------------------------------------

Message: 1
Date: Wed, 5 Nov 2014 20:19:04 +0000
From: Heng Lou <Heng_Lou at starkey.com>
Subject: Re: [opus] opus Digest, Vol 70, Issue 1
To: "opus at xiph.org" <opus at xiph.org>
Message-ID:
<D0ED62510480544BA069AFE0E7A549460148B3C218 at ep2p-exmbs2.ms.starkey.com>

Content-Type: text/plain; charset="us-ascii"

Is it possible to use the Cortex-M4 DSP instructions to fully optimize the
Opus code?  Could we use the ARM CMSIS DSP library for this optimization?

Thanks,

Heng

-----Original Message-----
From: opus-bounces at xiph.org [mailto:opus-bounces at xiph.org] On Behalf Of
opus-request at xiph.org
Sent: Tuesday, November 04, 2014 2:00 PM
To: opus at xiph.org
Subject: opus Digest, Vol 70, Issue 1


Today's Topics:

   1. Opus performance on Cortex-M4 (Andy Isaacson)
   2. Re: Opus performance on Cortex-M4 (Jean-Marc Valin)


----------------------------------------------------------------------

Message: 1
Date: Mon, 3 Nov 2014 16:36:29 -0800
From: Andy Isaacson <adi at hexapodia.org>
Subject: [opus] Opus performance on Cortex-M4
To: opus at xiph.org
Message-ID: <20141104003629.GA20904 at hexapodia.org>
Content-Type: text/plain; charset=us-ascii

I'm considering implementing Opus as the codec for an embedded ARM-based
battery powered audio system.  In the interest of battery life and board
footprint I'd like to specify the smallest CPU that can do the job.

In some quick testing on Cortex-A8 (a very different core, but at least ISA
compatible and hopefully fairly similar to M4 for things like cycle counts
and code size) I saw promising results -- about 30 MHz of A8 CPU was
sufficient to encode an audio stream using the 1.1.1-beta fixed point codec
at 48 kHz mono, complexity=5, bitrate=20kbit/sec.

Since the target SoCs tend to have an M3 or M4 running up to 100-150 MHz,
and power consumption runs nearly linearly with clock speed, this seemed to
give us some headroom to run the rest of our application stack and tune for
battery life.

However now that we're doing a first implementation on M4, we're seeing
significantly higher cycle counts -- more in the range of 100 MHz of CPU
needed to encode with the same parameters.  Additionally, compared to 1.0.3,
the code size and data size of the Opus codec in 1.1 has increased
significantly (which makes it a challenge to fit in the on-SoC SRAM of the
M4).

Obviously we need to use the ARM ASM that landed in -beta, and we can
decrease the complexity to somewhat reduce the CPU utilization, but I'm
wondering if I'm missing any other low-hanging fruit in optimizing Opus for
this target CPU.  I haven't even started to do code profiling or CPU
performance counter analysis.

Does anyone have examples of similar applications?  What kinds of CPU
occupancy have other people seen on similar CPUs?  Do we need to get some
NEON asm?  Does anybody have spare cycles to take paid work in this space?

-andy


------------------------------

Message: 2
Date: Mon, 03 Nov 2014 20:32:30 -0500
From: Jean-Marc Valin <jmvalin at jmvalin.ca>
Subject: Re: [opus] Opus performance on Cortex-M4
To: Andy Isaacson <adi at hexapodia.org>, opus at xiph.org
Message-ID: <54582CAE.9080806 at jmvalin.ca>
Content-Type: text/plain; charset=windows-1252

Hi Andy,

On 03/11/14 07:36 PM, Andy Isaacson wrote:
> In some quick testing on Cortex-A8 (a very different core, but at
> least ISA compatible and hopefully fairly similar to M4 for things
> like cycle counts and code size) I saw promising results -- about 30
> MHz of A8 CPU was sufficient to encode an audio stream using the
> 1.1.1-beta fixed point codec at 48 kHz mono, complexity=5,
> bitrate=20kbit/sec.

First, I think the big difference between the M4 and the A8 is that A8 has
Neon, which Opus is able to use.

> However now that we're doing a first implementation on M4, we're
> seeing significantly higher cycle counts -- more in the range of 100
> MHz of CPU needed to encode with the same parameters.  Additionally,
> compared to 1.0.3, the code size and data size of the Opus codec in
> 1.1 has increased significantly (which makes it a challenge to fit in
> the on-SoC SRAM of the M4).

I suspect most of the size increase you're seeing is from the new code in
src/analysis.c which you do not need. In fact, if you're operating at
20 kb/s for speech, then you can entirely remove the CELT encoder from your
build. You still need the decoder because there's no guarantee what the
remote end will send you.

> Obviously we need to use the ARM ASM that landed in -beta, and we can
> decrease the complexity to somewhat reduce the CPU utilization, but
> I'm wondering if I'm missing any other low-hanging fruit in optimizing
> Opus for this target CPU.  I haven't even started to do code profiling
> or CPU performance counter analysis.

There are a few things to check. First, make sure that OPUS_ARM_INLINE_EDSP
(enabling DSP extensions) is defined in your config.h. Also, check for
OPUS_ARM_ASM and OPUS_HAVE_RTCD. With those defined, all the asm is enabled.
At that point, the best approach is to profile and see where the CPU time is
spent.

Cheers,

Jean-Marc


------------------------------

_______________________________________________
opus mailing list
opus at xiph.org
http://lists.xiph.org/mailman/listinfo/opus


End of opus Digest, Vol 70, Issue 1
***********************************


------------------------------

Message: 2
Date: Thu,  6 Nov 2014 17:33:48 +0100
From: Hugo Beauzée-Luyssen <hugo at beauzee.fr>
Subject: [opus] [PATCH] float_cast: Fix MSVC ARM build
To: opus at xiph.org
Message-ID: <1415291628-8419-1-git-send-email-hugo at beauzee.fr>

---
celt/float_cast.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/celt/float_cast.h b/celt/float_cast.h
index ede6574..4892e2c 100644
--- a/celt/float_cast.h
+++ b/celt/float_cast.h
@@ -90,14 +90,14 @@
#include <math.h>
#define float2int(x) lrint(x)

-#elif (defined(_MSC_VER) && _MSC_VER >= 1400) && (defined (WIN64) || defined (_WIN64))
+#elif (defined(_MSC_VER) && _MSC_VER >= 1400) && (defined (WIN64) || defined (_WIN64)) && !defined(_M_ARM)
         #include <xmmintrin.h>

         __inline long int float2int(float value)
         {
                 return _mm_cvtss_si32(_mm_load_ss(&value));
         }
-#elif (defined(_MSC_VER) && _MSC_VER >= 1400) && (defined (WIN32) || defined (_WIN32))
+#elif (defined(_MSC_VER) && _MSC_VER >= 1400) && (defined (WIN32) || defined (_WIN32)) && !defined(_M_ARM)
         #include <math.h>

         /*      Win32 doesn't seem to have these functions.
-- 
2.1.1



------------------------------

Message: 3
Date: Thu, 6 Nov 2014 19:16:19 +0200 (EET)
From: Martin Storsjö <martin at martin.st>
Subject: Re: [opus] [PATCH] float_cast: Fix MSVC ARM build
To: Hugo Beauzée-Luyssen <hugo at beauzee.fr>
Cc: opus at xiph.org
Message-ID: <alpine.DEB.2.02.1411061914180.4328 at cone.martin.st>
Content-Type: text/plain; charset="iso-8859-15"

On Thu, 6 Nov 2014, Hugo Beauzée-Luyssen wrote:

> ---
> celt/float_cast.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/celt/float_cast.h b/celt/float_cast.h
> index ede6574..4892e2c 100644
> --- a/celt/float_cast.h
> +++ b/celt/float_cast.h
> @@ -90,14 +90,14 @@
> #include <math.h>
> #define float2int(x) lrint(x)
>
> -#elif (defined(_MSC_VER) && _MSC_VER >= 1400) && (defined (WIN64) || defined (_WIN64))
> +#elif (defined(_MSC_VER) && _MSC_VER >= 1400) && (defined (WIN64) || defined (_WIN64)) && !defined(_M_ARM)
>         #include <xmmintrin.h>
>
>         __inline long int float2int(float value)
>         {
>                 return _mm_cvtss_si32(_mm_load_ss(&value));
>         }
> -#elif (defined(_MSC_VER) && _MSC_VER >= 1400) && (defined (WIN32) || defined (_WIN32))
> +#elif (defined(_MSC_VER) && _MSC_VER >= 1400) && (defined (WIN32) || defined (_WIN32)) && !defined(_M_ARM)
>         #include <math.h>
>
>         /*      Win32 doesn't seem to have these functions.
> -- 
> 2.1.1

Since MSVC may support architectures other than ARM and x86 (it did support
MIPS, Alpha and Itanium at various points in time), I think it might be
better to use this instead:

... && (defined(_M_IX86) || defined(_M_X64))

// Martin

------------------------------


End of opus Digest, Vol 70, Issue 3
***********************************