[opus] Alleged bug in Silk codec
jmvalin at jmvalin.ca
Tue Jul 1 08:24:18 PDT 2014
FYI I just pushed the 64-bit C0 patch.
On 25/06/14 06:37 AM, Marcello Caramma (mcaramma) wrote:
> Yes, regarding the unsigned to signed conversion you are right, it is
> implementation defined. I just had an issue a couple of years ago with
> a compiler which incorrectly treated unsigned overflow as undefined
> rather than implementation defined…
> Regarding the 64 bit profiling: I looked at the disassembly (gcc –c –S
> –O2 ../opus/silk/sum_sqr_shift.c –I../opus/include –I../opus/celt) of
> the 64 bit accumulator version (unrolled twice like the current code)
> and found that, as well as having only one loop, the loop has 12
> instructions per iteration.
> The current version (after fixing the bug) gives 12 instructions per
> iteration until shift becomes non zero (first loop). After that there
> are 16 instructions per iteration in the second loop, and that without
> counting the additional instructions we jump to when we need to handle
> the overflow and increase the shift.
> Although this is not proper profiling, it is good enough for me. And
> considering the increase in code clarity, I would definitely go for the
> 64 bit version.
> Just for kicks, I also made an experimental version that uses 32 bits
> but leaves 8 bits margin before accumulating, and that shaves 2
> instructions per loop, making it 10. All in all, I am not sure it is
> worth the hassle though considering this is only about 1/order of the
> cost of the autocorrelation…
> From: KOEN VOS <koenvos74 at gmail.com <mailto:koenvos74 at gmail.com>>
> Date: Friday, 20 June 2014 22:04
> To: "Timothy B. Terriberry" <tterribe at xiph.org <mailto:tterribe at xiph.org>>
> Cc: "opus at xiph.org <mailto:opus at xiph.org>" <opus at xiph.org
> <mailto:opus at xiph.org>>, Marcello Caramma <mcaramma at cisco.com
> <mailto:mcaramma at cisco.com>>
> Subject: Re: [opus] Alleged bug in Silk codec
> Yes those instructions exist, although they're a bit slower than the
> basic 16x16->32 with 32-bit accumulation (SMLABB). So I'd be surprised
> if the function with 64 bit accumulation would run as fast as the
> current code. Don't know how much we care about 16-bit platforms. And
> accuracy should not matter.
> On the other hand, a 64-bit implementation is much cleaner/shorter,
> which is always a good argument :-)
> All in all, no strong preference from my side.
> On Fri, Jun 20, 2014 at 12:40 PM, Timothy B. Terriberry
> <tterribe at xiph.org <mailto:tterribe at xiph.org>> wrote:
> KOEN VOS wrote:
> current implementation is faster on a 32 or 16 bit processor.
> If you
> would find the opposite to be true (ie that a 64 bit
> implementation is
> faster on, say, a 32 bit ARM CPU) then perhaps we should reconsider.
> Doesn't ARMv6 have a dual signed 16x16->32 multiply with a 64-bit
> accumulator (SMLALD)? Even v5E should have a single 16x16->32 with a
> 64-bit accumulator (SMLALBB). I would think a 64-bit version could
> be made pretty fast on 32-bit ARM, without even resorting to SIMD.
More information about the opus