[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON

Thu Apr 6 19:55:40 UTC 2017

Hi Linfeng,

I had a closer look at your patch and the code looks good -- and
slightly simpler than I had anticipated, so that's good.

I did some profiling on a Cortex A57 and I've been seeing slightly less
improvement than you're reporting, more like 3.5% at complexity 8. It
appears that the warped autocorrelation function itself is only faster
by a factor of about 1.35. That's a bit surprising considering I see
nothing obviously wrong with the code.

I'm not sure what's the problem, but here's a few things that may be
worth considering:
1) In calc_state(), rather than splitting the multiply in two
instructions, you may be able to simply shift the warping left 16 bits,
then use the Neon instruction that does a*b>>32 (i.e. the one that
computes the top bits of a 32x32 multiply)
2) If the problem is with the movs at the end of each iteration, then
you should be able to get rid of them by unrolling by a factor of two.
3) It seems likely that you have significant register spill going on due
to processing up to 24 "taps" at the same time. If that's causing a
slowdown, then it should be possible to do the processing in "sections".
By that, I mean that you can implement (e.g.) an order-8 "kernel" that
computes the correlations and also outputs the last element of
state_QS_s32x4[0][0] back to input_QS[], so that it can be used to
compute a new secion.
4) It's a minor detail, but the last element of corr_QC[] that's not
currently vectorized could simply be vectorized independently outside
the loop (and it's the same for all orders).

Cheers,

	Jean-Marc

On 05/04/17 02:13 PM, Linfeng Zhang wrote:
> Thank Jean-Marc!
> 
> The speedup percentages are all relative to the entire encoder.
> 
> Comparing to master, this optimization patch speeds up fixed-point SILK
> encoder on NEON as following: Complexity 5: 6.1% Complexity 6: 5.8%
> Complexity 8: 5.5% Complexity 10: 4.0%
> 
> when testing on an Acer Chromebook, ARMv7 Processor rev 3 (v7l), CPU max
> MHz: 2116.5
> 
> Thanks,
> Linfeng
> 
> On Wed, Apr 5, 2017 at 11:02 AM, Jean-Marc Valin <jmvalin at jmvalin.ca
> <mailto:jmvalin at jmvalin.ca>> wrote:
> 
>     Hi Linfeng,
> 
>     Thanks for the updated patch. I'll have a look and get back to you. When
>     you report speedup percentages, is that relative to the entire encoder
>     or relative to just that function in C? Also, what's the speedup
>     compared to master?
> 
>     Cheers,
> 
>             Jean-Marc
> 
>     On 05/04/17 12:14 PM, Linfeng Zhang wrote:
>     > I attached a new patch with small cleanup (disassembly is identical as
>     > the last patch). We have done the same internal testing as usual.
>     >
>     > Also, attached 2 failed temporary versions which try to reduce code size
>     > (just for code review reference purpose).
>     >
>     > The new patch of silk_warped_autocorrelation_FIX_neon() has a code size
>     > of 3,228 bytes (with gcc).
>     > smaller_slower.c has a code size of 2,304 bytes, but the encoder is
>     > about 1.8% - 2.7% slower.
>     > smallest_slowest.c has a code size of 1,656 bytes, but the encoder is
>     > about 2.3% - 3.6% slower.
>     >
>     > Thanks,
>     > Linfeng
>     >
>     > On Mon, Apr 3, 2017 at 3:01 PM, Linfeng Zhang <linfengz at google.com <mailto:linfengz at google.com>
>     > <mailto:linfengz at google.com <mailto:linfengz at google.com>>> wrote:
>     >
>     >     Hi Jean-Marc,
>     >
>     >     Attached is the silk_warped_autocorrelation_FIX_neon() which
>     >     implements your idea.
>     >
>     >     Speed improvement vs the previous optimization:
>     >
>     >     Complexity 0-4: Doesn't call this function. Complexity 5: 2.1%
>     >     (order = 16) Complexity 6: 1.0% (order = 20) Complexity 8: 0.1%
>     >     (order = 24) Complexity 10: 0.1% (order = 24)
>     >
>     >     Code size of silk_warped_autocorrelation_FIX_neon() changes from
>     >     2,644 bytes to 3,228 bytes.
>     >
>     >     The reason of larger code size is that the new optimization
>     >     specializes order 16, 20 and 24. If only keeping order 24
>     >     specialization, the code still works and the code size is smaller,
>     >     but the encoder speed will drop 4.0% for Complexity 5 and 2.0% for
>     >     Complexity 6. Anyway, the new code is easier to understand and maintain.
>     >
>     >     Thanks,
>     >
>     >     Linfeng
>     >
>     >
>     >     On Tue, Feb 7, 2017 at 8:58 AM, Linfeng Zhang <linfengz at google.com <mailto:linfengz at google.com>
>     >     <mailto:linfengz at google.com <mailto:linfengz at google.com>>> wrote:
>     >
>     >         Hi Jean-Marc,
>     >
>     >         Thanks for your suggestions. Will get back to you once we have
>     >         some updates.
>     >
>     >         Linfeng
>     >
>     >         On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin
>     >         <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>     <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>>> wrote:
>     >
>     >             Hi Linfeng,
>     >
>     >             On 06/02/17 07:18 PM, Linfeng Zhang wrote:
>     >             > This is a great idea. But the order
>     (psEncC->shapingLPCOrder) can be
>     >             > configured to 12, 14, 16, 20 and 24 according to
>     complexity parameter.
>     >             >
>     >             > It's hard to get a universal function to handle all
>     these orders
>     >             > efficiently. Any suggestions?
>     >
>     >             I can think of two ways of handling larger orders. The
>     >             obvious one is
>     >             simply to add an inner loop of the form:
>     >             for (i=0;i<order;i+=VECTOR_SIZE)
>     >             I think what may be more efficient is to simply have a
>     small
>     >             "order-N"
>     >             (N=4 or 8) kernel that not only computes the
>     correlation of
>     >             order N, but
>     >             then spits out the signal after the N-stage all-pass is
>     >             applied. The
>     >             kernel would look like:
>     >
>     >             void autocorr_kernel4(int *corr, int *orig, int
>     *input, int
>     >             *output, int
>     >             len) {
>     >                /* Implement vectorized order-4 filter (could also be
>     >             order 8)
>     >                   as described in previous email and outputs the
>     >             filtered signal.
>     >                */
>     >             }
>     >
>     >             and then the full function would run the kernel multiple
>     >             times and look
>     >             like:
>     >
>     >             void full_autocorr(int *corr, int *orig, int len, int
>     order) {
>     >                int i;
>     >                int tmp[MAX_SIZE];
>     >                int *in = orig;
>     >                for (i=0;i<order;i+=4) {
>     >                   autocorr_kernel4(corr+i, orig, in, tmp, len);
>     >                   /* Make subsequent calls use the filtered signal as
>     >             input. */
>     >                   in = tmp;
>     >                }
>     >             }
>     >
>     >             I think the should not only reduce/eliminate the
>     >             prologue/epilogue
>     >             problem, but it should also be more efficient since almost
>     >             all vectors
>     >             processed would use the full size.
>     >
>     >             Maybe a third option (not sure it's a good idea, but still
>     >             mentioning
>     >             it) would be to have a function that hardcodes
>     order=24 and
>     >             discards the
>     >             larger values that aren't needed. Since the smallest order
>     >             seems to be
>     >             16, it wouldn't be much of a waste and the code might
>     end up
>     >             running
>     >             faster for the higher orders.
>     >
>     >             Cheers,
>     >
>     >                     Jean-Marc
>     >
>     >
>     >             > Thanks,
>     >             > Linfeng
>     >             >
>     >             > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin
>     <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>     <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>>
>     >             > <mailto:jmvalin at jmvalin.ca
>     <mailto:jmvalin at jmvalin.ca> <mailto:jmvalin at jmvalin.ca
>     <mailto:jmvalin at jmvalin.ca>>>>
>     >             wrote:
>     >             >
>     >             >     Hi Linfeng,
>     >             >
>     >             >     On 06/02/17 02:51 PM, Linfeng Zhang wrote:
>     >             >     > However, the critical thing is that all the states
>     >             in each stage when
>     >             >     > processing input[i] are reused by the next
>     >             input[i+1]. That is
>     >             >     > input[i+1] must wait input[i] for 1 stage, and
>     >             input[i+2] must wait
>     >             >     > input[i+1] for 1 stage, etc.
>     >             >
>     >             >     That is indeed the tricky part... and the one I
>     think
>     >             you could do
>     >             >     slightly differently. If you approach the problem in
>     >             terms of computing
>     >             >     chunks of the inputs N samples at a time, then
>     indeed
>     >             the approach you
>     >             >     are describing is the only solution. What I was
>     >             proposing though is to
>     >             >     instead chop the "order" in chunks of N. Using your
>     >             notation, you would
>     >             >     be doing:
>     >             >
>     >             >     PROC(
>     >                   in0(s0))
>     >             >     PROC(
>     >             in0(s1) in1(s0))
>     >             >     PROC(                                        in0(s2)
>     >             in1(s1) in2(s0))
>     >             >     PROC(                                in0(s3) in1(s2)
>     >             in2(s1) in3(s0))
>     >             >     PROC(                        in0(s4) in1(s3) in2(s2)
>     >             in3(s1) in4(s0))
>     >             >     PROC(                in0(s5) in1(s4) in2(s3) in3(s2)
>     >             in4(s1) in5(s0))
>     >             >     PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2)
>     >             in5(s1) in6(s0))
>     >             >     PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2)
>     >             in6(s1) in7(s0))
>     >             >     PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2)
>     >             in7(s1) in8(s0))
>     >             >     PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)
>     >             in8(s1) in9(s0))
>     >             >     PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2)
>     >             in9(s1)in10(s0))
>     >             >     PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3)
>     >             in9(s2)in10(s1)in11(s0))
>     >             >     ...and so on until the end of the input vector
>     >             >
>     >             >     The difference is that it's now the input vector
>     that
>     >             "slides" and the
>     >             >     "state" values sy that remain in the same place.
>     >             There's still a
>     >             >     prologue, but you can easily get rid of it by
>     >             (implicitly) zero-padding
>     >             >     the in vector during the initialization phase (start
>     >             with a zero vector
>     >             >     and real one value at a time). Getting rid of the
>     >             epilogue is a little
>     >             >     trickier, but I think it can be done.
>     >             >
>     >             >     Cheers,
>     >             >
>     >             >             Jean-Marc
>     >             >
>     >             >     > Then it becomes this
>     >             >     >
>     >             >     > FOR in=0 to N WITH in+=8
>     >             >     >   PROC(in0(s0)) /* prolog 0 */
>     >             >     >   PROC(in0(s1) in1(s0)) /* prolog 1 */
>     >             >     >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
>     >             >     >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /*
>     prolog 3 */
>     >             >     >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /*
>     >             prolog 4 */
>     >             >     >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1)
>     >             in5(s0)) /* prolog 5 */
>     >             >     >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2)
>     >             in5(s1) in6(s0)) /*
>     >             >     > prolog 6 */
>     >             >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3)
>     >             in5(s2) in6(s1)
>     >             >     in7(s0))
>     >             >     > /* fully process 8 inputs */
>     >             >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4)
>     >             in5(s3) in6(s2)
>     >             >     in7(s1))
>     >             >     > /* continue */
>     >             >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5)
>     >             in5(s4) in6(s3)
>     >             >     in7(s2))
>     >             >     > /* continue */
>     >             >     >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6)
>     >             in5(s5) in6(s4)
>     >             >     in7(s3))
>     >             >     > /* continue */
>     >             >     >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6)
>     >             in6(s5) in7(s4)) /*
>     >             >     > epilog 0 */
>     >             >     >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6)
>     >             in7(s5)) /* epilog
>     >             >     1 */
>     >             >     >   PROC(in3(s10) in4(s9) in5(s8) in6(s7)
>     in7(s6)) /*
>     >             epilog 2 */
>     >             >     >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /*
>     epilog 3 */
>     >             >     >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
>     >             >     >   PROC(in6(s10) in7(s9)) /* epilog 5 */
>     >             >     >   PROC(in7(s10)) /* epilog 6 */
>     >             >     > END FOR
>     >             >     >
>     >             >     > And
>     >             >     >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3)
>     >             in5(s2) in6(s1)
>     >             >     in7(s0))
>     >             >     > /* fully process 8 inputs */
>     >             >     >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4)
>     >             in5(s3) in6(s2)
>     >             >     in7(s1))
>     >             >     > /* continue */
>     >             >     >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5)
>     >             in5(s4) in6(s3)
>     >             >     in7(s2))
>     >             >     > /* continue */
>     >             >     > is actually the expansion of the kernel loop
>     >             >     > FOR i=0 TO order-6 WITH i++
>     >             >     >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4)
>     >             in4(si+3) in5(si+2)
>     >             >     > in6(si+1) in7(si+0))
>     >             >     > END FOR
>     >             >     >
>     >             >     > The worst thing is that corr_QC[] is so sensitive
>     >             that any extra
>     >             >     > processing will make them wrong and propagate
>     to the
>     >             next loop (next 8
>     >             >     > inputs). state_QS[] is a little better but still
>     >             very sensitive. For
>     >             >     > instance, if adding PROC(in0(s11') in1(s10)
>     in2(s9)
>     >             in3(s8) in4(s7)
>     >             >     > in5(s6) in6(s5) in7(s4)) to the kernel loop (by
>     >             looping one more time)
>     >             >     > and remove epilog 0, then all final results
>     will be
>     >             wrong.
>     >             >     >
>     >             >     > That's why the prolog and epilog cannot be
>     saved to
>     >             the best of my
>     >             >     > knowledge.
>     >             >     >
>     >             >     > The assembly size of
>     >             silk_warped_autocorrelation_FIX_neon() is about
>     >             >     > 2,744 bytes. Compared with the C code size (about
>     >             452 bytes), it's 2.3
>     >             >     > KB larger. Considering
>     >             silk_warped_autocorrelation_FIX_c() is the
>     >             >     second
>     >             >     > place CPU heavy function in fixed-point, and our
>     >             testing shows up
>     >             >     to 7%
>     >             >     > CPU run time saving of the total encoder with this
>     >             optimization (at
>     >             >     > Complexity 8), maybe we can take the I-cache
>     burden
>     >             even if finally we
>     >             >     > still cannot remove the big chunk of prolog
>     and epilog.
>     >             >     >
>     >             >     > Thanks,
>     >             >     > Linfeng Zhang
>     >             >     >
>     >             >     > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin
>     >             >     <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>     <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>>
>     >             <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>     <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>>>
>     >             >     > <mailto:jmvalin at jmvalin.ca
>     <mailto:jmvalin at jmvalin.ca>
>     >             <mailto:jmvalin at jmvalin.ca
>     <mailto:jmvalin at jmvalin.ca>> <mailto:jmvalin at jmvalin.ca
>     <mailto:jmvalin at jmvalin.ca>
>     >             <mailto:jmvalin at jmvalin.ca
>     <mailto:jmvalin at jmvalin.ca>>>>> wrote:
>     >             >     >
>     >             >     >     Hi Felicia,
>     >             >     >
>     >             >     >     I've had time to work through the math in the
>     >             original
>     >             >     function and I'm
>     >             >     >     pretty sure it's possible to vectorize this
>     >             without the huge
>     >             >     >     prologue/epilogue.
>     >             >     >
>     >             >     >     For the simple case where order = vector
>     size =
>     >             N (but it
>     >             >     should easily
>     >             >     >     generalize to larger order), what I came
>     up with is:
>     >             >     >
>     >             >     >     initialize X, Y, M, C to vector of zeros
>     >             >     >
>     >             >     >     for i=0 to N+order
>     >             >     >        T = [x(i), Y(0:N-2)]
>     >             >     >        Y = M + coeff * (Y - T)
>     >             >     >        M = T
>     >             >     >        X = [x(i), X(0:N-1)]
>     >             >     >        C = C + X*Y
>     >             >     >
>     >             >     >     I think something similar to this (assuming I
>     >             didn't mess up any
>     >             >     >     details) should give you the correlations in
>     >             vector C. Did I miss
>     >             >     >     anything?
>     >             >     >
>     >             >     >     Cheers,
>     >             >     >
>     >             >     >             Jean-Marc
>     >             >     >
>     >             >     >
>     >             >     >     On 31/01/17 12:30 PM, Felicia Lim wrote:
>     >             >     >     > Hi,
>     >             >     >     >
>     >             >     >     > Attached is a patch with arm neon
>     >             optimizations for
>     >             >     >     > silk_warped_autocorrelation_FIX().
>     Please review.
>     >             >     >     >
>     >             >     >     > Thanks,
>     >             >     >     > Felicia
>     >             >     >     >
>     >             >     >     >
>     >             >     >     >
>     _______________________________________________
>     >             >     >     > opus mailing list
>     >             >     >     > opus at xiph.org <mailto:opus at xiph.org>
>     <mailto:opus at xiph.org <mailto:opus at xiph.org>>
>     >             <mailto:opus at xiph.org <mailto:opus at xiph.org>
>     <mailto:opus at xiph.org <mailto:opus at xiph.org>>>
>     >             <mailto:opus at xiph.org <mailto:opus at xiph.org>
>     <mailto:opus at xiph.org <mailto:opus at xiph.org>>
>     >             >     <mailto:opus at xiph.org <mailto:opus at xiph.org>
>     <mailto:opus at xiph.org <mailto:opus at xiph.org>>>>
>     >             >     >     > http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >             <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>
>     >             >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >             <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>>
>     >             >     >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >             <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>
>     >             >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >             <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>>>
>     >             >     >     >
>     >             >     >     _______________________________________________
>     >             >     >     opus mailing list
>     >             >     >     opus at xiph.org <mailto:opus at xiph.org> <mailto:opus at xiph.org
>     <mailto:opus at xiph.org>>
>     >             <mailto:opus at xiph.org <mailto:opus at xiph.org> <mailto:opus at xiph.org
>     <mailto:opus at xiph.org>>>
>     >             <mailto:opus at xiph.org <mailto:opus at xiph.org> <mailto:opus at xiph.org
>     <mailto:opus at xiph.org>>
>     >             >     <mailto:opus at xiph.org <mailto:opus at xiph.org>
>     <mailto:opus at xiph.org <mailto:opus at xiph.org>>>>
>     >             >     >   
>      http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >             <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>
>     >             >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >             <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>>
>     >             >     >   
>      <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >             <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>
>     >             >     <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >             <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>>>
>     >             >     >
>     >             >     >
>     >             >
>     >             >
>     >
>     >
>     >
>     >
> 
>