[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Linfeng Zhang
linfengz at google.com
Wed Apr 5 18:13:10 UTC 2017
Thanks, Jean-Marc!

The speedup percentages are all relative to the entire encoder.
Compared to master, this optimization patch speeds up the fixed-point
SILK encoder on NEON as follows:

Complexity 5:  6.1%
Complexity 6:  5.8%
Complexity 8:  5.5%
Complexity 10: 4.0%

when tested on an Acer Chromebook, ARMv7 Processor rev 3 (v7l), CPU max
MHz: 2116.5.
Thanks,
Linfeng
On Wed, Apr 5, 2017 at 11:02 AM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> Hi Linfeng,
>
> Thanks for the updated patch. I'll have a look and get back to you. When
> you report speedup percentages, is that relative to the entire encoder
> or relative to just that function in C? Also, what's the speedup
> compared to master?
>
> Cheers,
>
> Jean-Marc
>
> On 05/04/17 12:14 PM, Linfeng Zhang wrote:
> > I attached a new patch with a small cleanup (the disassembly is
> > identical to the last patch). We have done the same internal testing
> > as usual.
> >
> > Also attached are 2 failed temporary versions which try to reduce the
> > code size (for code review reference purposes only).
> >
> > In the new patch, silk_warped_autocorrelation_FIX_neon() has a code
> > size of 3,228 bytes (with gcc).
> > smaller_slower.c has a code size of 2,304 bytes, but the encoder is
> > about 1.8% - 2.7% slower.
> > smallest_slowest.c has a code size of 1,656 bytes, but the encoder is
> > about 2.3% - 3.6% slower.
> >
> > Thanks,
> > Linfeng
> >
> > On Mon, Apr 3, 2017 at 3:01 PM, Linfeng Zhang <linfengz at google.com> wrote:
> >
> > Hi Jean-Marc,
> >
> > Attached is silk_warped_autocorrelation_FIX_neon(), which
> > implements your idea.
> >
> > Speed improvement vs the previous optimization:
> >
> > Complexity 0-4: doesn't call this function.
> > Complexity 5:  2.1% (order = 16)
> > Complexity 6:  1.0% (order = 20)
> > Complexity 8:  0.1% (order = 24)
> > Complexity 10: 0.1% (order = 24)
> >
> > Code size of silk_warped_autocorrelation_FIX_neon() changes from
> > 2,644 bytes to 3,228 bytes.
> >
> > The reason for the larger code size is that the new optimization
> > specializes orders 16, 20 and 24. If we keep only the order-24
> > specialization, the code still works and is smaller, but the encoder
> > speed drops 4.0% for Complexity 5 and 2.0% for Complexity 6. In any
> > case, the new code is easier to understand and maintain.
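> >
> > For reference, the specialization amounts to something like the
> > following simplified scalar sketch (not the actual NEON code; a
> > plain unwarped correlation stands in for the real kernel, and all
> > names here are made up):
> >
> > #include <stdint.h>
> >
> > /* With a compile-time constant order, each kernel's loops can be
> >    fully unrolled, which is where the specialization wins. */
> > static inline void autocorr_fixed_order(int32_t *corr,
> >                                         const int16_t *in,
> >                                         int len, const int order)
> > {
> >    int lag, n;
> >    for (lag = 0; lag <= order; lag++) {
> >       int64_t acc = 0;
> >       for (n = lag; n < len; n++)
> >          acc += (int32_t)in[n] * in[n - lag];
> >       corr[lag] = (int32_t)(acc >> 16);   /* arbitrary scaling */
> >    }
> > }
> >
> > static void warped_autocorr_sketch(int32_t *corr, const int16_t *in,
> >                                    int len, int order)
> > {
> >    switch (order) {                /* specialize orders 16, 20, 24 */
> >    case 16: autocorr_fixed_order(corr, in, len, 16); break;
> >    case 20: autocorr_fixed_order(corr, in, len, 20); break;
> >    default: autocorr_fixed_order(corr, in, len, 24); break;
> >    }
> > }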
> >
> > Thanks,
> >
> > Linfeng
> >
> >
> > On Tue, Feb 7, 2017 at 8:58 AM, Linfeng Zhang <linfengz at google.com> wrote:
> >
> > Hi Jean-Marc,
> >
> > Thanks for your suggestions. Will get back to you once we have
> > some updates.
> >
> > Linfeng
> >
> > On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> >
> > Hi Linfeng,
> >
> > On 06/02/17 07:18 PM, Linfeng Zhang wrote:
> > > This is a great idea. But the order (psEncC->shapingLPCOrder) can
> > > be configured to 12, 14, 16, 20 and 24 according to the complexity
> > > parameter.
> > >
> > > It's hard to get a universal function to handle all these orders
> > > efficiently. Any suggestions?
> >
> > I can think of two ways of handling larger orders. The obvious one
> > is simply to add an inner loop of the form:
> >
> >    for (i=0;i<order;i+=VECTOR_SIZE)
> >
> > I think what may be more efficient is to simply have a small
> > "order-N" (N=4 or 8) kernel that not only computes the correlation
> > of order N, but then spits out the signal after the N-stage all-pass
> > is applied. The kernel would look like:
> >
> > void autocorr_kernel4(int *corr, int *orig, int *input, int *output,
> >                       int len) {
> >    /* Implement the vectorized order-4 filter (could also be order
> >       8) as described in the previous email and output the filtered
> >       signal. */
> > }
> >
> > and then the full function would run the kernel multiple times and
> > look like:
> >
> > void full_autocorr(int *corr, int *orig, int len, int order) {
> >    int i;
> >    int tmp[MAX_SIZE];
> >    int *in = orig;
> >    for (i=0;i<order;i+=4) {
> >       autocorr_kernel4(corr+i, orig, in, tmp, len);
> >       /* Make subsequent calls use the filtered signal as input. */
> >       in = tmp;
> >    }
> > }
> >
> > I think this should not only reduce/eliminate the prologue/epilogue
> > problem, but it should also be more efficient since almost all the
> > vectors processed would use the full size.
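> >
> > To be concrete, here is a scalar model of such a kernel (untested;
> > float instead of the Q-format fixed point, the warping coefficient
> > passed in as lambda, and the final correlation lag left out):
> >
> > void autocorr_kernel4(float *corr, const float *orig,
> >                       const float *input, float *output,
> >                       int len, float lambda)
> > {
> >    /* d[k] is the output of the k-th warped-delay (first-order
> >       all-pass) section at the current sample; d_prev[] holds the
> >       previous sample's values. */
> >    float d_prev[5] = {0};
> >    int n, k;
> >    for (n = 0; n < len; n++) {
> >       float d[5];
> >       d[0] = input[n];
> >       for (k = 1; k <= 4; k++)      /* 4 all-pass sections */
> >          d[k] = d_prev[k - 1] + lambda * (d_prev[k] - d[k - 1]);
> >       for (k = 0; k < 4; k++)       /* correlate vs the original */
> >          corr[k] += orig[n] * d[k];
> >       output[n] = d[4];             /* feeds the next kernel call */
> >       for (k = 0; k <= 4; k++)
> >          d_prev[k] = d[k];
> >    }
> > }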
> >
> > Maybe a third option (not sure it's a good idea, but still
> > mentioning it) would be to have a function that hardcodes order=24
> > and discards the larger values that aren't needed. Since the
> > smallest order seems to be 16, it wouldn't be much of a waste and
> > the code might end up running faster for the higher orders.
> >
> > Cheers,
> >
> > Jean-Marc
> >
> >
> > > Thanks,
> > > Linfeng
> > >
> > > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> > >
> > > Hi Linfeng,
> > >
> > > On 06/02/17 02:51 PM, Linfeng Zhang wrote:
> > > > However, the critical thing is that all the states in each stage
> > > > when processing input[i] are reused by the next input[i+1]. That
> > > > is, input[i+1] must wait for input[i] for 1 stage, and input[i+2]
> > > > must wait for input[i+1] for 1 stage, etc.
> > >
> > > That is indeed the tricky part... and the one I think you could do
> > > slightly differently. If you approach the problem in terms of
> > > computing chunks of the inputs N samples at a time, then indeed the
> > > approach you are describing is the only solution. What I was
> > > proposing though is to instead chop the "order" in chunks of N.
> > > Using your notation, you would be doing:
> > >
> > > PROC(                                                        in0(s0))
> > > PROC(                                                in0(s1) in1(s0))
> > > PROC(                                        in0(s2) in1(s1) in2(s0))
> > > PROC(                                in0(s3) in1(s2) in2(s1) in3(s0))
> > > PROC(                        in0(s4) in1(s3) in2(s2) in3(s1) in4(s0))
> > > PROC(                in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0))
> > > PROC(        in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0))
> > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0))
> > > PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1) in8(s0))
> > > PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1) in9(s0))
> > > PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2) in9(s1) in10(s0))
> > > PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3) in9(s2) in10(s1) in11(s0))
> > > ...and so on until the end of the input vector
> > >
> > > The difference is that it's now the input vector that "slides" and
> > > the "state" values sy remain in the same place. There's still a
> > > prologue, but you can easily get rid of it by (implicitly)
> > > zero-padding the in vector during the initialization phase (start
> > > with a zero vector and read in one value at a time). Getting rid
> > > of the epilogue is a little trickier, but I think it can be done.
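> > >
> > > In scalar C, the data flow would be something like this (only the
> > > indexing; the real PROC also updates the states and does the
> > > fixed-point all-pass arithmetic, so treat this as a sketch):
> > >
> > > #define N 8
> > > void slide_model(const float *in, int len, const float *s,
> > >                  float *acc)
> > > {
> > >    float win[N] = {0};             /* implicit zero padding */
> > >    int n, j;
> > >    for (n = 0; n < len; n++) {
> > >       for (j = 0; j < N - 1; j++)  /* slide the input window */
> > >          win[j] = win[j + 1];
> > >       win[N - 1] = in[n];
> > >       for (j = 0; j < N; j++)      /* one PROC(...) row */
> > >          acc[j] += win[j] * s[N - 1 - j];
> > >    }
> > > }
> > >
> > > Because the first N-1 window entries start out as zeros, the
> > > prologue rows of the diagram are just ordinary iterations of the
> > > same loop.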
> > >
> > > Cheers,
> > >
> > > Jean-Marc
> > >
> > > > Then it becomes this
> > > >
> > > > FOR in=0 to N WITH in+=8
> > > >   PROC(in0(s0)) /* prolog 0 */
> > > >   PROC(in0(s1) in1(s0)) /* prolog 1 */
> > > >   PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
> > > >   PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
> > > >   PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */
> > > >   PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /* prolog 5 */
> > > >   PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0)) /* prolog 6 */
> > > >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /* fully process 8 inputs */
> > > >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /* continue */
> > > >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /* continue */
> > > >   PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4) in7(s3)) /* continue */
> > > >   PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) /* epilog 0 */
> > > >   PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /* epilog 1 */
> > > >   PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */
> > > >   PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
> > > >   PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
> > > >   PROC(in6(s10) in7(s9)) /* epilog 5 */
> > > >   PROC(in7(s10)) /* epilog 6 */
> > > > END FOR
> > > >
> > > > And
> > > >
> > > >   PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1) in7(s0)) /* fully process 8 inputs */
> > > >   PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)) /* continue */
> > > >   PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2)) /* continue */
> > > >
> > > > is actually the expansion of the kernel loop
> > > >
> > > > FOR i=0 TO order-6 WITH i++
> > > >   PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3) in5(si+2) in6(si+1) in7(si+0))
> > > > END FOR
> > > >
> > > > The worst thing is that corr_QC[] is so sensitive that any extra
> > > > processing will make it wrong and propagate to the next loop
> > > > (next 8 inputs). state_QS[] is a little better but still very
> > > > sensitive. For instance, if we add PROC(in0(s11') in1(s10)
> > > > in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4)) to the kernel
> > > > loop (by looping one more time) and remove epilog 0, then all
> > > > the final results will be wrong.
> > > >
> > > > That's why the prolog and epilog cannot be saved, to the best of
> > > > my knowledge.
> > > >
> > > > The assembly size of silk_warped_autocorrelation_FIX_neon() is
> > > > about 2,744 bytes. Compared with the C code size (about 452
> > > > bytes), it's 2.3 KB larger. Considering that
> > > > silk_warped_autocorrelation_FIX_c() is the second most CPU-heavy
> > > > function in the fixed-point encoder, and that our testing shows
> > > > up to 7% CPU run time saving of the total encoder with this
> > > > optimization (at Complexity 8), maybe we can accept the I-cache
> > > > burden even if we ultimately cannot remove the big chunk of
> > > > prolog and epilog.
> > > >
> > > > Thanks,
> > > > Linfeng Zhang
> > > >
> > > > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin <jmvalin at jmvalin.ca> wrote:
> > > >
> > > > Hi Felicia,
> > > >
> > > > I've had time to work through the math in the original function
> > > > and I'm pretty sure it's possible to vectorize this without the
> > > > huge prologue/epilogue.
> > > >
> > > > For the simple case where order = vector size = N (but it should
> > > > easily generalize to larger order), what I came up with is:
> > > >
> > > > initialize X, Y, M, C to vector of zeros
> > > >
> > > > for i=0 to N+order
> > > >    T = [x(i), Y(0:N-2)]
> > > >    Y = M + coeff * (Y - T)
> > > >    M = T
> > > >    X = [x(i), X(0:N-1)]
> > > >    C = C + X*Y
> > > >
> > > > I think something similar to this (assuming I didn't mess up any
> > > > details) should give you the correlations in vector C. Did I
> > > > miss anything?
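> > > >
> > > > A direct scalar transcription of the above, just to be concrete
> > > > (float instead of the Q-format, N = 4, and x read as zero past
> > > > the end; untested):
> > > >
> > > > #define N 4
> > > > void autocorr_sketch(const float *x, int len, float coeff,
> > > >                      float *C)   /* N sums, zeroed by caller */
> > > > {
> > > >    float X[N] = {0}, Y[N] = {0}, M[N] = {0}, T[N];
> > > >    int i, j;
> > > >    for (i = 0; i < len + N; i++) {
> > > >       float xi = i < len ? x[i] : 0.0f;
> > > >       T[0] = xi;                     /* T = [x(i), Y(0:N-2)] */
> > > >       for (j = 1; j < N; j++)
> > > >          T[j] = Y[j - 1];
> > > >       for (j = 0; j < N; j++) {      /* Y = M + coeff*(Y - T) */
> > > >          Y[j] = M[j] + coeff * (Y[j] - T[j]);
> > > >          M[j] = T[j];                /* M = T */
> > > >       }
> > > >       for (j = N - 1; j > 0; j--)    /* X = [x(i), X(0:N-1)] */
> > > >          X[j] = X[j - 1];
> > > >       X[0] = xi;
> > > >       for (j = 0; j < N; j++)        /* C = C + X*Y */
> > > >          C[j] += X[j] * Y[j];
> > > >    }
> > > > }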
> > > >
> > > > Cheers,
> > > >
> > > > Jean-Marc
> > > >
> > > >
> > > > On 31/01/17 12:30 PM, Felicia Lim wrote:
> > > > > Hi,
> > > > >
> > > > > Attached is a patch with arm neon optimizations for
> > > > > silk_warped_autocorrelation_FIX(). Please review.
> > > > >
> > > > > Thanks,
> > > > > Felicia
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > opus mailing list
> > > > > opus at xiph.org
> > > > > http://lists.xiph.org/mailman/listinfo/opus
> > > > >
> > > > _______________________________________________
> > > > opus mailing list
> > > > opus at xiph.org
> > > > http://lists.xiph.org/mailman/listinfo/opus
> > > >
> > > >
> > >
> > >
> >
> >
> >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170405/edf6c309/attachment-0001.html>