[opus] [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
Linfeng Zhang
linfengz at google.com
Wed Apr 5 16:14:02 UTC 2017
I attached a new patch with small cleanup (disassembly is identical as the
last patch). We have done the same internal testing as usual.
Also, attached 2 failed temporary versions which try to reduce code size
(just for code review reference purpose).
The new patch of silk_warped_autocorrelation_FIX_neon() has a code size of
3,228 bytes (with gcc).
smaller_slower.c has a code size of 2,304 bytes, but the encoder is about
1.8% - 2.7% slower.
smallest_slowest.c has a code size of 1,656 bytes, but the encoder is about
2.3% - 3.6% slower.
Thanks,
Linfeng
On Mon, Apr 3, 2017 at 3:01 PM, Linfeng Zhang <linfengz at google.com> wrote:
> Hi Jean-Marc,
>
> Attached is the silk_warped_autocorrelation_FIX_neon() which implements
> your idea.
>
> Speed improvement vs the previous optimization:
>
> Complexity 0-4: Doesn't call this function. Complexity 5: 2.1% (order =
> 16) Complexity 6: 1.0% (order = 20) Complexity 8: 0.1% (order = 24)
> Complexity 10: 0.1% (order = 24)
>
> Code size of silk_warped_autocorrelation_FIX_neon() changes from 2,644
> bytes to 3,228 bytes.
>
> The reason of larger code size is that the new optimization specializes
> order 16, 20 and 24. If only keeping order 24 specialization, the code
> still works and the code size is smaller, but the encoder speed will drop
> 4.0% for Complexity 5 and 2.0% for Complexity 6. Anyway, the new code is
> easier to understand and maintain.
>
> Thanks,
>
> Linfeng
>
> On Tue, Feb 7, 2017 at 8:58 AM, Linfeng Zhang <linfengz at google.com> wrote:
>
>> Hi Jean-Marc,
>>
>> Thanks for your suggestions. Will get back to you once we have some
>> updates.
>>
>> Linfeng
>>
>> On Mon, Feb 6, 2017 at 5:47 PM, Jean-Marc Valin <jmvalin at jmvalin.ca>
>> wrote:
>>
>>> Hi Linfeng,
>>>
>>> On 06/02/17 07:18 PM, Linfeng Zhang wrote:
>>> > This is a great idea. But the order (psEncC->shapingLPCOrder) can be
>>> > configured to 12, 14, 16, 20 and 24 according to complexity parameter.
>>> >
>>> > It's hard to get a universal function to handle all these orders
>>> > efficiently. Any suggestions?
>>>
>>> I can think of two ways of handling larger orders. The obvious one is
>>> simply to add an inner loop of the form:
>>> for (i=0;i<order;i+=VECTOR_SIZE)
>>> I think what may be more efficient is to simply have a small "order-N"
>>> (N=4 or 8) kernel that not only computes the correlation of order N, but
>>> then spits out the signal after the N-stage all-pass is applied. The
>>> kernel would look like:
>>>
>>> void autocorr_kernel4(int *corr, int *orig, int *input, int *output, int
>>> len) {
>>> /* Implement vectorized order-4 filter (could also be order 8)
>>> as described in previous email and outputs the filtered signal.
>>> */
>>> }
>>>
>>> and then the full function would run the kernel multiple times and look
>>> like:
>>>
>>> void full_autocorr(int *corr, int *orig, int len, int order) {
>>> int i;
>>> int tmp[MAX_SIZE];
>>> int *in = orig;
>>> for (i=0;i<order;i+=4) {
>>> autocorr_kernel4(corr+i, orig, in, tmp, len);
>>> /* Make subsequent calls use the filtered signal as input. */
>>> in = tmp;
>>> }
>>> }
>>>
>>> I think the should not only reduce/eliminate the prologue/epilogue
>>> problem, but it should also be more efficient since almost all vectors
>>> processed would use the full size.
>>>
>>> Maybe a third option (not sure it's a good idea, but still mentioning
>>> it) would be to have a function that hardcodes order=24 and discards the
>>> larger values that aren't needed. Since the smallest order seems to be
>>> 16, it wouldn't be much of a waste and the code might end up running
>>> faster for the higher orders.
>>>
>>> Cheers,
>>>
>>> Jean-Marc
>>>
>>>
>>> > Thanks,
>>> > Linfeng
>>> >
>>> > On Mon, Feb 6, 2017 at 12:40 PM, Jean-Marc Valin <jmvalin at jmvalin.ca
>>> > <mailto:jmvalin at jmvalin.ca>> wrote:
>>> >
>>> > Hi Linfeng,
>>> >
>>> > On 06/02/17 02:51 PM, Linfeng Zhang wrote:
>>> > > However, the critical thing is that all the states in each stage
>>> when
>>> > > processing input[i] are reused by the next input[i+1]. That is
>>> > > input[i+1] must wait input[i] for 1 stage, and input[i+2] must
>>> wait
>>> > > input[i+1] for 1 stage, etc.
>>> >
>>> > That is indeed the tricky part... and the one I think you could do
>>> > slightly differently. If you approach the problem in terms of
>>> computing
>>> > chunks of the inputs N samples at a time, then indeed the approach
>>> you
>>> > are describing is the only solution. What I was proposing though
>>> is to
>>> > instead chop the "order" in chunks of N. Using your notation, you
>>> would
>>> > be doing:
>>> >
>>> > PROC(
>>> in0(s0))
>>> > PROC( in0(s1)
>>> in1(s0))
>>> > PROC( in0(s2) in1(s1)
>>> in2(s0))
>>> > PROC( in0(s3) in1(s2) in2(s1)
>>> in3(s0))
>>> > PROC( in0(s4) in1(s3) in2(s2) in3(s1)
>>> in4(s0))
>>> > PROC( in0(s5) in1(s4) in2(s3) in3(s2) in4(s1)
>>> in5(s0))
>>> > PROC( in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1)
>>> in6(s0))
>>> > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
>>> in7(s0))
>>> > PROC(in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2) in7(s1)
>>> in8(s0))
>>> > PROC(in2(s7) in3(s6) in4(s5) in5(s4) in6(s3) in7(s2) in8(s1)
>>> in9(s0))
>>> > PROC(in3(s7) in4(s6) in5(s5) in6(s4) in7(s3) in8(s2)
>>> in9(s1)in10(s0))
>>> > PROC(in4(s7) in5(s6) in6(s5) in7(s4) in8(s3)
>>> in9(s2)in10(s1)in11(s0))
>>> > ...and so on until the end of the input vector
>>> >
>>> > The difference is that it's now the input vector that "slides" and
>>> the
>>> > "state" values sy that remain in the same place. There's still a
>>> > prologue, but you can easily get rid of it by (implicitly)
>>> zero-padding
>>> > the in vector during the initialization phase (start with a zero
>>> vector
>>> > and real one value at a time). Getting rid of the epilogue is a
>>> little
>>> > trickier, but I think it can be done.
>>> >
>>> > Cheers,
>>> >
>>> > Jean-Marc
>>> >
>>> > > Then it becomes this
>>> > >
>>> > > FOR in=0 to N WITH in+=8
>>> > > PROC(in0(s0)) /* prolog 0 */
>>> > > PROC(in0(s1) in1(s0)) /* prolog 1 */
>>> > > PROC(in0(s2) in1(s1) in2(s0)) /* prolog 2 */
>>> > > PROC(in0(s3) in1(s2) in2(s1) in3(s0)) /* prolog 3 */
>>> > > PROC(in0(s4) in1(s3) in2(s2) in3(s1) in4(s0)) /* prolog 4 */
>>> > > PROC(in0(s5) in1(s4) in2(s3) in3(s2) in4(s1) in5(s0)) /*
>>> prolog 5 */
>>> > > PROC(in0(s6) in1(s5) in2(s4) in3(s3) in4(s2) in5(s1) in6(s0))
>>> /*
>>> > > prolog 6 */
>>> > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
>>> > in7(s0))
>>> > > /* fully process 8 inputs */
>>> > > PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2)
>>> > in7(s1))
>>> > > /* continue */
>>> > > PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3)
>>> > in7(s2))
>>> > > /* continue */
>>> > > PROC(in0(s10) in1(s9) in2(s8) in3(s7) in4(s6) in5(s5) in6(s4)
>>> > in7(s3))
>>> > > /* continue */
>>> > > PROC(in1(s10) in2(s9) in3(s8) in4(s7) in5(s6) in6(s5) in7(s4))
>>> /*
>>> > > epilog 0 */
>>> > > PROC(in2(s10) in3(s9) in4(s8) in5(s7) in6(s6) in7(s5)) /*
>>> epilog
>>> > 1 */
>>> > > PROC(in3(s10) in4(s9) in5(s8) in6(s7) in7(s6)) /* epilog 2 */
>>> > > PROC(in4(s10) in5(s9) in6(s8) in7(s7)) /* epilog 3 */
>>> > > PROC(in5(s10) in6(s9) in7(s8)) /* epilog 4 */
>>> > > PROC(in6(s10) in7(s9)) /* epilog 5 */
>>> > > PROC(in7(s10)) /* epilog 6 */
>>> > > END FOR
>>> > >
>>> > > And
>>> > > PROC(in0(s7) in1(s6) in2(s5) in3(s4) in4(s3) in5(s2) in6(s1)
>>> > in7(s0))
>>> > > /* fully process 8 inputs */
>>> > > PROC(in0(s8) in1(s7) in2(s6) in3(s5) in4(s4) in5(s3) in6(s2)
>>> > in7(s1))
>>> > > /* continue */
>>> > > PROC(in0(s9) in1(s8) in2(s7) in3(s6) in4(s5) in5(s4) in6(s3)
>>> > in7(s2))
>>> > > /* continue */
>>> > > is actually the expansion of the kernel loop
>>> > > FOR i=0 TO order-6 WITH i++
>>> > > PROC(in0(si+7) in1(si+6) in2(si+5) in3(si+4) in4(si+3)
>>> in5(si+2)
>>> > > in6(si+1) in7(si+0))
>>> > > END FOR
>>> > >
>>> > > The worst thing is that corr_QC[] is so sensitive that any extra
>>> > > processing will make them wrong and propagate to the next loop
>>> (next 8
>>> > > inputs). state_QS[] is a little better but still very sensitive.
>>> For
>>> > > instance, if adding PROC(in0(s11') in1(s10) in2(s9) in3(s8)
>>> in4(s7)
>>> > > in5(s6) in6(s5) in7(s4)) to the kernel loop (by looping one more
>>> time)
>>> > > and remove epilog 0, then all final results will be wrong.
>>> > >
>>> > > That's why the prolog and epilog cannot be saved to the best of
>>> my
>>> > > knowledge.
>>> > >
>>> > > The assembly size of silk_warped_autocorrelation_FIX_neon() is
>>> about
>>> > > 2,744 bytes. Compared with the C code size (about 452 bytes),
>>> it's 2.3
>>> > > KB larger. Considering silk_warped_autocorrelation_FIX_c() is
>>> the
>>> > second
>>> > > place CPU heavy function in fixed-point, and our testing shows up
>>> > to 7%
>>> > > CPU run time saving of the total encoder with this optimization
>>> (at
>>> > > Complexity 8), maybe we can take the I-cache burden even if
>>> finally we
>>> > > still cannot remove the big chunk of prolog and epilog.
>>> > >
>>> > > Thanks,
>>> > > Linfeng Zhang
>>> > >
>>> > > On Sat, Feb 4, 2017 at 4:17 PM, Jean-Marc Valin
>>> > <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>>> > > <mailto:jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>>> wrote:
>>> > >
>>> > > Hi Felicia,
>>> > >
>>> > > I've had time to work through the math in the original
>>> > function and I'm
>>> > > pretty sure it's possible to vectorize this without the huge
>>> > > prologue/epilogue.
>>> > >
>>> > > For the simple case where order = vector size = N (but it
>>> > should easily
>>> > > generalize to larger order), what I came up with is:
>>> > >
>>> > > initialize X, Y, M, C to vector of zeros
>>> > >
>>> > > for i=0 to N+order
>>> > > T = [x(i), Y(0:N-2)]
>>> > > Y = M + coeff * (Y - T)
>>> > > M = T
>>> > > X = [x(i), X(0:N-1)]
>>> > > C = C + X*Y
>>> > >
>>> > > I think something similar to this (assuming I didn't mess up
>>> any
>>> > > details) should give you the correlations in vector C. Did I
>>> miss
>>> > > anything?
>>> > >
>>> > > Cheers,
>>> > >
>>> > > Jean-Marc
>>> > >
>>> > >
>>> > > On 31/01/17 12:30 PM, Felicia Lim wrote:
>>> > > > Hi,
>>> > > >
>>> > > > Attached is a patch with arm neon optimizations for
>>> > > > silk_warped_autocorrelation_FIX(). Please review.
>>> > > >
>>> > > > Thanks,
>>> > > > Felicia
>>> > > >
>>> > > >
>>> > > > _______________________________________________
>>> > > > opus mailing list
>>> > > > opus at xiph.org <mailto:opus at xiph.org> <mailto:opus at xiph.org
>>> > <mailto:opus at xiph.org>>
>>> > > > http://lists.xiph.org/mailman/listinfo/opus
>>> > <http://lists.xiph.org/mailman/listinfo/opus>
>>> > > <http://lists.xiph.org/mailman/listinfo/opus
>>> > <http://lists.xiph.org/mailman/listinfo/opus>>
>>> > > >
>>> > > _______________________________________________
>>> > > opus mailing list
>>> > > opus at xiph.org <mailto:opus at xiph.org> <mailto:opus at xiph.org
>>> > <mailto:opus at xiph.org>>
>>> > > http://lists.xiph.org/mailman/listinfo/opus
>>> > <http://lists.xiph.org/mailman/listinfo/opus>
>>> > > <http://lists.xiph.org/mailman/listinfo/opus
>>> > <http://lists.xiph.org/mailman/listinfo/opus>>
>>> > >
>>> > >
>>> >
>>> >
>>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170405/41265d1d/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Optimize-silk_warped_autocorrelation_FIX-for-ARM-NEO.patch
Type: text/x-patch
Size: 34156 bytes
Desc: not available
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170405/41265d1d/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smaller_slower.c
Type: text/x-csrc
Size: 11374 bytes
Desc: not available
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170405/41265d1d/attachment-0002.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smallest_slowest.c
Type: text/x-csrc
Size: 10850 bytes
Desc: not available
URL: <http://lists.xiph.org/pipermail/opus/attachments/20170405/41265d1d/attachment-0003.c>
More information about the opus
mailing list