<div dir="ltr">Attached is a new patch that fixes a compile error.<div><br></div><div>Thanks,</div><div>Linfeng</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 11, 2017 at 4:07 PM, Linfeng Zhang <span dir="ltr">&lt;<a href="mailto:linfengz@google.com" target="_blank">linfengz@google.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Hi Jean-Marc,</div><div><br></div><div>Thanks for your suggestions!</div><div><br></div><div>I attached the new patch, with inline replies below.</div><div><br></div>Thanks,<div>Linfeng<br><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Thu, Apr 6, 2017 at 12:55 PM, Jean-Marc Valin <span dir="ltr">&lt;<a href="mailto:jmvalin@jmvalin.ca" target="_blank">jmvalin@jmvalin.ca</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I did some profiling on a Cortex A57 and I've been seeing slightly less<br>

improvement than you're reporting, more like 3.5% at complexity 8. It<br>

appears that the warped autocorrelation function itself is only faster<br>

by a factor of about 1.35. That's a bit surprising considering I see<br>

nothing obviously wrong with the code.<br></blockquote><div><br></div></span><div>I speed-tested the new patch and measured about a 7.8% speed gain for the whole encoder at complexity 8 on my Acer Chromebook.</div><div>Here is my configure command:</div><div>./configure --build x86_64-unknown-linux-gnu --host arm-linux-gnueabihf --disable-assertions --enable-fixed-point --enable-intrinsics CFLAGS=-O3 --disable-shared<br></div><div><br></div><div>The choice of test speech file may also affect the speed results.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">1) In calc_state(), rather than splitting the multiply in two<br>

instructions, you may be able to simply shift the warping left 16 bits,<br>

then use the Neon instruction that does a*b>>32 (i.e. the one that<br>

computes the top bits of a 32x32 multiply)<br></blockquote><div><br></div></span><div>Done.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

2) If the problem is with the movs at the end of each iteration, then<br>

you should be able to get rid of them by unrolling by a factor of two.<br></blockquote><div><br></div></span><div><div>We tried this previously and got some gains, but the code size was much bigger, so we abandoned it. I tested it again on the new code and got no speed gain.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">3) It seems likely that you have significant register spill going on due<br>to processing up to 24 "taps" at the same time. If that's causing a<br>slowdown, then it should be possible to do the processing in "sections".<br>By that, I mean that you can implement (e.g.) an order-8 "kernel" that<br>computes the correlations and also outputs the last element of<br>state_QS_s32x4[0][0] back to input_QS[], so that it can be used to<br>compute a new section.<br></blockquote><div><br></div></span><div>Done. The speed is almost identical (slightly slower), but as a bonus the code size is smaller.</div></div><span class=""><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">4) It's a minor detail, but the last element of corr_QC[] that's not<br>

currently vectorized could simply be vectorized independently outside<br>

the loop (and it's the same for all orders).<br></blockquote><div><br></div></span><div>Done.</div></div></div></div></div>

</blockquote></div><br></div>
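<div>For reference, suggestion 1) above (shift the warping coefficient left by 16 bits, then keep only the top bits of a 32x32 multiply) can be sketched in portable scalar C. This is only an illustration under stated assumptions, not the code in the patch: the type aliases and the function names mul_q16_ref, preshift_warping, mul_hi32 and count_mismatches are hypothetical, and the sketch assumes the warping coefficient fits in 16 bits. Note also that Neon's saturating doubling high-half multiplies (the vqdmulh family) double the product, which intrinsic code would have to account for; the sketch does not model that.</div>
<pre>
```c
/* Hypothetical fixed-width aliases (not Opus types); assumes 16-bit short,
   32-bit int and 64-bit long long, as on common ARM and x86 toolchains. */
typedef short i16;
typedef int i32;
typedef long long i64;
typedef char i32_must_be_4_bytes[sizeof(i32) == 4 ? 1 : -1];

/* Reference form: (diff * warping_Q16) >> 16.
   Relies on arithmetic right shift of negative values, which mainstream
   compilers provide but ISO C leaves implementation-defined. */
static i32 mul_q16_ref(i32 diff, i16 warping_Q16)
{
    return (i32)(((i64)diff * warping_Q16) >> 16);
}

/* Pre-shift the coefficient left by 16 bits once, outside the hot loop.
   Written as a multiply because left-shifting a negative value is
   undefined behaviour in C. */
static i32 preshift_warping(i16 warping_Q16)
{
    return (i32)warping_Q16 * 65536;
}

/* Top 32 bits of the 64-bit product, i.e. a*b >> 32. */
static i32 mul_hi32(i32 a, i32 b)
{
    return (i32)(((i64)a * b) >> 32);
}

/* Returns how many sample inputs make the two forms disagree
   (0 means they matched on every sample tried). */
static int count_mismatches(i16 warping_Q16)
{
    static const i32 samples[] = {
        0, 1, -1, 65535, -65536, 123456789, -123456789,
        2147483647, -2147483647 - 1
    };
    const i32 warping_Q32 = preshift_warping(warping_Q16);
    int k, bad = 0;

    for (k = 0; k < (int)(sizeof(samples) / sizeof(samples[0])); k++) {
        if (mul_q16_ref(samples[k], warping_Q16) !=
            mul_hi32(samples[k], warping_Q32)) {
            bad++;
        }
    }
    return bad;
}
```
</pre>
<div>Calling count_mismatches() with a few coefficients (for example 0, 32767 and -32768) is a quick way to check that the pre-shifted form matches the reference on a given toolchain before relying on it in vector code.</div>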