<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, Apr 24, 2017 at 5:52 PM, Jean-Marc Valin <span dir="ltr"><<a href="mailto:jmvalin@jmvalin.ca" target="_blank">jmvalin@jmvalin.ca</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">On 24/04/17 08:03 PM, Linfeng Zhang wrote:<br>
> Tested on my chromebook, when stride (channel) == 1, the optimization<br>
> has no gain compared with C function.<br>
<br>
</span>You mean that the Neon code is the same speed as the C code for<br>
stride==1? This is not terribly surprising for an IIR filter.<br></blockquote><div><br></div><div>Yes</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<span class="gmail-"><br>
> When stride (channel) == 2, the optimization is 1.2%-1.8% faster (1.6%<br>
> at Complexity 8) compared with C function.<br>
<br>
</span>Is that gain due to Neon or simply due to computing two channels in<br>
parallel? For example, if you make a special case in the C code to<br>
handle both channels in the same loop, what kind of performance do you get?<br></blockquote><div><br></div><div>Tested at Complexity 8: the gain splits about evenly, i.e., 0.8% faster when handling both channels in the same loop in C, and an additional 0.8% faster using NEON.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<span class="gmail-"><br>
> Please let me know and I can remove the optimization of stride 1 case.<br>
<br>
</span>Yeah, if there's Neon code that provides no improvement over C, let's<br>
stick with C. And if you manage to write C code that has the same<br>
performance as the Neon code, then that would also be better (both<br>
easier to maintain and more portable).<br></blockquote><div><br></div><div>Will do.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<span class="gmail-"><br>
> If it's allowed to skip the split of A_Q28 and replace by 32-bit<br>
> multiplication (result is 64-bit), probably it could be faster on NEON.<br>
> This may change the encoder results because of different order of<br>
> adding, shifting and rounding.<br>
<br>
</span>I'm not sure what you mean by that.<br></blockquote><div><br></div><div><div><font face="monospace, monospace"> /* Negate A_Q28 values and split in two parts */</font></div><div><font face="monospace, monospace"> A0_L_Q28 = ( -A_Q28[ 0 ] ) & 0x00003FFF; /* lower part */</font></div><div><font face="monospace, monospace"> A0_U_Q28 = silk_RSHIFT( -A_Q28[ 0 ], 14 ); /* upper part */</font></div><div><font face="monospace, monospace"> A1_L_Q28 = ( -A_Q28[ 1 ] ) & 0x00003FFF; /* lower part */</font></div><div><font face="monospace, monospace"> A1_U_Q28 = silk_RSHIFT( -A_Q28[ 1 ], 14 ); /* upper part */</font></div></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace"> ...</font></div><div><font face="monospace, monospace"><br></font></div><div><div><font face="monospace, monospace"> S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A0_L_Q28 ), 14 );</font></div><div><font face="monospace, monospace"> S[ 0 ] = silk_SMLAWB( S[ 0 ], out32_Q14, A0_U_Q28 );</font></div><div><font face="monospace, monospace"> S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], inval );</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace"> S[ 1 ] = silk_RSHIFT_ROUND( silk_SMULWB( out32_Q14, A1_L_Q28 ), 14 );</font></div><div><font face="monospace, monospace"> S[ 1 ] = silk_SMLAWB( S[ 1 ], out32_Q14, A1_U_Q28 );</font></div><div><font face="monospace, monospace"> S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], inval );</font></div></div><div><br></div><div>A_Q28 is split into two parts (a 14-bit lower part and the remaining upper part) so that each multiplication stays within 32 bits. NEON can do a 32-bit x 32-bit = 64-bit multiply using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', which could be faster and introduce fewer rounding/shifting errors than the C code above.
But it may make things harder for other CPUs that do not support a 32-bit x 32-bit = 64-bit multiplication.</div><div><br></div><div>Thanks,</div><div>Linfeng</div></div></div></div>
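<div>To make the A_Q28 point concrete, here is a minimal scalar C sketch of the two approaches. It is an illustration under stated assumptions, not the SILK source: the helper names (smulwb, rshift_round, split_mul, wide_mul) are hypothetical stand-ins, the coefficient is assumed to lie in [-2^29, 2^29) so that its upper part fits in 16 bits, and right shifts of negative values are assumed to be arithmetic. wide_mul is the scalar analogue of what one lane of vmull_s32 followed by a shift would compute.</div>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for silk_SMULWB: (a32 * (int16_t)b32) >> 16.
   Assumes arithmetic right shift of negative values. */
static int32_t smulwb(int32_t a32, int32_t b32)
{
    return (int32_t)(((int64_t)a32 * (int16_t)b32) >> 16);
}

/* Hypothetical stand-in for silk_RSHIFT_ROUND, valid for shift > 1. */
static int32_t rshift_round(int32_t a, int shift)
{
    return ((a >> (shift - 1)) + 1) >> 1;
}

/* Split approach: a (the negated A_Q28 coefficient) is cut into a
   14-bit lower part and an upper part, multiplied separately. */
static int32_t split_mul(int32_t x_Q14, int32_t a_Q28)
{
    int32_t a_l, a_u, r;

    /* Assumed range, so that the upper part fits in 16 bits. */
    assert(a_Q28 >= -(1 << 29) && a_Q28 < (1 << 29));

    a_l = a_Q28 & 0x00003FFF;   /* lower part */
    a_u = a_Q28 >> 14;          /* upper part */

    r  = rshift_round(smulwb(x_Q14, a_l), 14);
    r += smulwb(x_Q14, a_u);
    return r;
}

/* Wide approach: one 32-bit x 32-bit = 64-bit multiply, then a single
   truncating shift from Q42 down to Q12. */
static int32_t wide_mul(int32_t x_Q14, int32_t a_Q28)
{
    return (int32_t)(((int64_t)x_Q14 * a_Q28) >> 30);
}
```

<div>The split version rounds the lower-part product and truncates the upper-part product, while the wide version truncates once, so the two results may differ in the least significant bit. That is the change in encoder output mentioned above.</div>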