<div dir="ltr">Ping for comments.<div><br></div><div>Thanks,</div><div>Linfeng</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 26, 2017 at 2:15 PM, Linfeng Zhang <span dir="ltr"><<a href="mailto:linfengz@google.com" target="_blank">linfengz@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On Tue, Apr 25, 2017 at 10:31 PM, Jean-Marc Valin <span dir="ltr"><<a href="mailto:jmvalin@jmvalin.ca" target="_blank">jmvalin@jmvalin.ca</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="m_-6581151962023120546gmail-m_-3795214089794457909gmail-"><br>
> A_Q28 is split to 2 14-bit (or 16-bit, whatever) integers, to make the<br>
> multiplication operation within 32-bits. NEON can do 32-bit x 32-bit =<br>
> 64-bit using 'int64x2_t vmull_s32(int32x2_t a, int32x2_t b)', and it<br>
> could possibly be faster and less rounding/shifting errors than above C<br>
> code. But it may increase difficulties for other CPUs not supporting<br>
> 32-bit multiplication.<br>
<br>
</span>OK, so I'm not totally opposed to that, but it increases the<br>
testing/maintenance cost so it needs to be worth it. So the question is<br>
how much speedup can you get and how close you can make the result to<br>
the original function. If you can make the output be always within one<br>
of two LSBs of the C version, then the asm check can simply be a little<br>
bit more lax than usual. Otherwise it becomes more complicated. This<br>
isn't a function that scares me too much about going non-bitexact, but<br>
it's also not one of the big complexity costs either. In any case, let<br>
me know what you find.<br></blockquote><div><br></div></span><div>Tested the proposed NEON optimization change, it can increase the whole encoder speed by about 1% at Complexity 8 and 10 for stride = 2 (compared to the NEON optimization in the previous patch), and 0.7% at Complexity 8 and 1.1% at Complexity 10 for stride = 1 (compared to the original C code).</div><div><br></div><div>Unfortunately, the truncating difference accumulates in S[] and its difference cannot be easily bounded. The difference in out[] may somehow be bounded to 5 in my quick testing, but is not guaranteed to other inputs. So maybe comparing bit exactness with the following <span style="font-family:monospace,monospace;font-size:x-small">silk_biquad_alt_c_<wbr>MulSingleAQ28()</span> is better.</div><div><br></div><div>Please let me know the decision (whether keeping the original NEON (stride 2 only) or choosing the new NEON (both stride 1 and 2) which optimizes following <span style="font-family:monospace,monospace;font-size:x-small">silk_biquad_alt_c_<wbr>MulSingleAQ28()</span>), and I'll wrap up the patch.</div><div><br></div><div>Here attached the corresponding C code <span style="font-family:monospace,monospace;font-size:x-small">silk_biquad_alt_c_<wbr>MulSingleAQ28()</span> for your information. It uses 64-bit multiplications and is about 0.6% - 0.9% slower (for the whole encoder) than the original C code using 32-bit multiplications (for both strides and the same stride 2 unrolling).</div><div><br></div><div><div><font face="monospace, monospace" size="1">void silk_biquad_alt_c_<wbr>MulSingleAQ28(</font></div><div><font face="monospace, monospace" size="1">  const opus_int16       *in,         /* I   input signal                        */</font></div><div><font face="monospace, monospace" size="1">  const opus_int32       *B_Q28,       /* I   MA coefficients [3]                     */</font></div><div><font face="monospace, monospace" size="1">  const opus_int32       *A_Q28,       /* I   AR coefficients [2]                     */</font></div><div><font face="monospace, monospace" size="1">  opus_int32          *S,         /* I/O  State vector [2*stride]                   */</font></div><div><font face="monospace, monospace" size="1">  opus_int16          *out,        /* O   output signal                        */</font></div><div><font face="monospace, monospace" size="1">  const opus_int32       len,         /* I   signal length (must be even)                */</font></div><div><font face="monospace, monospace" size="1">  opus_int           stride        /* I   Operate on interleaved signal if > 1            */</font></div><div><font face="monospace, monospace" size="1">)</font></div><div><font face="monospace, monospace" size="1">{</font></div><div><font face="monospace, monospace" size="1">  /* DIRECT FORM II TRANSPOSED (uses 2 element state vector) */</font></div><div><font face="monospace, monospace" size="1">  opus_int  k;</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">  silk_assert( ( stride == 1 ) || ( stride == 2 ) );</font></div><div><font size="1"><br></font></div><div><font face="monospace, monospace" size="1">  if( stride == 1) {</font></div><div><font face="monospace, monospace" size="1">    opus_int32 out32_Q14;</font></div><div><font face="monospace, monospace" size="1">    for( k = 0; k < len; k++ ) {</font></div><div><font face="monospace, monospace" size="1">      /* S[ 0 ], S[ 1 ]: Q12 */</font></div><div><font face="monospace, monospace" size="1">      out32_Q14 = silk_LSHIFT( silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] ), 2 );</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font color="#ff0000" face="monospace, monospace" size="1">      S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ 0 ]), 30 );</font></div><div><font face="monospace, monospace" size="1">      S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], in[ k ] );</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">      <font color="#ff0000">S[ 1 ] = silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ 1 ]) , 30 );</font></font></div><div><font face="monospace, monospace" size="1">      S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], in[ k ] );</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">      /* Scale back to Q0 and saturate */</font></div><div><font face="monospace, monospace" size="1">      out[ k ] = (opus_int16)silk_SAT16( silk_RSHIFT( out32_Q14 + (1<<14) - 1, 14 ) );</font></div><div><font face="monospace, monospace" size="1">    }</font></div><div><font face="monospace, monospace" size="1">  } else {</font></div><div><font face="monospace, monospace" size="1">    opus_int32 out32_Q14[ 2 ];</font></div><div><font face="monospace, monospace" size="1">    for( k = 0; k < len; k++ ) {</font></div><div><font face="monospace, monospace" size="1">      /* S[ 0 ], S[ 1 ]: Q12 */</font></div><div><font face="monospace, monospace" size="1">      out32_Q14[ 0 ] = silk_LSHIFT( silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k * 2 + 0 ] ), 2 );</font></div><div><font face="monospace, monospace" size="1">      out32_Q14[ 1 ] = silk_LSHIFT( silk_SMLAWB( S[ 2 ], B_Q28[ 0 ], in[ k * 2 + 1 ] ), 2 );</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font color="#ff0000" face="monospace, monospace" size="1">      S[ 0 ] = S[ 1 ] + silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 0 ] * (-A_Q28[ 0 ]), 30 );</font></div><div><font color="#ff0000" face="monospace, monospace" size="1">      S[ 2 ] = S[ 3 ] + silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 1 ] * (-A_Q28[ 0 ]), 30 );</font></div><div><font face="monospace, monospace" size="1">      S[ 0 ] = silk_SMLAWB( S[ 0 ], B_Q28[ 1 ], in[ k * 2 + 0 ] );</font></div><div><font face="monospace, monospace" size="1">      S[ 2 ] = silk_SMLAWB( S[ 2 ], B_Q28[ 1 ], in[ k * 2 + 1 ] );</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font color="#ff0000" face="monospace, monospace" size="1">      S[ 1 ] = silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 0 ] * (-A_Q28[ 1 ]), 30 );</font></div><div><font color="#ff0000" face="monospace, monospace" size="1">      S[ 3 ] = silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ 1 ] * (-A_Q28[ 1 ]), 30 );</font></div><div><font face="monospace, monospace" size="1">      S[ 1 ] = silk_SMLAWB( S[ 1 ], B_Q28[ 2 ], in[ k * 2 + 0 ] );</font></div><div><font face="monospace, monospace" size="1">      S[ 3 ] = silk_SMLAWB( S[ 3 ], B_Q28[ 2 ], in[ k * 2 + 1 ] );</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">      /* Scale back to Q0 and saturate */</font></div><div><font face="monospace, monospace" size="1">      out[ k * 2 + 0 ] = (opus_int16)silk_SAT16( silk_RSHIFT( out32_Q14[ 0 ] + (1<<14) - 1, 14 ) );</font></div><div><font face="monospace, monospace" size="1">      out[ k * 2 + 1 ] = (opus_int16)silk_SAT16( silk_RSHIFT( out32_Q14[ 1 ] + (1<<14) - 1, 14 ) );</font></div><div><font face="monospace, monospace" size="1">    }</font></div><div><font face="monospace, monospace" size="1">  }</font></div><div><font face="monospace, monospace" size="1">}</font></div></div><div><br></div><div>Here is the NEON kernels which uses <span style="font-family:monospace,monospace;font-size:x-small">vqrdmulh_lane_s32()</span><font face="arial, helvetica, sans-serif"> to do the multiplication and rounding, where A_Q28_s32x{2,4} stores doubled -A_Q28[]:</font></div><div><br></div><div><div><div><font face="monospace, monospace" size="1">static inline void silk_biquad_alt_stride1_kernel<wbr>(const int32x2_t A_Q28_s32x2, const int32x4_t t_s32x4, int32x2_t *S_s32x2, int32x2_t *out32_Q14_s32x2)</font></div><div><font face="monospace, monospace" size="1">{</font></div><div><font face="monospace, monospace" size="1">  int32x2_t t_s32x2;</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">  *out32_Q14_s32x2 = vadd_s32(*S_s32x2, vget_low_s32(t_s32x4));               /* silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] )                 */</font></div><div><font face="monospace, monospace" size="1">  *S_s32x2     = vreinterpret_s32_u64(vshr_n_u6<wbr>4(vreinterpret_u64_s32(*S_s32x<wbr>2), 32)); /* S[ 0 ] = S[ 1 ]; S[ 1 ] = 0;                        */</font></div><div><font face="monospace, monospace" size="1">  *out32_Q14_s32x2 = vshl_n_s32(*out32_Q14_s32x2, 2);                    /* out32_Q14 = silk_LSHIFT( silk_SMLAWB( S[ 0 ], B_Q28[ 0 ], in[ k ] ), 2 ); */</font></div><div><font face="monospace, monospace" size="1">  t_s32x2      = <font color="#ff0000">vqrdmulh_lane_s32</font>(A_Q28_s32x2, *out32_Q14_s32x2, 0);          /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14 * (-A_Q28[ {0,1} ]), 30 )     */</font></div><div><font face="monospace, monospace" size="1">  *S_s32x2     = vadd_s32(*S_s32x2, t_s32x2);                      /* S[ {0,1} ] = {S[ 1 ],0} + silk_RSHIFT_ROUND( );              */</font></div><div><font face="monospace, monospace" size="1">  *S_s32x2     = vadd_s32(*S_s32x2, vget_high_s32(t_s32x4));              /* S[ {0,1} ] = silk_SMLAWB( S[ {0,1} ], B_Q28[ {1,2} ], in[ k ] );      */</font></div><div><font face="monospace, monospace" size="1">}</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">static inline void silk_biquad_alt_stride2_kernel<wbr>(const int32x4_t A_Q28_s32x4, const int32x4_t B_Q28_s32x4, const int32x2_t t_s32x2, const int32x4_t inval_s32x4, int32x4_t *S_s32x4, int32x2_t *out32_Q14_s32x2)</font></div><div><font face="monospace, monospace" size="1">{</font></div><div><font face="monospace, monospace" size="1">  int32x4_t t_s32x4, out32_Q14_s32x4;</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">  *out32_Q14_s32x2 = vadd_s32(vget_low_s32(*S_s32x4<wbr>), t_s32x2);       /* silk_SMLAWB( S{0,1}, B_Q28[ 0 ], in[ k * 2 + {0,1} ] )                    */</font></div><div><font face="monospace, monospace" size="1">  *S_s32x4     = vcombine_s32(vget_high_s32(*S_<wbr>s32x4), vdup_n_s32(0)); /* S{0,1} = S{2,3}; S{2,3} = 0;                                 */</font></div><div><font face="monospace, monospace" size="1">  *out32_Q14_s32x2 = vshl_n_s32(*out32_Q14_s32x2, 2);            /* out32_Q14_{0,1} = silk_LSHIFT( silk_SMLAWB( S{0,1}, B_Q28[ 0 ], in[ k * 2 + {0,1} ] ), 2 );  */</font></div><div><font face="monospace, monospace" size="1">  out32_Q14_s32x4  = vcombine_s32(*out32_Q14_s32x2, *out32_Q14_s32x2);   /* out32_Q14_{0,1,0,1}                                      */</font></div><div><font face="monospace, monospace" size="1">  t_s32x4      = <font color="#ff0000">vqrdmulhq_s32</font>(out32_Q14_s32x4, A_Q28_s32x4);      /* silk_RSHIFT_ROUND( (opus_int64)out32_Q14[ {0,1,0,1} ] * (-A_Q28[ {0,0,1,1} ]), 30 )      */</font></div><div><font face="monospace, monospace" size="1">  *S_s32x4     = vaddq_s32(*S_s32x4, t_s32x4);             /* S[ {0,1,2,3} ] = {S[ {2,3} ],0,0} + silk_RSHIFT_ROUND( );                   */</font></div><div><font face="monospace, monospace" size="1">  t_s32x4      = vqdmulhq_s32(inval_s32x4, B_Q28_s32x4);        /* silk_SMULWB(B_Q28[ {1,1,2,2} ], in[ k * 2 + {0,1,0,1} ] )                   */</font></div><div><font face="monospace, monospace" size="1">  *S_s32x4     = vaddq_s32(*S_s32x4, t_s32x4);             /* S[ {0,1,2,3} ] = silk_SMLAWB( S[ {0,1,2,3} ], B_Q28[ {1,1,2,2} ], in[ k * 2 + {0,1,0,1} ] ); */</font></div><div><font face="monospace, monospace" size="1">}</font></div></div></div><div></div></div><br></div><div class="gmail_extra">Thanks,</div><div class="gmail_extra">Linfeng</div></div>
</blockquote></div><br></div>