<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    It's good to know that the compiler is dealing with this properly. I
    withdraw my suggestion.<br>
    <br>
    WebRTC dealt with this issue using conditional compilation keyed on
    whether the "__aarch64__" macro was defined, but that was more
    than a year ago with older compilers.<br>
    <br>
    This all would have been much simpler if ARM had just supplied ARMv7
    macros for the "high" intrinsics in their arm_neon.h header.<br>
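    <br>
    For illustration only, a minimal sketch of what such ARMv7 macros
    might look like, assuming GCC/Clang-style predefined macros
    (__ARM_NEON, __aarch64__). The helper dot8_s32_s16 and the
    SKETCH_HAVE_NEON flag are hypothetical names, not code from Opus or
    WebRTC, and the scalar branch is there only so the sketch also builds
    on machines without NEON.<br>
    <br>

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1

#if !defined(__aarch64__)
/* Assumption: on 32-bit ARM, arm_neon.h does not declare the "high" forms,
 * so they are spelled with vget_high_*, which is free on ARMv7. */
#define vmovl_high_s16(a) vmovl_s16(vget_high_s16(a))
#define vmlal_high_s32(acc, a, b) \
    vmlal_s32((acc), vget_high_s32(a), vget_high_s32(b))
#endif
#endif

/* Hypothetical helper: returns the sum of a[i] * coef[i] over 8 taps. */
static int64_t dot8_s32_s16(const int32_t a[8], const int16_t coef[8])
{
#ifdef SKETCH_HAVE_NEON
    /* Widen the eight 16-bit coefficients to two vectors of 32-bit lanes. */
    int16x8_t coef16 = vld1q_s16(coef);
    int32x4_t coef0 = vmovl_s16(vget_low_s16(coef16));
    int32x4_t coef1 = vmovl_high_s16(coef16);

    int32x4_t a0 = vld1q_s32(a);
    int32x4_t a1 = vld1q_s32(a + 4);

    /* Accumulate the 64-bit products, low and high halves in turn. */
    int64x2_t acc = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
    acc = vmlal_high_s32(acc, a0, coef0);
    acc = vmlal_s32(acc, vget_low_s32(a1), vget_low_s32(coef1));
    acc = vmlal_high_s32(acc, a1, coef1);

    /* Add the two 64-bit lanes with intrinsics available on both targets. */
    return vgetq_lane_s64(acc, 0) + vgetq_lane_s64(acc, 1);
#else
    /* Scalar fallback so the sketch compiles where NEON is unavailable. */
    int64_t acc = 0;
    int i;
    for (i = 0; i < 8; i++) {
        acc += (int64_t)a[i] * coef[i];
    }
    return acc;
#endif
}
```

    With shims like these, the body of the function is spelled the same
    on both targets, and only the header-level block needs to know
    about "__aarch64__".<br>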
    <br>
    <br>
    <div class="moz-cite-prefix">On 11/23/2015 11:11 AM, Jonathan Lennox
      wrote:<br>
    </div>
    <blockquote
      cite="mid:46FD89E2-AB0D-485F-94CE-F9A815C3BC55@vidyo.com"
      type="cite">
      <br class="">
      <div>
        <blockquote type="cite" class="">
          <div class="">On Nov 23, 2015, at 12:04 PM, John Ridges &lt;<a
              class="moz-txt-link-abbreviated" href="mailto:jridges@masque.com">jridges@masque.com</a>&gt; wrote:</div>
          <br class="Apple-interchange-newline">
          <div class="">Hi Jonathan.<br class="">
            <br class="">
            I really, really hate to bring this up this late in the
            game, but I just noticed that your NEON code doesn't use any
            of the "high" intrinsics for ARM64, e.g. instead of:<br
              class="">
            <br class="">
            int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16));<br
              class="">
            <br class="">
            you could use:<br class="">
            <br class="">
            int32x4_t coef1 = vmovl_high_s16(coef16);<br class="">
            <br class="">
            and instead of:<br class="">
            <br class="">
            int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0),
            vget_high_s32(coef0));<br class="">
            <br class="">
            you could use:<br class="">
            <br class="">
            int64x2_t b1 = vmlal_high_s32(b0, a0, coef0);<br class="">
            <br class="">
            and instead of:<br class="">
            <br class="">
            int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));<br
              class="">
            int64x1_t cS = vshr_n_s64(c, 16);<br class="">
            int32x2_t d = vreinterpret_s32_s64(cS);<br class="">
            out = vget_lane_s32(d, 0);<br class="">
            <br class="">
            you could use:<br class="">
            <br class="">
            out = (opus_int32)(vaddvq_s64(b3) &gt;&gt; 16);<br class="">
            <br class="">
            I understand that ARM added these intrinsics because
            "vget_high_xxx" generates an instruction in ARM64, and isn't
            just free the way it was in ARMv7 ("vget_low_xxx" is of
            course still free on both platforms).</div>
        </blockquote>
        <br class="">
      </div>
      <div>Other than the one-intrinsic optimizations, I’d rather keep
        the Neon intrinsics code compilable on ARMv7 as well as ARM64:
        the Neon code is a performance boost for both platforms, and I’d
        rather not litter it with #ifdefs unless there’s a large
        difference between the platforms.</div>
      <div><br class="">
      </div>
      <div>It looks like Clang (the version in Xcode 7.1.1, at least) is
        smart enough to optimize the first two operations you mention,
        emitting sshll2 and smlal2 properly, though the third causes
        a gratuitous extra “ext.16b” to be generated. I’ve filed a
        missed-optimization bug on Clang for the third case.</div>
      <div><br class="">
      </div>
      <div>Here’s the code it generates:</div>
      <div><br class="">
      </div>
      <div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">_silk_NSQ_noise_shape_feedback_loop_neon:</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">000000000000004c        ldr      w9, [x0]</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000050        cmp      w3, #8</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000054        b.ne    0x9c</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000058        dup.4s  v0, w9</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">000000000000005c        ldr      q1, [x1]</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000060        ext.16b v0, v0, v1, #12</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000064        ldur    q1, [x1, #12]</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000068        ldr      q2, [x2]</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">000000000000006c        sshll.4s        v3, v2, #0</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000070        sshll2.4s       v2, v2, #0</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000074        smull.2d        v4, v0, v3</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000078        smlal2.2d       v4, v0, v3</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">000000000000007c        smlal.2d        v4, v1, v2</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000080        smlal2.2d       v4, v1, v2</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000084        ext.16b v2, v4, v4, #8</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000088        add     d2, d4, d2</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">000000000000008c        sshr    d2, d2, #16</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000090        fmov    w0, s2</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000094        stp      q0, q1, [x1]</div>
        <div style="margin: 0px; font-size: 11px; font-family: Menlo;"
          class="">0000000000000098        ret</div>
        <div class=""><br class="">
        </div>
        <div class="">(Non-vectorized code for non-order-8 omitted.)</div>
      </div>
    </blockquote>
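    <br>
    As a sketch of the third rewrite quoted above, here is one way the
    two forms of the final reduction could sit side by side. The function
    name sum2_shr16 is hypothetical, the target checks assume GCC/Clang
    predefined macros, and the scalar branch is there only so the sketch
    builds without NEON. Whether the AArch64 branch is worth the extra
    #if is exactly the trade-off discussed in the quoted reply.<br>
    <br>

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

/* Hypothetical helper: adds the two 64-bit lanes in b, shifts the sum
 * right by 16, and keeps the low 32 bits of the result. */
static int32_t sum2_shr16(const int64_t b[2])
{
#ifdef SKETCH_HAVE_NEON
    int64x2_t b3 = vld1q_s64(b);
#if defined(__aarch64__)
    /* AArch64 only: a single across-vector add. */
    return (int32_t)(vaddvq_s64(b3) >> 16);
#else
    /* Spelling that also compiles for ARMv7: add the two halves,
     * shift, then read 32-bit lane 0 of the result. */
    int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3));
    int64x1_t cS = vshr_n_s64(c, 16);
    int32x2_t d = vreinterpret_s32_s64(cS);
    return vget_lane_s32(d, 0);
#endif
#else
    /* Scalar fallback. Assumes the compiler implements >> on a negative
     * value as an arithmetic shift, as GCC and Clang do. */
    return (int32_t)((b[0] + b[1]) >> 16);
#endif
}
```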
    <br>
  </body>
</html>