Nit: in dual_inner_prod_sse, why not do both horizontal sums at the same time? As in:

    xsum1 = _mm_add_ps(_mm_movelh_ps(xsum1, xsum2), _mm_movehl_ps(xsum2, xsum1));
    xsum1 = _mm_add_ps(xsum1, _mm_shuffle_ps(xsum1, xsum1, 0xf5));
    _mm_store_ss(xy1, xsum1);
    _mm_store_ss(xy2, _mm_movehl_ps(xsum1, xsum1));

--John
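
For reference, here is a lane-by-lane sketch of why the combined reduction works. The wrapper name dual_hsum_sse is hypothetical and used only for illustration; it assumes xsum1 and xsum2 already hold the four partial sums of each inner product, as they would at the end of the main loop.

```c
#include <xmmintrin.h>

/* Hypothetical helper (not in the patch): reduces two 4-lane accumulators
 * to two scalars using the combined horizontal sum suggested above.
 * Lanes are written low to high: xsum1 = [a0 a1 a2 a3], xsum2 = [b0 b1 b2 b3]. */
static void dual_hsum_sse(__m128 xsum1, __m128 xsum2, float *xy1, float *xy2)
{
    /* movelh(xsum1, xsum2) = [a0 a1 b0 b1]
     * movehl(xsum2, xsum1) = [a2 a3 b2 b3]
     * sum                  = [a0+a2 a1+a3 b0+b2 b1+b3] */
    xsum1 = _mm_add_ps(_mm_movelh_ps(xsum1, xsum2),
                       _mm_movehl_ps(xsum2, xsum1));

    /* shuffle 0xf5 selects lanes [1 1 3 3], so after the add lane 0 holds
     * a0+a1+a2+a3 and lane 2 holds b0+b1+b2+b3. Lanes 1 and 3 hold
     * doubled partial sums and are ignored. */
    xsum1 = _mm_add_ps(xsum1, _mm_shuffle_ps(xsum1, xsum1, 0xf5));

    /* Lane 0 is the first result. */
    _mm_store_ss(xy1, xsum1);

    /* movehl(x, x) moves lane 2 down to lane 0, giving the second result. */
    _mm_store_ss(xy2, _mm_movehl_ps(xsum1, xsum1));
}
```

Calling it with _mm_setr_ps(1, 2, 3, 4) and _mm_setr_ps(10, 20, 30, 40) should store 10 and 100. Both sums share the same two adds, so this replaces two separate reductions with one, at the cost of one extra movehl for the second store.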