[Theora-dev] mmx optimization
Ottavio Campana
ottavio.campana at dei.unipd.it
Tue Apr 19 06:27:30 PDT 2005
Hi,
I've been looking through the archives of the mailing list and I've
seen that you have rewritten a lot of functions using MMX to make them
faster.
I'm currently trying to optimize some code, but I'm having some problems,
because I work with 16 bits per component and not 8 like Theora. I know
that it is off topic, but I'm posting to ask for a little help.
I've got this function that calculates the SAD (sum of absolute differences):
si32
sad_4x4 (macroblock_t * mb, ui8 x, ui8 y)
{
  ui8 i, j;
  si32 corner_x, corner_y, sad;

  corner_x = x << 2;
  corner_y = y << 2;
  sad = 0;
  for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
      sad +=
        abs (mb->orig_mb[corner_x + i][corner_y + j] -
             mb->pred_mb[corner_x + i][corner_y + j]);
  return sad;
}
where mb->orig_mb and mb->pred_mb are arrays of short int and not
unsigned char. I therefore cannot use psadbw, because it works on 8-bit
data. I've currently rewritten the function this way:
si32
sad_4x4 (macroblock_t * mb, ui8 x, ui8 y)
{
  __m64 zeros, ones, orig, pred, diff, cmp, sign, sad;
  si32 corner_x, corner_y;

  corner_x = x << 2;
  corner_y = y << 2;
  zeros = _mm_setzero_si64 ();
  ones = _mm_set1_pi16 (1);

  /* row 0: load four 16-bit samples from each block */
  orig = *((__m64*) &mb->orig_mb[corner_x][corner_y]);
  pred = *((__m64*) &mb->pred_mb[corner_x][corner_y]);
  diff = _m_psubw (orig, pred);
  cmp = _m_pcmpgtw (zeros, diff);   /* -1 where diff < 0, else 0 */
  sign = _m_paddw (ones, cmp);
  sign = _m_paddw (sign, cmp);      /* 1 + 2 * cmp, i.e. -1 or +1 */
  sad = _m_pmaddwd (diff, sign);    /* pairwise sums of |diff| */

  /* row 1 */
  orig = *((__m64*) &mb->orig_mb[corner_x+1][corner_y]);
  pred = *((__m64*) &mb->pred_mb[corner_x+1][corner_y]);
  diff = _m_psubw (orig, pred);
  cmp = _m_pcmpgtw (zeros, diff);
  sign = _m_paddw (ones, cmp);
  sign = _m_paddw (sign, cmp);
  cmp = _m_pmaddwd (diff, sign);
  sad = _m_paddd (sad, cmp);

  /* row 2 */
  orig = *((__m64*) &mb->orig_mb[corner_x+2][corner_y]);
  pred = *((__m64*) &mb->pred_mb[corner_x+2][corner_y]);
  diff = _m_psubw (orig, pred);
  cmp = _m_pcmpgtw (zeros, diff);
  sign = _m_paddw (ones, cmp);
  sign = _m_paddw (sign, cmp);
  cmp = _m_pmaddwd (diff, sign);
  sad = _m_paddd (sad, cmp);

  /* row 3 */
  orig = *((__m64*) &mb->orig_mb[corner_x+3][corner_y]);
  pred = *((__m64*) &mb->pred_mb[corner_x+3][corner_y]);
  diff = _m_psubw (orig, pred);
  cmp = _m_pcmpgtw (zeros, diff);
  sign = _m_paddw (ones, cmp);
  sign = _m_paddw (sign, cmp);
  cmp = _m_pmaddwd (diff, sign);
  sad = _m_paddd (sad, cmp);

  /* add the low and high 32-bit partial sums */
  return _m_to_int (sad) + _m_to_int (_m_psrlqi (sad, 32));
}
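To make the sign trick above explicit, here is a small scalar sketch of what
each 16-bit lane computes. The helper names abs16_via_mask and sad_row4_model
are made up for illustration and are not part of my real code, and the sketch
assumes that every difference fits in 16 bits (psubw wraps around otherwise).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane (hypothetical helper): pcmpgtw yields
   -1 (all ones) when 0 > diff and 0 otherwise, so 1 + cmp + cmp is
   either -1 or +1. */
static int32_t abs16_via_mask(int16_t diff)
{
    int16_t cmp  = (int16_t)((0 > diff) ? -1 : 0);
    int16_t sign = (int16_t)(1 + cmp + cmp);
    return (int32_t)diff * sign;   /* the product that pmaddwd forms */
}

/* Model of one row of four samples (hypothetical helper): pmaddwd adds
   the products pairwise and the two 32-bit halves are summed at the end,
   which amounts to adding all four absolute differences. */
static int32_t sad_row4_model(const int16_t *orig, const int16_t *pred)
{
    int32_t sum = 0;
    assert(orig != NULL && pred != NULL);
    for (size_t j = 0; j < 4; j++)
        sum += abs16_via_mask((int16_t)(orig[j] - pred[j]));
    return sum;
}
```

So each row costs a subtract, a compare, two adds and a multiply-add before
the accumulation, which may be part of why the MMX version does not win.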
but it isn't faster. Does any of you have a hint to make it faster?
I've got another question: why don't you call _mm_empty when you use
intrinsic asm?
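To show what I mean, here is a minimal sketch of the pattern I would expect.
The MMX registers alias the x87 floating point registers, and _mm_empty
(the EMMS instruction) marks them as free again before any later floating
point code runs. The function name sum4_i16 is hypothetical, and the plain C
fallback is only there so the sketch also builds where the compiler does not
define __MMX__.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical helper: sum four signed 16-bit values. */
static int32_t sum4_i16(const int16_t v[4])
{
    assert(v != NULL);
#if defined(__MMX__)
    __m64 x, ones, s;
    int32_t r;

    memcpy(&x, v, sizeof x);              /* load 4 x 16 bit */
    ones = _mm_set1_pi16(1);
    s = _mm_madd_pi16(x, ones);           /* two 32-bit partial sums */
    r = _mm_cvtsi64_si32(s)
      + _mm_cvtsi64_si32(_mm_srli_si64(s, 32));
    _mm_empty();                          /* EMMS: release the x87 state */
    return r;
#else
    /* assumption: scalar fallback when MMX is not available */
    return (int32_t)v[0] + v[1] + v[2] + v[3];
#endif
}
```

My understanding is that the call could also be made once after a whole batch
of MMX routines instead of inside each one, but that is a guess on my part.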
Thank you and excuse me for the OT.
--
Ottavio Campana
Telecommunication Engineer
Lab. Immagini
Dept. of Information Engineering
University of Padova
Via Gradenigo 6/B
35131 Padova
Italy