I forgot to send the link for the last graph

Hi Timothy, below are some new and good results.
>> > Coarse-grained functions are the best candidates for parallelization,
>> > because they make the overhead worthwhile, but the current implementation
>> > has one, at most two. The parts that I suggested in my initial plan are fine
>> The reason the current decoder does this is cache coherency. The idea is
>> that only a few (16 to 32) rows need to be kept in L1/L2 cache between
>> each stage of the pipeline, which is a big reason the current decoder is
>> as fast as it is on high resolution content.
>> It's possible to break this pipeline back up into separate stages that
>> operate on the entire frame at once (e.g., just make the MCU size the
>> height of the frame). You lose cache coherency, but get coarse-grained
>> parallelism. Only testing will determine which is the better strategy.
> You are right! You gave me a great tip. I ran some tests with different MCU
> sizes. The MCU size in the current implementation is 8.
> For MCU sizes >= 16, the PThread and OpenMP implementations produce the same
> results, that is, a speedup of 13% on average. The time spent on thread
> communication was reduced.
> I plotted three graphs to show these facts.
> The first plots real time vs MCU size. It shows that for MCU sizes >= 16
> the PThread and OpenMP implementations are equivalent.
> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/comparison.png)
> The second graph compares the speedups and shows that coarse-grained
> functions achieve better results.
> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/speedup.png)
> Finally, the third graph plots system time vs MCU size. The larger the MCU
> size, the lower the system time, because the thread communication overhead
> is reduced.
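To make the MCU size effect concrete, here is a minimal sketch (not the decoder's actual code) of a row pipeline that runs its stages over the frame in chunks of mcu_rows. The names run_pipeline, stage_fn and count_rows are hypothetical, and "rows" is just an assumed unit for the MCU size. The return value is the number of chunks per frame, which is roughly how many hand-offs a threaded version would need; it shrinks as the MCU size grows.

```c
#include <stddef.h>

/* Hypothetical stage signature: process rows [row0, row_end). */
typedef void (*stage_fn)(int row0, int row_end, void *ctx);

/* Example stage for illustration: adds the number of rows it saw to *ctx. */
static void count_rows(int row0, int row_end, void *ctx) {
    *(int *)ctx += row_end - row0;
}

/*
 * Run every stage over the frame, mcu_rows rows at a time.
 * Returns the number of chunks processed (0 if the arguments are invalid).
 * A small mcu_rows keeps the working set small between stages;
 * mcu_rows == frame_rows turns each stage into one pass over the whole frame.
 */
static int run_pipeline(int frame_rows, int mcu_rows,
                        const stage_fn *stages, size_t nstages, void *ctx) {
    int chunks = 0;
    if (frame_rows <= 0 || mcu_rows <= 0) return 0;
    for (int row0 = 0; row0 < frame_rows; row0 += mcu_rows) {
        int row_end = row0 + mcu_rows;
        if (row_end > frame_rows) row_end = frame_rows;
        for (size_t s = 0; s < nstages; s++) {
            stages[s](row0, row_end, ctx);
        }
        chunks++;
    }
    return chunks;
}
```

For a frame of 90 rows, this gives 12 chunks with mcu_rows = 8, 6 chunks with mcu_rows = 16, and a single chunk when mcu_rows equals the frame height.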
>> > the decoding time should be reduced by 33%, but it was just 10% for
>> > pthread and 5% for openMP.
>> The chroma coefficients are usually quantized much more coarsely, so
>> they very likely don't account for a full 33% of the decode time even on
>> a uniprocessor. Fewer coded blocks and fewer tokens to unpack in the
>> blocks that are coded means fewer and smaller iDCTs, fewer invocations
>> of the loop filter, etc.
>> It's sad that OpenMP didn't do better... I was hoping with the option
>> available to them to do platform-specific tricks, they could cut down on
>> the overhead of pthreads, but I guess that stuff's just not "there" yet.
> The results above show that this is not the case. For coarse-grained jobs
> they are equivalent.
>> > This version will run the functions (oc_dec_dc_unpredict_mcu_plane +
>> > oc_dec_frags_recon_mcu_plane) and
>> > (oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in
>> > parallel. The upper bound for the gain is 60%, that is, let T2 be the
>> > time to decode a video with the pipelined implementation. T2 should be
>> > at most 0.4To.
>> I think you mean "at least". Let us know what your test results look
>> like (good or bad)! Keep in mind that, if possible, the same thread that
>> does oc_dec_dc_unpredict_mcu_plane+oc_dec_frags_recon_mcu_plane on a set
>> of blocks should also be the one to do
>> oc_state_loop_filter_frag_rows+oc_state_borders_fill_rows on the same
>> set of blocks (and hopefully the scheduler doesn't muck things up by
>> moving the thread to a different physical CPU in between).
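The "same thread does both stages on the same set of blocks" idea could be sketched roughly as below, using pthreads. This is only an illustration under loud assumptions: stage_recon and stage_filter are hypothetical stand-ins (not the real oc_* functions), the frame is modeled as one int per row, and the sketch ignores the dependencies across band edges that real DC prediction and loop filtering have. It also does not pin threads to CPUs, so the scheduler can still move them.

```c
#include <pthread.h>

#define MAX_THREADS 8

/* One contiguous band of rows, owned by a single thread. */
typedef struct {
    int *rows;
    int row0;
    int row_end;
} band_t;

/* Hypothetical stand-in for DC unprediction + reconstruction. */
static void stage_recon(int *rows, int r0, int r1) {
    for (int r = r0; r < r1; r++) rows[r] += 1;
}

/* Hypothetical stand-in for loop filtering + border filling. */
static void stage_filter(int *rows, int r0, int r1) {
    for (int r = r0; r < r1; r++) rows[r] *= 2;
}

/* Each worker runs BOTH stages on its own band, so the data it just
 * reconstructed is the data it filters next. */
static void *band_worker(void *arg) {
    band_t *b = (band_t *)arg;
    stage_recon(b->rows, b->row0, b->row_end);
    stage_filter(b->rows, b->row0, b->row_end);
    return NULL;
}

/* Split nrows into nthreads bands and process them in parallel.
 * Returns 0 on success, -1 on bad arguments or thread creation failure. */
static int decode_in_bands(int *rows, int nrows, int nthreads) {
    pthread_t tid[MAX_THREADS];
    band_t band[MAX_THREADS];
    int started = 0;
    int rc = 0;
    if (nrows < 0 || nthreads < 1 || nthreads > MAX_THREADS) return -1;
    for (int t = 0; t < nthreads; t++) {
        band[t].rows = rows;
        band[t].row0 = (int)((long)nrows * t / nthreads);
        band[t].row_end = (int)((long)nrows * (t + 1) / nthreads);
        if (pthread_create(&tid[t], NULL, band_worker, &band[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++) pthread_join(tid[t], NULL);
    return rc;
}
```

The alternative layout, one thread per stage with rows handed from one thread to the next, would move each band of rows between threads (and possibly between caches) once per stage, which is the cost the band-per-thread layout tries to avoid.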
