No subject

Mon Mar 24 17:02:25 PDT 2008

Shows that as the MCU increases, the OpenMP extra overhead is
amortized and OpenMP becomes as fast as the pthreads implementation.

The last chart
http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/systime.png

Shows that both pthreads and OpenMP overhead decreases as what seems
to be a logarithmic function of the MCU size.

This was a great experiment, and from what I can conclude, the OpenMP
implementation can be as good as the pthread.

Therefore maybe it is worth to work on a OpenMP implementation because
of the easiness of code maintenance.

What you guys think ?

Cheers,
Felipe

On Mon, Jul 14, 2008 at 1:04 AM, Leonardo de Paula Rosa Piga
<lpiga at terra.com.br> wrote:
> I forgot to send the link for the last graph
> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/systime.png)
>
> On Mon, Jul 14, 2008 at 1:03 AM, Leonardo de Paula Rosa Piga
> <lpiga at terra.com.br> wrote:
>>
>> Hi Timothy, below some new and good results.
>>
>> On Mon, Jul 7, 2008 at 1:52 AM, Timothy B. Terriberry
>> <tterribe at email.unc.edu> wrote:
>>>
>>> Leonardo de Paula Rosa Piga wrote:
>>> > Coarse grain functions are the best functions to be parallelize to
>>> > become the overhead worthwhile, but the current implementation has one,
>>> > at most two. The parts that I suggested in my initial plan are fine
>>>
>>> The reason the current decoder does this is cache coherency. The idea is
>>> that only a few (16 to 32) rows need to be kept in L1/L2 cache between
>>> each stage of the pipeline, which is a big reason the current decoder is
>>> as fast as it is on high resolution content.
>>>
>>> It's possible break this pipeline back up into separate stages that
>>> operate on the entire frame at once (e.g., just make the MCU size the
>>> height of the frame). You lose cache coherency, but get coarse-grained
>>> parallelism. Only testing will determine which is the better strategy.
>>
>> You are right! You gave me a great tip. I did some tests for different
>> MCU  size. The MCU size for the current implementation is 8.
>> For MCU size >= 16, PThread and OpenMP implementations produce the same
>> results, that is, a speedup 13% on average. The time spend to thread
>> communication was reduced.
>>
>> I plotted three graphs to show these facts
>> One for Real Time vs MCU size. This graph shows that for MCU size >= 16
>> PThread and OpenMP implementations are equivalents.
>> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/comparison.png)
>>
>> The second graph compares the speedup and prove that for coarse grain
>> functions we can achieve better results.
>> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/speedup.png)
>>
>> And to conclude the third graph. It was plotted the system time vs MCU
>> size. For greater MCU size, lower is the system time. Because the thread
>> communication overhead was reduced.
>>
>>>
>>>
>>> > the decoding time should be reduced in 33%, but it was just 10% for
>>> > pthread an 5% for openMP.
>>>
>>> The chroma coefficients are usually quantized much more coarsely, so
>>> they very likely don't account for a full 33% of the decode time even on
>>> a uniprocessor. Fewer coded blocks and fewer tokens to unpack in the
>>> blocks that are coded means fewer and smaller iDCTs, fewer invocations
>>> of the loop filter, etc.
>>>
>>> It's sad that OpenMP didn't do better... I was hoping with the option
>>> available to them to do platform-specific tricks, they could cut down on
>>> the overhead of pthreads, but I guess that stuff's just not "there" yet.
>>
>> The results above show that it is not the case. For coarse grain jobs they
>> are equivalent
>>
>>>
>>>
>>> > These version will run the functions (c_dec_dc_unpredict_mcu_plane +
>>> > oc_dec_frags_recon_mcu_plane) and
>>> > (oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in
>>> > parallel. The upper bound for the gain is 60%, that is, let T2 be a
>>> > video decoded with the pipelined implementation. T2 should be at most
>>> > 0.4To.
>>>
>>> I think you mean "at least". Let us know what your test results look
>>> like (good or bad)! Keep in mind that, if possible, the same thread that
>>> does oc_dec_dc_unpredict_mcu_plane+oc_dec_frags_recon_mcu_plane on a set
>>> of blocks should also be the one to do
>>> oc_state_loop_filter_frag_rows+oc_state_borders_fill_rows on the same
>>> set of blocks (and hopefully the scheduler doesn't muck things up by
>>> moving the thread to a different physical CPU inbetween).
>>>
>>> _______________________________________________
>>> theora-dev mailing list
>>> theora-dev at xiph.org
>>> http://lists.xiph.org/mailman/listinfo/theora-dev
>>>
>>
>>
>>
>> --
>> Leonardo de Paula Rosa Piga
>> Undergraduate Computer Engineering Student
>> LSC - IC - UNICAMP
>> http://lampiao.lsc.ic.unicamp.br/~piga
>
>
> --
> Leonardo de Paula Rosa Piga
> Undergraduate Computer Engineering Student
> LSC - IC - UNICAMP
> http://lampiao.lsc.ic.unicamp.br/~piga
> _______________________________________________
> theora-dev mailing list
> theora-dev at xiph.org
> http://lists.xiph.org/mailman/listinfo/theora-dev
>
>

-- 
________________________________________
Felipe Portavales Goldstein <portavales at gmail>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://lampiao.lsc.ic.unicamp.br/~portavales/