Hi Timothy, below are some new, good results.<br><br><div class="gmail_quote">On Mon, Jul 7, 2008 at 1:52 AM, Timothy B. Terriberry &lt;<a href="mailto:tterribe@email.unc.edu">tterribe@email.unc.edu</a>&gt; wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="Ih2E3d">Leonardo de Paula Rosa Piga wrote:<br>

&gt; Coarse-grained functions are the best candidates for parallelization,<br>

&gt; because they make the overhead worthwhile, but the current implementation has one,<br>

&gt; at most two. The parts that I suggested in my initial plan are fine<br>

<br>

</div>The reason the current decoder does this is cache coherency. The idea is<br>

that only a few (16 to 32) rows need to be kept in L1/L2 cache between<br>

each stage of the pipeline, which is a big reason the current decoder is<br>

as fast as it is on high resolution content.<br>

<br>

It&#39;s possible to break this pipeline back up into separate stages that<br>

operate on the entire frame at once (e.g., just make the MCU size the<br>

height of the frame). You lose cache coherency, but get coarse-grained<br>

parallelism. Only testing will determine which is the better strategy.</blockquote><div>You are right! You gave me a great tip. I ran some tests with different MCU sizes. The MCU size in the current implementation is 8.<br>
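<br>As a rough illustration of why the MCU size matters for thread communication, the sketch below counts how many bands the pipeline has to process per frame, assuming each band costs one hand-off between threads. The function name mcu_bands_per_frame and the choice to measure both sizes in the same row units are assumptions made for illustration; they are not taken from the decoder source.<br>

```c
#include <stddef.h>

/* Hypothetical helper (not part of libtheora): number of MCU bands needed
   to cover a frame, with both sizes measured in the same row units.
   Returns 0 if mcu_rows is 0, since no band size is defined in that case. */
size_t mcu_bands_per_frame(size_t frame_rows, size_t mcu_rows) {
  if (mcu_rows == 0) return 0;
  /* Ceiling division: a partial band at the bottom of the frame still counts. */
  return (frame_rows + mcu_rows - 1) / mcu_rows;
}
```

For a hypothetical frame of 90 rows, an MCU size of 8 gives 12 bands, 16 gives 6, and 90 (the whole frame) gives a single band, so a larger MCU size means fewer hand-offs per frame.<br><br>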

For MCU size &gt;= 16, the PThread and OpenMP implementations produce the same results, that is, a speedup of 13% on average. The time spent on thread communication was reduced.<br><br>I plotted three graphs to show these facts.<br>

The first plots real time vs. MCU size. It shows that for MCU size &gt;= 16 the PThread and OpenMP implementations are equivalent.<br>(<a href="http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/comparison.png">http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/comparison.png</a>)<br>

<br>The second graph compares the speedups and shows that we achieve better results with coarse-grained functions.<br>(<a href="http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/speedup.png">http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/speedup.png</a>)<br>
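<br>For reading the speedup numbers, a small sketch of the arithmetic may help. It assumes speedup is defined as serial time divided by parallel time, and it uses Amdahl&#39;s law as a generic upper bound. The function names and the example value are illustrative assumptions, not measurements from these tests.<br>

```c
/* Speedup as serial time over parallel time.
   Returns 0.0 when the parallel time is not positive. */
double speedup(double t_serial, double t_parallel) {
  if (t_parallel <= 0.0) return 0.0;
  return t_serial / t_parallel;
}

/* Amdahl's law: upper bound on the speedup when a fraction p (0..1) of the
   run time is spread over n threads and the rest stays serial.
   Returns 0.0 for n == 0, which has no meaning here. */
double amdahl_bound(double p, unsigned n) {
  if (n == 0) return 0.0;
  return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For instance, amdahl_bound(0.5, 2) is about 1.33: parallelizing half of the run time over two threads cannot give more than a 33% speedup, however cheap the thread communication is.<br>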

<br>Finally, the third graph plots system time vs. MCU size. The larger the MCU size, the lower the system time, because the thread communication overhead is reduced.<br><br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<br>

<div class="Ih2E3d"><br>

&gt; the decoding time should be reduced by 33%, but it was just 10% for<br>

&gt; pthreads and 5% for OpenMP.<br>

<br>

</div>The chroma coefficients are usually quantized much more coarsely, so<br>

they very likely don&#39;t account for a full 33% of the decode time even on<br>

a uniprocessor. Fewer coded blocks and fewer tokens to unpack in the<br>

blocks that are coded means fewer and smaller iDCTs, fewer invocations<br>

of the loop filter, etc.<br>

<br>

It&#39;s sad that OpenMP didn&#39;t do better... I was hoping with the option<br>

available to them to do platform-specific tricks, they could cut down on<br>

the overhead of pthreads, but I guess that stuff&#39;s just not &quot;there&quot; yet.</blockquote><div>The results above show that this is not the case: for coarse-grained jobs the two are equivalent.<br>&nbsp;<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<br>

<div class="Ih2E3d"><br>

&gt; This version will run the functions (oc_dec_dc_unpredict_mcu_plane +<br>

&gt; oc_dec_frags_recon_mcu_plane) and<br>

&gt; (oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in<br>

&gt; parallel. The upper bound for the gain is 60%, that is, let T2 be the<br>

&gt; time to decode a video with the pipelined implementation. T2 should be at most 0.4To.<br>

<br>

</div>I think you mean &quot;at least&quot;. Let us know what your test results look<br>

like (good or bad)! Keep in mind that, if possible, the same thread that<br>

does oc_dec_dc_unpredict_mcu_plane+oc_dec_frags_recon_mcu_plane on a set<br>

of blocks should also be the one to do<br>

oc_state_loop_filter_frag_rows+oc_state_borders_fill_rows on the same<br>

set of blocks (and hopefully the scheduler doesn&#39;t muck things up by<br>

moving the thread to a different physical CPU in between).</blockquote><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>


</blockquote></div><br><br clear="all"><br>-- <br>Leonardo de Paula Rosa Piga<br>Undergraduate Computer Engineering Student <br>LSC - IC - UNICAMP<br><a href="http://lampiao.lsc.ic.unicamp.br/~piga">http://lampiao.lsc.ic.unicamp.br/~piga</a>