This week I will work on the pipelined version, and I will send a report by the end of the week.<br><br><br><div class="gmail_quote">On Sun, Jul 6, 2008 at 9:39 PM, Leonardo de Paula Rosa Piga &lt;<a href="mailto:lpiga@terra.com.br">lpiga@terra.com.br</a>&gt; wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div>Hi all,<br><br>I apologize for not keeping you up to date on what is going on with my project. First, Portavales works at the desk behind me, and we talk about the project when we go for coffee. Second, I didn&#39;t know we had to discuss it weekly; that was my fault. I should have read the rules. Sorry.<br>

<br>During the first month, I studied the code and the Theora Beta implementation. The code is completely different from Alpha, so I had to familiarize myself with it.<br></div><div>After that I started doing tests with OpenMP.<br>

<br></div><div>A first test was 40% faster, but unfortunately it did not decode the frames correctly: three quarters of the picture was green.<br><br></div><div> I have one implementation that decodes the Y, Cb and Cr planes in parallel. The OpenMP implementation was about 5% faster. That is not worthless, since it does not require any major modifications. <br>

<br>I looked at Ralph&#39;s implementation and merged it into the current code. The speed-up was about 10%, but the code has to be modified in many places.<br clear="all">

<br></div>Extracting parallelism from the current implementation is very difficult. Coarse-grained functions are the best candidates for parallelization, because only they make the threading overhead worthwhile, but the current implementation has one, at most two. The parts that I suggested in my initial plan are fine-grained functions: they use a lot of CPU time in total, but they are called too many times. The time spent creating and synchronizing threads is greater than the speed-up gained. We need functions that are called only a few times and spend a lot of CPU time per call. Data dependencies should also be as few as possible.<br>
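<br>The overhead argument above can be sketched with a toy cost model in Python. The model and every number in it are hypothetical (they are not measurements from the decoder): each parallelized call pays a fixed cost to create and synchronize its threads, so the same total amount of work can get faster or slower depending on how many calls it is split into.<br>
<pre>
```python
# Toy cost model (an assumption, not measured data): a function is called
# `calls` times, each call does `work_per_call` seconds of work, and each
# parallelized call pays `overhead` seconds to create and join its threads.

def serial_time(calls, work_per_call):
    return calls * work_per_call

def parallel_time(calls, work_per_call, threads, overhead):
    # Ideal split of the work across threads, plus the fixed per-call cost.
    return calls * (work_per_call / threads + overhead)

def gain(calls, work_per_call, threads, overhead):
    # Fraction of the serial time saved; a negative value means a slowdown.
    t0 = serial_time(calls, work_per_call)
    t1 = parallel_time(calls, work_per_call, threads, overhead)
    return 1.0 - t1 / t0

# Same total work (1 second) and the same hypothetical 50 microsecond overhead.
fine = gain(calls=100_000, work_per_call=10e-6, threads=3, overhead=50e-6)
coarse = gain(calls=10, work_per_call=0.1, threads=3, overhead=50e-6)

print(f"fine-grained gain:   {fine:+.0%}")
print(f"coarse-grained gain: {coarse:+.0%}")
```
</pre>
With these made-up numbers the fine-grained case ends up several times slower than the serial code, while the coarse-grained case saves about two thirds of the time.<br>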

<br>According to the model that I made (<a href="http://lampiao.lsc.ic.unicamp.br/%7Epiga/gsoc_2008/implementation.pdf" target="_blank">http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/implementation.pdf</a>), the decoding time should be reduced by 33%, but the reduction was just 10% for pthreads and 5% for OpenMP.<br><br>I used a 1440x1080 video. The pthread implementation has 3 threads, and the OpenMP implementation was executed with the environment variable OMP_NUM_THREADS=3. The results are:<br>

<br>
<pre>
           Real (s)   User (s)   System (s)   Speed-up (%)
OpenMP       25.2       29.2        1.8             4
PThread      23.8       28.3        1.0            10
Current      26.2       26.0        0.3             0
</pre>
<br>I used an Intel(R) Core(TM)2 Quad CPU at 2.4 GHz with 4 GB of RAM. The video is 85 seconds long.<br>These

two implementations decode the Y, Cb and Cr planes in parallel, which is why I am using OMP_NUM_THREADS=3. The upper bound for the gain is 33%. That is, let To be the time spent decoding a video with the current implementation, and let T1 be the time spent decoding it with the parallel implementation. In the best case, T1 is 0.66To.<div><br>
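<br>As a quick check of these figures, here is a small Python sketch that recomputes the gain from the wall-clock ("Real") column of the table and the best-case time implied by the 33% bound. The function names are illustrative only, and the gain is taken to be the fraction of To that is saved, which is my reading of the speed-up column.<br>
<pre>
```python
# Wall-clock ("Real") times in seconds, copied from the results table.
T_CURRENT = 26.2
T_OPENMP = 25.2
T_PTHREAD = 23.8

def gain(t_serial, t_parallel):
    # Fraction of the serial decoding time that was saved.
    return (t_serial - t_parallel) / t_serial

def best_case_time(t_serial, max_gain):
    # Shortest decoding time allowed by an upper bound on the gain.
    return t_serial * (1.0 - max_gain)

print(f"OpenMP gain:  {gain(T_CURRENT, T_OPENMP):.1%}")
print(f"PThread gain: {gain(T_CURRENT, T_PTHREAD):.1%}")
# 0.33 is the upper bound stated in the text for three plane threads.
print(f"Best-case T1: {best_case_time(T_CURRENT, 0.33):.1f} s")
```
</pre>
The computed gains come out close to the rounded values reported in the table, and both are far from the best case.<br>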

<br>I will use the pthread implementation to try a pipelined version and see if we obtain more gains.<br></div>This version will run the functions (oc_dec_dc_unpredict_mcu_plane + oc_dec_frags_recon_mcu_plane) and<br>(oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in parallel. The upper bound for the gain is 60%. That is, let T2 be the time spent decoding a video with the pipelined implementation. In the best case, T2 is 0.4To.<br>
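<br>One possible shape for such a two-stage pipeline, sketched in Python with a bounded queue between the stages. This is only an illustration of the hand-off structure, not the design used in the branch: reconstruct and postprocess are hypothetical stand-ins for the two groups of decoder functions, and the queue size is arbitrary.<br>
<pre>
```python
import queue
import threading

# Hypothetical stand-ins for the two groups of decoder functions; the real
# ones are C functions in libtheora and do far more than this.
def reconstruct(row):
    # Stage 1: stands in for DC unprediction + fragment reconstruction.
    return f"recon({row})"

def postprocess(data):
    # Stage 2: stands in for loop filtering + border filling.
    return f"filtered({data})"

def run_pipeline(rows):
    handoff = queue.Queue(maxsize=4)   # bounded buffer between the stages
    results = []
    done = object()                    # sentinel marking the end of input

    def stage1():
        for row in rows:
            handoff.put(reconstruct(row))
        handoff.put(done)

    def stage2():
        while True:
            item = handoff.get()
            if item is done:
                break
            results.append(postprocess(item))

    workers = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

print(run_pipeline(range(3)))
```
</pre>
Stage 2 starts working on a row as soon as stage 1 hands it over, which is where the overlap comes from. In CPython the two threads do not execute CPU-bound work at the same time, so this shows the structure only; a pthread version would need the same kind of bounded hand-off and end-of-input signal.<br>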

<br>Here is the branch for the OpenMP implementation: <a href="http://svn.xiph.org/branches/theora_multithread_decode_omp/" target="_blank">http://svn.xiph.org/branches/theora_multithread_decode_omp/</a><br>Here is the branch for the PThread implementation: <a href="http://svn.xiph.org/branches/theora_multithread_decode_pthread/" target="_blank">http://svn.xiph.org/branches/theora_multithread_decode_pthread/</a><br>

<br><br>Again, sorry for going so long without any feedback.<br clear="all"><font color="#888888"><br>-- <br>Leonardo de Paula Rosa Piga<br>Undergraduate Computer Engineering Student <br>LSC - IC - UNICAMP<br>

<a href="http://lampiao.lsc.ic.unicamp.br/%7Epiga" target="_blank">http://lampiao.lsc.ic.unicamp.br/~piga</a>

</font></blockquote></div><br><br clear="all"><br>-- <br>Leonardo de Paula Rosa Piga<br>Undergraduate Computer Engineering Student <br>LSC - IC - UNICAMP<br><a href="http://lampiao.lsc.ic.unicamp.br/~piga">http://lampiao.lsc.ic.unicamp.br/~piga</a>