[theora-dev] GSoC - Theora multithread decoder
Leonardo de Paula Rosa Piga
lpiga at terra.com.br
Sun Jul 6 17:43:32 PDT 2008
This week I will work on the pipeline, and I will send a report by the end
of the week.
On Sun, Jul 6, 2008 at 9:39 PM, Leonardo de Paula Rosa Piga <
lpiga at terra.com.br> wrote:
> Hi all,
> I apologize for not keeping you up to date on what is going on with my
> project. First, Portavales works at the desk behind me, and we talk about
> the project when we go for coffee. Second, I didn't know we had to discuss
> it weekly. That was my fault; I should have read the rules. Sorry.
> During the first month, I studied the code and the Theora Beta
> implementation. The code is completely different from Alpha, so I had to
> familiarize myself with it.
> After that I started doing tests with OpenMP.
> A first test was 40% faster, but unfortunately it did not decode the
> frame correctly: three quarters of it was green.
> I have one implementation that decodes the Y, Cb and Cr planes in
> parallel. The OpenMP implementation was about 5% faster. That is not
> worthless, since it does not require any major modifications.
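A minimal sketch of the plane-level parallelism described above, assuming one loop iteration per plane. The `plane_t` type and `decode_plane` function are hypothetical stand-ins, not libtheora code; only the OpenMP pragma reflects the technique itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one image plane; not a libtheora type. */
typedef struct {
    unsigned char *data;
    size_t         len;   /* number of samples in the plane */
} plane_t;

/* Hypothetical per-plane work: here it simply inverts every sample. */
static void decode_plane(plane_t *p)
{
    for (size_t i = 0; i < p->len; i++)
        p->data[i] = (unsigned char)(255 - p->data[i]);
}

/* Process Y, Cb and Cr concurrently, one loop iteration per plane.
 * Without -fopenmp the pragma is ignored and the loop runs serially. */
static void decode_planes(plane_t planes[3])
{
    assert(planes != NULL);
    #pragma omp parallel for num_threads(3) schedule(static, 1)
    for (int pli = 0; pli < 3; pli++)
        decode_plane(&planes[pli]);
}
```

The appeal of this approach is that the serial loop stays intact: the pragma is the only change, which matches the remark that no major modifications are needed.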
> I looked at Ralph's implementation and merged it into the current code.
> The speedup was about 10%, but the code has to be modified in many places.
> Extracting parallelism from the current implementation is very difficult.
> Coarse-grained functions are the best candidates for parallelization,
> because they make the threading overhead worthwhile, but the current
> implementation has one, or at most two. The parts that I suggested in my
> initial plan are fine-grained functions: they consume a lot of CPU time in
> total, but only because they are called so many times. The time spent
> creating and synchronizing threads is greater than the speedup gained.
> We need functions that are called only a few times and that spend a lot of
> CPU time per call. Data dependencies should also be as few as possible.
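The trade-off above can be put into a rough cost model. This is only a sketch: it assumes each call's work splits perfectly across threads and that a fixed thread overhead is paid on every call. The function names and all numbers in the usage note are made up for illustration.

```c
#include <assert.h>

/* Total serial time for a function: calls * work per call. */
static double serial_time(double calls, double work)
{
    return calls * work;
}

/* Estimated total time after parallelizing the same function.
 *   calls:    how many times the function is called
 *   work:     serial CPU time per call
 *   threads:  threads sharing each call (perfect split assumed)
 *   overhead: thread creation/synchronization cost paid per call */
static double parallel_time(double calls, double work, int threads,
                            double overhead)
{
    assert(threads > 0);
    return calls * (work / threads + overhead);
}
```

With hypothetical numbers in microseconds, a fine-grained function called 1,000,000 times at 2 us per call with 8 us of overhead gets slower on 4 threads (`parallel_time(1e6, 2.0, 4, 8.0)` exceeds `serial_time(1e6, 2.0)`), while a coarse-grained one called 100 times at 20,000 us per call drops to roughly a quarter of its serial time.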
> According to the model that I made (
> the decoding time should be reduced by 33%, but it was reduced by just 10%
> for pthread and 5% for OpenMP.
> I used a 1440x1080 video. The pthread implementation has 3 threads, and
> the OpenMP implementation was executed with the environment variable
> OMP_NUM_THREADS=3. The results are:
>            Real(s)   User(s)   System(s)   Speed up(%)
> OpenMP      25.2      29.2       1.8
> PThread     23.8      28.3       1.0
> Current     26.2      26.0       0.3          0
> I used an Intel(R) Core(TM)2 Quad CPU at 2.4GHz with 4GB of RAM. The video
> is 85 seconds long.
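For reference, the speedup can be computed from the Real(s) column as the relative reduction in wall-clock time. The helper name below is illustrative only.

```c
#include <assert.h>

/* Percentage reduction in wall-clock time relative to the baseline. */
static double speedup_percent(double baseline_s, double parallel_s)
{
    assert(baseline_s > 0.0);
    return 100.0 * (baseline_s - parallel_s) / baseline_s;
}
```

Against the 26.2 s baseline, `speedup_percent(26.2, 25.2)` is about 3.8 for OpenMP and `speedup_percent(26.2, 23.8)` is about 9.2 for PThread, in line with the rounded 5% and 10% figures.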
> These two implementations decode the Y, Cb and Cr planes in parallel, which
> is why I am using OMP_NUM_THREADS=3. The upper bound on the gain is 33%.
> That is, let To be the time spent decoding a video with the current
> implementation and T1 the time spent decoding it with the parallel
> implementation. Then T1 can be no lower than about 0.66To.
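One way to arrive at a bound of this size is sketched below. It rests on two assumptions that the message does not state: that the video uses 4:2:0 chroma subsampling, and that decoding cost is proportional to the number of samples in a plane. With one thread per plane, the largest plane then sets the pace.

```c
#include <assert.h>

/* Lower bound on T1/To when each plane is decoded by its own thread.
 * Arguments are sample counts per plane. Assumes cost is proportional
 * to the sample count, which is only an approximation. */
static double plane_parallel_bound(long y, long cb, long cr)
{
    long total = y + cb + cr;
    long largest = y;
    assert(y > 0 && cb > 0 && cr > 0);
    if (cb > largest) largest = cb;
    if (cr > largest) largest = cr;
    return (double)largest / (double)total;
}
```

For a 1440x1080 luma plane and two 720x540 chroma planes, `plane_parallel_bound(1440L * 1080L, 720L * 540L, 720L * 540L)` is about 0.667, so the gain cannot exceed roughly 33% under these assumptions.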
> I will use the pthread implementation to try a pipelined version and see if
> we obtain more gains.
> This version will run the functions (oc_dec_dc_unpredict_mcu_plane +
> oc_dec_frags_recon_mcu_plane) and
> (oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in parallel.
> The upper bound for the gain is 60%. That is, let T2 be the time spent
> decoding a video with the pipelined implementation. Then T2 can be no lower
> than 0.4To.
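A two-stage pipeline of this kind could be sketched with pthreads as follows. Everything here is hypothetical: the `pipeline_t` type, the row count and the placeholder arithmetic stand in for the real reconstruction and filtering work, and a real decoder would also have to respect dependencies between neighboring rows that this sketch ignores.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NROWS 8   /* hypothetical number of MCU rows per frame */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int rows_done;          /* rows finished by stage 1 */
    int recon[NROWS];       /* output of stage 1 */
    int filtered[NROWS];    /* output of stage 2 */
} pipeline_t;

/* Stage 1: stands in for DC unprediction + fragment reconstruction. */
static void *stage_recon(void *arg)
{
    pipeline_t *p = arg;
    for (int r = 0; r < NROWS; r++) {
        p->recon[r] = r * 10;              /* placeholder work */
        pthread_mutex_lock(&p->lock);
        p->rows_done = r + 1;              /* publish row r */
        pthread_cond_signal(&p->ready);
        pthread_mutex_unlock(&p->lock);
    }
    return NULL;
}

/* Stage 2: stands in for loop filtering + border filling. */
static void *stage_filter(void *arg)
{
    pipeline_t *p = arg;
    for (int r = 0; r < NROWS; r++) {
        pthread_mutex_lock(&p->lock);
        while (p->rows_done <= r)          /* wait until row r is ready */
            pthread_cond_wait(&p->ready, &p->lock);
        pthread_mutex_unlock(&p->lock);
        p->filtered[r] = p->recon[r] + 1;  /* placeholder work */
    }
    return NULL;
}

/* Run both stages concurrently on one frame; returns 0 on success. */
static int run_pipeline(pipeline_t *p)
{
    pthread_t t1, t2;
    assert(p != NULL);
    p->rows_done = 0;
    if (pthread_mutex_init(&p->lock, NULL) != 0) return -1;
    if (pthread_cond_init(&p->ready, NULL) != 0) return -1;
    if (pthread_create(&t1, NULL, stage_recon, p) != 0) return -1;
    if (pthread_create(&t2, NULL, stage_filter, p) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_cond_destroy(&p->ready);
    pthread_mutex_destroy(&p->lock);
    return 0;
}
```

Stage 2 can start on row 0 while stage 1 is still working on later rows, which is where any pipelining gain would come from. The mutex around `rows_done` also makes the writes to `recon` visible to the second thread.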
> Here is the branch for the OpenMP implementation:
> Here is the branch for the PThread implementation:
> Again, sorry about the long time without any feedback.
> Leonardo de Paula Rosa Piga
> Undergraduate Computer Engineering Student
> LSC - IC - UNICAMP
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP