[theora-dev] GSoC - Theora multithread decoder

Sun Jul 6 17:39:55 PDT 2008

Hi all,

I apologize for not keeping you up to date on what is going on with my project.
First, Portavales works at the desk behind me, and we talk about the project
whenever we go for coffee. Second, I didn't know we had to report weekly; that
was my fault. I should have read the rules. Sorry.

In the first month, I studied the code and the Theora Beta implementation.
The code is completely different from the Alpha, so I had to familiarize
myself with it.
After that I started running tests with OpenMP.

A first test was 40% faster, but unfortunately it did not decode the frame
correctly: three quarters of it was green.

I have one implementation decoding the Y, Cb and Cr planes in parallel. The
OpenMP implementation was about 5% faster. That is not worthless, since it
does not require any major modifications.

I looked at Ralph's implementation and merged it into the current code. The
speedup was about 10%, but the code has to be modified in many places.

Extracting parallelism from the current implementation is very difficult.
Coarse-grained functions are the best candidates for parallelization, because
only they make the threading overhead worthwhile, but the current
implementation has one, at most two. The parts that I suggested in my initial
plan are fine-grained functions: they consume a lot of CPU time in total, but
they are called too many times. The time spent creating and synchronizing
threads is greater than the speedup gained. We need functions that are called
only a few times and spend a lot of CPU time per call. Data dependencies
should also be as few as possible.
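
A minimal sketch of this trade-off, assuming a fixed cost for starting and
joining threads on every call. The overhead and call counts below are
hypothetical values chosen for illustration, not measurements from Theora:

```python
def serial_time(calls, work_per_call):
    """Total time when every call runs on a single thread."""
    return calls * work_per_call


def parallel_time(calls, work_per_call, threads, overhead_per_call):
    """Estimated total time when each call is split evenly across threads.

    Every call also pays a fixed fork/join overhead (assumed constant).
    """
    return calls * (work_per_call / threads + overhead_per_call)


# Hypothetical: 20 microseconds to create and synchronize threads per call.
OVERHEAD = 20e-6

# The same 1 s of total work, as many small calls or as a few large ones.
serial = serial_time(calls=100_000, work_per_call=10e-6)
fine = parallel_time(calls=100_000, work_per_call=10e-6,
                     threads=3, overhead_per_call=OVERHEAD)
coarse = parallel_time(calls=100, work_per_call=10e-3,
                       threads=3, overhead_per_call=OVERHEAD)

print(f"serial:                  {serial:.3f} s")
print(f"fine-grained parallel:   {fine:.3f} s")
print(f"coarse-grained parallel: {coarse:.3f} s")
```

Under these assumptions the fine-grained version is slower than the serial
one, because the per-call overhead exceeds the work saved, while the
coarse-grained version gets close to the ideal threefold speedup.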

According to the model that I made
(http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/implementation.pdf),
the decoding time should be reduced by 33%, but the reduction was just 10% for
pthread and 5% for OpenMP.

I used a 1440x1080 video. The pthread implementation has 3 threads, and the
OpenMP one was executed with the environment variable OMP_NUM_THREADS=3. The
results are:

             Real(s)   User(s)   System(s)   Speed up(%)
OpenMP        25.2      29.2       1.8            4
PThread       23.8      28.3       1.0           10
Current       26.2      26.0       0.3            0
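
As a cross-check, the speedup figures can be recomputed from the real times
above, together with the extra CPU time (user + system) that each threaded
version spends compared with the current decoder. This is only a small
sketch; the function names are illustrative:

```python
# (real, user, system) times in seconds, copied from the results above.
timings = {
    "OpenMP": (25.2, 29.2, 1.8),
    "PThread": (23.8, 28.3, 1.0),
    "Current": (26.2, 26.0, 0.3),
}

base_real, base_user, base_system = timings["Current"]


def reduction_percent(real):
    """Reduction in wall-clock time relative to the current decoder."""
    return (base_real - real) / base_real * 100.0


def extra_cpu_seconds(user, system):
    """CPU time spent beyond what the current decoder needs."""
    return (user + system) - (base_user + base_system)


for name, (real, user, system) in timings.items():
    print(f"{name:8s} reduction {reduction_percent(real):4.1f}%  "
          f"extra CPU {extra_cpu_seconds(user, system):4.1f} s")
```

The reductions come out at roughly 3.8% for OpenMP and 9.2% for pthread,
close to the rounded values in the table, and both threaded versions use
several seconds more CPU time in total than the current decoder.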

I used an Intel(R) Core(TM)2 Quad CPU at 2.4GHz with 4GB of RAM. The video
is 85 seconds long.
These two implementations decode the Y, Cb and Cr planes in parallel, which
is why I am using OMP_NUM_THREADS=3. The upper bound on the gain is 33%: let
To be the time spent decoding a video with the current implementation and T1
the time spent decoding it with the parallel implementation. In the best case
T1 would be about 0.66To; it cannot go below that.
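
One way a bound of about 0.66To can arise is sketched below. It assumes 4:2:0
chroma subsampling (the pixel format is not stated above) and decoding work
proportional to the number of samples in each plane; both are assumptions
made for illustration, and the function name is hypothetical:

```python
def plane_parallel_bound(width, height, chroma_div_x=2, chroma_div_y=2):
    """Best-case T1/To when Y, Cb and Cr are decoded on separate threads.

    Assumes work is proportional to the sample count of each plane, so the
    parallel time is set by the largest plane. The default divisors
    correspond to 4:2:0 subsampling (an assumption).
    """
    luma = width * height
    chroma = (width // chroma_div_x) * (height // chroma_div_y)
    total = luma + 2 * chroma
    return max(luma, chroma) / total


ratio = plane_parallel_bound(1440, 1080)
print(f"best-case T1/To: {ratio:.3f}")
print(f"upper bound on gain: {(1 - ratio) * 100:.1f}%")
```

Under these assumptions the Y plane holds two thirds of the samples, so a
third thread for the second chroma plane adds nothing beyond what two threads
already give, and the wall-clock time cannot drop below the time needed for
the Y plane alone.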

I will use the pthread implementation to try a pipelined version and see if
we obtain more gains.
This version will run the functions (oc_dec_dc_unpredict_mcu_plane +
oc_dec_frags_recon_mcu_plane) and
(oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in parallel.
The upper bound on the gain is 60%: let T2 be the time spent decoding a video
with the pipelined implementation. In the best case T2 would be 0.4To.
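
The general shape of such a pipeline bound can be sketched as follows. The
stage shares used here are hypothetical placeholders, not profiling results,
and the sketch does not attempt to reproduce the 60% figure:

```python
def pipelined_bound(stage_a, stage_b, serial_rest):
    """Best-case T2/To when two stages overlap perfectly.

    stage_a, stage_b and serial_rest are shares of the serial decoding time
    To and must sum to 1. With perfect overlap the slower stage hides the
    faster one; everything else still runs serially.
    """
    assert abs(stage_a + stage_b + serial_rest - 1.0) < 1e-9
    return max(stage_a, stage_b) + serial_rest


# Hypothetical shares: reconstruction 45%, loop filter + border fill 35%,
# and 20% for everything that stays serial.
bound = pipelined_bound(stage_a=0.45, stage_b=0.35, serial_rest=0.20)
print(f"best-case T2/To: {bound:.2f}")
```

With only two overlapping stages and nothing else parallelized, this ratio
cannot fall below 0.5, so reaching lower values requires combining the
pipeline with further parallelism, for example across the planes.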

Here is the branch for the OpenMP implementation:
http://svn.xiph.org/branches/theora_multithread_decode_omp/
Here is the branch for the PThread implementation:
http://svn.xiph.org/branches/theora_multithread_decode_pthread/

Again, sorry for going so long without any feedback.

-- 
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP
http://lampiao.lsc.ic.unicamp.br/~piga