[theora-dev] Multi-thread Theora Decoder

Sat Mar 15 14:43:16 PDT 2008

On Thu, Feb 28, 2008 at 10:40 AM, Timothy B. Terriberry
<tterribe at email.unc.edu> wrote:
> Leonardo de Paula Rosa Piga wrote:
>  > Hi all,
>  >
>  > Does Theora Community have an interest in a multi-thread decoder implementation?
>  >
>  > I'm starting to work with multi-thread and I thought that Theora
>  > Decoder is a good choice for me, because I had been working with it in
>  > a FPGA implementation and I have experience with the library.
>  >
>  > I'm thinking in working with LoopFilter at first. Do you think I could
>  > start with it or there is a better choice?
>
>  Keep in mind that the current decoder is a completely different code
>  base than the one you based your FPGA implementation on. This one
>  re-organizes the decode process to be much more cache-friendly, which
>  makes it scale much better to larger resolutions (in addition to a
>  number of other algorithmic optimizations). Part of this means the
>  decode process is pipelined, so that, for example, the loop filter is
>  only applied to a few rows of blocks at a time, while the data for those
>  blocks is in cache, instead of being applied to the entire frame at once.
You are right. I studied the new version and it seems completely
different. I will study this new implementation to understand what
have been changed.

>
>  There were some early experiments with a multi-threaded decoder in svn:
>  http://svn.xiph.org/experimental/giles/theora-exp-mt/
>
>  The basic idea was that after coefficient unpacking, each color plane
>  could be decoded completely independently (inverse DC prediction,
>  dequantization, iDCT, MC, and loop filter). This obtained a moderate
>  speed-up (I forget the exact numbers, search the mailing list archives),
>  but the difference in the workload between luma and the two chroma
>  planes limited the potential gains.
I took a look at the the experimental multi-thread version and I think
this approach could not be scalable. It just creates three threads and
if you have more than 3 processors they will be idle.

>
>  gcc 4.2 includes support for OpenMP
>  (http://en.wikipedia.org/wiki/OpenMP). I have been meaning to take a
>  look at this and see if it is lightweight enough for finer-grain
>  parallelism, e.g., of the individual loops during the decode process.
>  Some of these are easy to parallelize further (e.g., copying uncoded
>  blocks), but most of the steps within a single color plane have a
>  dependency on the previous rows. There may be some opportunity to break
>  these dependencies with additional overhead, but it's unclear if the
>  trade-off is worth it (this requires experimentation). The loop filter
>  definitely falls into this category, as the precise output is extremely
>  dependent on the order in which it is applied to each edge, since all of
>  the edges overlap.
>
>  The ability to specify parallelism with a few #pragma statements in
>  OpenMP is, however, appealing, as this kind of code is hard to get
>  right, and this would be much easier than maintaining a separate,
>  multi-threaded version of the code. It also provides a portable means of
>  adapting to the number of cores, etc., available.
OpenMP could be a good choice, perhaps a mixed with pthread could be better.

I did a profile and I could see that the hot spots of algorithm are
oc_state_frag_recon and oc_state_loop_filter_frag_rows and related
functions. They correspond about 60% of the total running time of the
decoder. I will try to work initially in this part.

Do you have any other suggestion? Should I focus on maintainability
code or on performance?

>  _______________________________________________
>  theora-dev mailing list
>  theora-dev at xiph.org
>  http://lists.xiph.org/mailman/listinfo/theora-dev
>
>

-- 
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP
http://www.students.ic.unicamp.br/~ra033956