[theora-dev] Multi-thread Theora Decoder

Thu Feb 28 05:40:48 PST 2008

Leonardo de Paula Rosa Piga wrote:
> Hi all,
> 
> Does Theora Community have an interest in a multi-thread decoder implementation?
> 
> I'm starting to work with multi-thread and I thought that Theora
> Decoder is a good choice for me, because I had been working with it in
> a FPGA implementation and I have experience with the library.
> 
> I'm thinking in working with LoopFilter at first. Do you think I could
> start with it or there is a better choice?

Keep in mind that the current decoder is a completely different code
base than the one you based your FPGA implementation on. This one
re-organizes the decode process to be much more cache-friendly, which
makes it scale much better to larger resolutions (in addition to a
number of other algorithmic optimizations). Part of this means the
decode process is pipelined, so that, for example, the loop filter is
only applied to a few rows of blocks at a time, while the data for those
blocks is in cache, instead of being applied to the entire frame at once.

There were some early experiments with a multi-threaded decoder in svn:
http://svn.xiph.org/experimental/giles/theora-exp-mt/

The basic idea was that after coefficient unpacking, each color plane
could be decoded completely independently (inverse DC prediction,
dequantization, iDCT, MC, and loop filter). This obtained a moderate
speed-up (I forget the exact numbers, search the mailing list archives),
but the difference in the workload between luma and the two chroma
planes limited the potential gains.

gcc 4.2 includes support for OpenMP
(http://en.wikipedia.org/wiki/OpenMP). I have been meaning to take a
look at this and see if it is lightweight enough for finer-grain
parallelism, e.g., of the individual loops during the decode process.
Some of these are easy to parallelize further (e.g., copying uncoded
blocks), but most of the steps within a single color plane have a
dependency on the previous rows. There may be some opportunity to break
these dependencies with additional overhead, but it's unclear if the
trade-off is worth it (this requires experimentation). The loop filter
definitely falls into this category, as the precise output is extremely
dependent on the order in which it is applied to each edge, since all of
the edges overlap.

The ability to specify parallelism with a few #pragma statements in
OpenMP is, however, appealing, as this kind of code is hard to get
right, and this would be much easier than maintaining a separate,
multi-threaded version of the code. It also provides a portable means of
adapting to the number of cores, etc., available.