[theora-dev] Multi-thread Theora Decoder
Leonardo de Paula Rosa Piga
lpiga at terra.com.br
Sat Mar 15 14:43:16 PDT 2008
On Thu, Feb 28, 2008 at 10:40 AM, Timothy B. Terriberry
<tterribe at email.unc.edu> wrote:
> Leonardo de Paula Rosa Piga wrote:
> > Hi all,
> > Does Theora Community have an interest in a multi-thread decoder implementation?
> > I'm starting to work with multi-thread and I thought that Theora
> > Decoder is a good choice for me, because I had been working with it in
> > a FPGA implementation and I have experience with the library.
> > I'm thinking in working with LoopFilter at first. Do you think I could
> > start with it or there is a better choice?
> Keep in mind that the current decoder is a completely different code
> base than the one you based your FPGA implementation on. This one
> re-organizes the decode process to be much more cache-friendly, which
> makes it scale much better to larger resolutions (in addition to a
> number of other algorithmic optimizations). Part of this means the
> decode process is pipelined, so that, for example, the loop filter is
> only applied to a few rows of blocks at a time, while the data for those
> blocks is in cache, instead of being applied to the entire frame at once.
You are right. I studied the new version and it seems completely
different. I will study this new implementation to understand what
have been changed.
> There were some early experiments with a multi-threaded decoder in svn:
> The basic idea was that after coefficient unpacking, each color plane
> could be decoded completely independently (inverse DC prediction,
> dequantization, iDCT, MC, and loop filter). This obtained a moderate
> speed-up (I forget the exact numbers, search the mailing list archives),
> but the difference in the workload between luma and the two chroma
> planes limited the potential gains.
I took a look at the the experimental multi-thread version and I think
this approach could not be scalable. It just creates three threads and
if you have more than 3 processors they will be idle.
> gcc 4.2 includes support for OpenMP
> (http://en.wikipedia.org/wiki/OpenMP). I have been meaning to take a
> look at this and see if it is lightweight enough for finer-grain
> parallelism, e.g., of the individual loops during the decode process.
> Some of these are easy to parallelize further (e.g., copying uncoded
> blocks), but most of the steps within a single color plane have a
> dependency on the previous rows. There may be some opportunity to break
> these dependencies with additional overhead, but it's unclear if the
> trade-off is worth it (this requires experimentation). The loop filter
> definitely falls into this category, as the precise output is extremely
> dependent on the order in which it is applied to each edge, since all of
> the edges overlap.
> The ability to specify parallelism with a few #pragma statements in
> OpenMP is, however, appealing, as this kind of code is hard to get
> right, and this would be much easier than maintaining a separate,
> multi-threaded version of the code. It also provides a portable means of
> adapting to the number of cores, etc., available.
OpenMP could be a good choice, perhaps a mixed with pthread could be better.
I did a profile and I could see that the hot spots of algorithm are
oc_state_frag_recon and oc_state_loop_filter_frag_rows and related
functions. They correspond about 60% of the total running time of the
decoder. I will try to work initially in this part.
Do you have any other suggestion? Should I focus on maintainability
code or on performance?
> theora-dev mailing list
> theora-dev at xiph.org
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP
More information about the theora-dev