[theora-dev] Multithread support

Wed Feb 4 03:31:44 PST 2015

M. Pabis wrote:
> 1. Each thread deals with frames from intra frame up to next intra frame
> - 1;

This works if you know where the intra frames are. Currently the frame
type decision is made by trying to encode as an inter frame, while
keeping statistics on the expected rate and distortion from using all
intra modes during mode decision. Then, if it looks like an all-intra
frame is likely to be more efficient, the frame is re-encoded as a
keyframe. There is no lookahead at all.
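
As a rough illustration of that decision (a sketch only, not libtheora's
actual code), assume each frame yields two hypothetical numbers: the cost
of the inter encode and the estimated cost had every block used an intra
mode. The names FrameStats, inter_cost and all_intra_cost are invented
for this example.

```python
from dataclasses import dataclass


@dataclass
class FrameStats:
    # Hypothetical rate-distortion cost of the frame as encoded (inter).
    inter_cost: float
    # Hypothetical estimate of the cost had all blocks used intra modes,
    # gathered as a side effect of mode decision.
    all_intra_cost: float


def choose_frame_type(stats: FrameStats) -> str:
    """Return "key" if the all-intra estimate beats the inter encode."""
    return "key" if stats.all_intra_cost < stats.inter_cost else "inter"


def decide_types(frame_stats):
    """Decide each frame in order, looking only at that frame (no lookahead)."""
    return [choose_frame_type(s) for s in frame_stats]
```

The point is that the type of frame N is only known after frame N has
been tried as an inter frame, so keyframe positions cannot be handed out
to threads in advance.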

You could certainly do this in two-pass mode, since the first pass can
tell you where the keyframes fall, but the first pass is not very much
faster than the second pass. In fact, I'm pretty sure you could do this
without any modification to libtheora at all.

> 2. Each thread deals with 1/n-th of the duration, and all outputs are
> finally concatenated.

This is pretty similar to 1, except that you can be more relaxed about 
picking your partition points (i.e., if you put a keyframe in the wrong 
place 4 or 8 times in a whole sequence, the overhead will not be that 
large). Again, I think you can do this with no modifications to 
libtheora at all.
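
A minimal sketch of picking the partition points, assuming the total
frame count is known up front. split_frames is a hypothetical helper,
not part of libtheora; each range would be handed to its own encoder
instance, which starts its segment with a keyframe.

```python
def split_frames(total_frames: int, n: int):
    """Split [0, total_frames) into n contiguous half-open ranges.

    Ranges differ in length by at most one frame.
    """
    if n <= 0 or total_frames < n:
        raise ValueError("need at least one frame per segment")
    base, extra = divmod(total_frames, n)
    segments = []
    start = 0
    for i in range(n):
        # The first `extra` segments absorb the remainder, one frame each.
        length = base + (1 if i < extra else 0)
        segments.append((start, start + length))
        start += length
    return segments
```

For example, split_frames(10, 3) gives [(0, 4), (4, 7), (7, 10)].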

In both cases the real trick will be rate control, since unless you're 
doing average bitrate, the number of bits you want to spend on each 
segment can vary quite a lot. If you are doing average bitrate, then 
this is easy.
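
To see why the average-bitrate case is easy: each segment's budget is
just the target bitrate times its duration, so segments can be encoded
independently. A small sketch, where segment_bit_budgets and its
parameters are names made up for this example:

```python
from fractions import Fraction


def segment_bit_budgets(segments, bitrate_bps, fps):
    """Bit budget per segment for a constant average bitrate.

    segments:    list of (start, end) half-open frame ranges
    bitrate_bps: target average bitrate in bits per second
    fps:         frame rate as a Fraction (numerator / denominator)
    """
    budgets = []
    for start, end in segments:
        duration = Fraction(end - start) / fps  # seconds, exact
        budgets.append(int(bitrate_bps * duration))
    return budgets
```

With any other rate control target, the right budget for a segment
depends on how hard that segment is to code relative to the rest, which
is not known without looking at the whole sequence.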

This is what sites like YouTube already do to reduce the latency between
video upload and the video becoming available, and you can do this even with
an encoder that is itself multithreaded (i.e., splitting across multiple 
machines instead of threads). Whether or not your encoder is 
multithreaded just controls how many segments you need to split the 
sequence into for a desired degree of parallelism.
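
Put as arithmetic (a sketch; segments_needed and its parameters are
hypothetical names, and the encoder is assumed to scale cleanly to its
thread count):

```python
def segments_needed(desired_parallelism: int, threads_per_encoder: int) -> int:
    """Segments required so that segments * threads covers the target."""
    if desired_parallelism <= 0 or threads_per_encoder <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division.
    return -(-desired_parallelism // threads_per_encoder)
```

A single-threaded encoder aiming at 16-way parallelism needs 16
segments; one that uses 4 threads needs only 4.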

> 3. Maybe not a multithreading, but parallel/vector computing - encoding
> one frame, divided into small areas and processed on OpenCL or CUDA.

Lots of people have tried to do something like this for various codecs,
but I'm not aware of anyone ever getting any real improvements. A lot of
the processing does not work well on a GPU, and the data marshalling to
get information back and forth between the CPU and GPU tends to wipe out
the gains from parallelism.
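
A toy model of that trade-off, with all inputs hypothetical (these are
not measurements of any real codec or GPU): the offloaded path has to
pay for the transfers in both directions on top of the GPU compute time.

```python
def offload_speedup(cpu_ms: float, gpu_ms: float, transfer_ms: float) -> float:
    """Speedup of offloading one stage, counting CPU<->GPU transfer time.

    cpu_ms:      time for the stage on the CPU
    gpu_ms:      time for the same stage on the GPU
    transfer_ms: total time to copy inputs over and results back
    """
    return cpu_ms / (gpu_ms + transfer_ms)
```

Even if the GPU runs the stage five times faster (say 10 ms down to
2 ms), a transfer cost above 8 ms per frame in this model brings the
speedup below 1, i.e. a net loss.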

I would personally suggest not wasting time on this approach.

> right? As this is a variation of concept #1 you described, CUDA and
> OpenCL have efficient mechanisms to deal with synchronization, memory
> sharing etc. This approach probably would benefit with higher

Well, they mostly deal with it by not synchronizing. I.e., to get good
performance you need something on the order of 1000-way parallelism
among tasks that do not have to synchronize with each other. They have
grown some synchronization mechanisms, but those mechanisms carry a huge
performance penalty on these architectures.