[theora-dev] Multithread support
Timothy B. Terriberry
tterribe at xiph.org
Wed Feb 4 03:31:44 PST 2015
M. Pabis wrote:
> 1. Each thread deals with frames from intra frame up to next intra frame
> - 1;
This works if you know where the intra frames are. Currently the frame
type decision is made by trying to encode as an inter frame, and keeping
statistics on the expected rate and distortion from using all intra modes
during mode decision. Then if it looks like an all-intra frame is likely
to be more efficient, the frame is re-encoded as a keyframe. There is no
lookahead at all.
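A minimal sketch of that decision, in Python for brevity. It is not libtheora's actual code: the function names, the (rate, distortion) pairs and the Lagrangian weighting `lam` are all assumptions made for illustration.

```python
def rd_cost(rate_bits, distortion, lam):
    # Assumed Lagrangian rate-distortion cost: lower is better.
    return distortion + lam * rate_bits


def decide_frame_type(inter_stats, intra_stats, lam):
    """Pick a frame type from two (rate_bits, distortion) pairs.

    inter_stats: measured while encoding the frame as an inter frame.
    intra_stats: the estimate accumulated during mode decision of what
                 the same frame would cost using all intra modes.
    Both inputs are hypothetical stand-ins for encoder statistics.
    """
    inter = rd_cost(inter_stats[0], inter_stats[1], lam)
    intra = rd_cost(intra_stats[0], intra_stats[1], lam)
    # Only if all-intra looks cheaper is the frame re-encoded as a keyframe.
    return "key" if intra < inter else "inter"
```

The point to notice is that the decision needs the inter encode of the current frame to have happened already, which is why keyframe positions are not known ahead of time.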
You could certainly do this in two-pass mode, but the first pass is not
very much faster than the second pass, so finding the keyframes that way
costs nearly as much as a full encode. In fact, I'm pretty sure you
could do this without any modification to libtheora at all.
> 2. Each thread deals with 1/n-th of the duration, and all outputs are
> finally concatenated.
This is pretty similar to 1, except that you can be more relaxed about
picking your partition points (i.e., if you put a keyframe in the wrong
place 4 or 8 times in a whole sequence, the overhead will not be that
large). Again, I think you can do this with no modifications to
libtheora at all.
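As a rough sketch of the partitioning step only, the following splits a frame count into n contiguous ranges. The function name and the half-open (start, end) convention are assumptions. Each range would be handed to its own encoder instance, so the first frame of each range becomes a keyframe.

```python
def split_segments(total_frames, n):
    """Split frames [0, total_frames) into at most n contiguous ranges.

    Returns a list of half-open (start, end) frame ranges whose lengths
    differ by at most one frame. Empty ranges are dropped, so fewer than
    n ranges come back when there are fewer frames than segments.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    base, extra = divmod(total_frames, n)
    segments = []
    start = 0
    for i in range(n):
        # The first `extra` segments each take one leftover frame.
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue
        segments.append((start, start + length))
        start += length
    return segments
```

For example, `split_segments(10, 4)` gives `[(0, 3), (3, 6), (6, 8), (8, 10)]`. Concatenating the per-range outputs in order is then a separate muxing step not shown here.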
In both cases the real trick will be rate control, since unless you're
doing average bitrate, the number of bits you want to spend on each
segment can vary quite a lot. If you are doing average bitrate, then
this is easy.
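For the average-bitrate case, a minimal sketch of why it is easy: each segment's budget is just the target rate times the segment's duration. The function and parameter names are assumptions, and segments are taken to be half-open (start, end) frame ranges.

```python
def segment_bit_budgets(segments, fps, target_bps):
    """Bits to spend on each segment under a simple average-bitrate target.

    segments:   list of half-open (start_frame, end_frame) ranges.
    fps:        frames per second of the source.
    target_bps: desired average bitrate in bits per second.
    """
    budgets = []
    for start, end in segments:
        duration_s = (end - start) / fps
        budgets.append(round(target_bps * duration_s))
    return budgets
```

For any other rate control mode this proportional split no longer holds, since a segment's fair share depends on how hard its content is to code, and that information would have to come from somewhere else (a first pass, for instance).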
This is what sites like YouTube already do to reduce the latency between
video upload and a video being available, and you can do this even with
an encoder that is itself multithreaded (i.e., splitting across multiple
machines instead of threads). Whether or not your encoder is
multithreaded just controls how many segments you need to split the
sequence into for a desired degree of parallelism.
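The relationship in the last sentence is simple arithmetic. A sketch, with hypothetical names:

```python
def segments_needed(desired_parallelism, threads_per_encoder=1):
    """Number of segments to split into so that
    segments * threads_per_encoder covers desired_parallelism."""
    if desired_parallelism < 1 or threads_per_encoder < 1:
        raise ValueError("both arguments must be at least 1")
    # Integer ceiling division.
    return -(-desired_parallelism // threads_per_encoder)
```

With a single-threaded encoder, 16-way parallelism needs 16 segments. With an encoder that itself uses 4 threads, 4 segments are enough.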
> 3. Maybe not a multithreading, but parallel/vector computing - encoding
> one frame, divided into small areas and processed on OpenCL or CUDA.
Lots of people have tried to do something like this for various codecs,
but I'm not aware of anyone ever getting any real improvements. A lot of
the processing does not work well on a GPU, and the data-marshalling to
get information back and forth between the CPU and GPU tends to wipe out
the gains from parallelism.
I would personally suggest not wasting time on this approach.
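A toy model of the marshalling problem, not a measurement: the function name and the idea of lumping all copies into a single transfer time are assumptions, and any numbers fed to it are purely illustrative.

```python
def offload_speedup(cpu_time, gpu_compute_time, transfer_time):
    """Speedup from moving a stage to the GPU, including copy overhead.

    cpu_time:         time to do the work on the CPU.
    gpu_compute_time: time the GPU spends computing the same work.
    transfer_time:    total time spent copying data to and from the GPU.
    A result at or below 1.0 means the offload did not help.
    """
    return cpu_time / (gpu_compute_time + transfer_time)
```

In this model a kernel that is 10x faster than the CPU (`cpu_time=10.0`, `gpu_compute_time=1.0`) breaks even once transfers cost 9.0 time units, and is a net loss beyond that.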
> right? As this is a variation of concept #1 you described, CUDA and
> OpenCL have efficient mechanisms to deal with synchronization, memory
> sharing etc. This approach probably would benefit with higher
Well, they mostly deal with it by not synchronizing. I.e., to get good
performance you need something on the order of 1000-way parallelism
among tasks that do not have to synchronize with each other. GPUs have
grown some synchronization mechanisms, but using them carries a huge
performance penalty on these architectures.