[theora-dev] Parallel processing for Theora?

Sun Mar 21 07:04:08 PDT 2010

Id Kong wrote:
>> We've talked about it, but one of the problems is that the current API
>> is one-frame-in-one-frame-out, so it's not clear how to pass multiple

This is _a_ problem, certainly, but probably not a blocking one. I
wanted to see how well I could actually make within-frame parallelism
work with acceptable performance before deciding if an API change was
necessary.
The real use-case for a parallel encoder is live streaming, since
otherwise you can just encode lots of different videos at once (and the
latter will always be more efficient if you actually have lots of videos
and don't care about latency).

>> There is some scope for multithreading the per-frame pipeline, but
>> that only scales to three or four threads.

The work I had in the pipeline was for parallel _decoding_ first,
because this is considerably easier. If it works out, parallel encoding
can be done within the same framework. The design work for this is
already done, and I had started on the implementation, but it got put
down as priorities changed and will probably not get picked up again
before the 1.2 release.

It's unclear how well it will scale. Years ago we did a (very simple)
parallel decoder that only partitioned things by color plane, and the
speed-up was disappointing. There was also a GSoC project that attempted
to improve on this, but it was not successfully completed. In theory you
could have a separate thread for every MCU (a band of rows 64 pixels
high for 4:2:0, 32 pixels high for 4:2:2 and 4:4:4). However,
within-frame parallelism is fairly fine-grained, and the overhead of a
standard mutex-based library like pthreads is very large for this kind
of work. People
have reported getting speed-ups on FFT workloads as small as 10,000
cycles with lock-free algorithms (by comparison, a single pthread mutex
acquisition could take thousands of cycles by itself), but they made
very specific assumptions about architecture, cache line size, etc.
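
To make the granularity concrete, here is a rough sketch in C11 of how
MCU bands could be counted and handed out to worker threads with a
single atomic counter instead of a mutex. This is not the design
mentioned above and not libtheora code: the names pixel_fmt, mcu_height,
mcu_count, mcu_queue and mcu_queue_claim are all made up for
illustration, and it ignores any dependencies between bands that a real
decoder would have to respect.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the pixel formats named in the text;
   these are NOT the libtheora enum values. */
typedef enum { FMT_420, FMT_422, FMT_444 } pixel_fmt;

/* MCU height in pixel rows, using the figures quoted above:
   64 for 4:2:0, 32 for 4:2:2 and 4:4:4. */
int mcu_height(pixel_fmt fmt) {
  return fmt == FMT_420 ? 64 : 32;
}

/* Number of MCU bands needed to cover a frame, rounding up. */
int mcu_count(int frame_height, pixel_fmt fmt) {
  int h = mcu_height(fmt);
  assert(frame_height > 0);
  return (frame_height + h - 1) / h;
}

/* Shared work queue: just the index of the next unclaimed band. */
typedef struct {
  atomic_int next;
  int count;
} mcu_queue;

void mcu_queue_init(mcu_queue *q, int frame_height, pixel_fmt fmt) {
  atomic_init(&q->next, 0);
  q->count = mcu_count(frame_height, fmt);
}

/* Claim the next band without taking a mutex. Returns the band index,
   or -1 once every band has been handed out. */
int mcu_queue_claim(mcu_queue *q) {
  int i = atomic_fetch_add(&q->next, 1);
  return i < q->count ? i : -1;
}
```

For a 720-row 4:2:0 frame this gives only 12 bands, so each worker
would loop on mcu_queue_claim() until it returns -1. Even in this
best case the per-band work is small, which is why the synchronization
cost matters so much.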

In short, this requires some non-trivial engineering to get it to work
well, and for the moment my priorities are still on improving encoder
quality before worrying about encoder speed. If there are qualified
people out there willing to work on this, I'd be happy to explain the
details of what needs to be done.