<div dir="ltr">Hi, thanks for some <div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 4, 2015 at 5:17 AM, Timothy B. Terriberry <span dir="ltr"><<a href="mailto:tterribe@vt.edu" target="_blank">tterribe@vt.edu</a>></span> wrote:</div><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
> I don't believe anyone has been working on this for some years. There are two basic approaches.
>
> One is threading within a single frame, which does not require any API behavior changes. In theory you can scale to a fairly decent number of threads everywhere except the final conversion from tokens to VLC codes in oc_enc_frame_pack(). However, the units of work are sufficiently small and the task dependencies sufficiently involved that this needs some kind of lock-free work-stealing queues to have a hope of getting more benefit from the parallelism than you pay in synchronization overhead. I'd started designing one with the hope that all memory allocations could be done up-front at encoder initialization (to avoid locking contention there), but this turns out to be sufficiently different from how most lock-free data structures worked at the time that it was a fair amount of work. I've been meaning to look at what Mozilla's Servo project is doing for this these days (since they have similar challenges).
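
If I understand that design correctly, it would be something along the lines of a Chase-Lev work-stealing deque with all storage preallocated. A minimal, untested sketch with C11 atomics (the subtle owner-side pop is elided; the ws_* names are mine, nothing here is actual libtheora code):

/* Fixed-capacity, lock-free work-stealing deque (Chase-Lev style).
 * The ring buffer is allocated up front, so no locking or allocation
 * happens after encoder init.  Hypothetical sketch, not libtheora code. */
#include <stdatomic.h>
#include <stddef.h>

#define WS_CAP 1024  /* must be a power of two */

typedef struct { void (*run)(void *ctx); void *ctx; } ws_task;

typedef struct {
  _Atomic size_t top;     /* incremented by stealing threads */
  _Atomic size_t bottom;  /* owned by the pushing thread */
  ws_task buf[WS_CAP];    /* preallocated ring buffer */
} ws_deque;

/* Owner thread only: returns 0 when full (caller runs the task inline). */
static int ws_push(ws_deque *q, ws_task t) {
  size_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
  size_t top = atomic_load_explicit(&q->top, memory_order_acquire);
  if (b - top >= WS_CAP) return 0;
  q->buf[b & (WS_CAP - 1)] = t;
  atomic_store_explicit(&q->bottom, b + 1, memory_order_release);
  return 1;
}

/* Any other thread: returns 1 and fills *out if a task was stolen. */
static int ws_steal(ws_deque *q, ws_task *out) {
  size_t top = atomic_load_explicit(&q->top, memory_order_acquire);
  size_t b = atomic_load_explicit(&q->bottom, memory_order_acquire);
  if (top >= b) return 0;  /* empty */
  *out = q->buf[top & (WS_CAP - 1)];
  /* Race against other thieves (and the owner) by claiming the slot. */
  return atomic_compare_exchange_strong_explicit(&q->top, &top, top + 1,
      memory_order_seq_cst, memory_order_relaxed);
}
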
> The other is traditional FFmpeg-style frame threading, which gives each thread a separate frame to encode, and merely waits for enough rows of the previous frame to be finished so that it can start its motion search. This is generally much more effective than threading within a frame, but a) requires additional delay (the API supports this in theory, but software using that API might not expect it, so it would have to be enabled manually through some sort of th_encode_ctl call) and b) requires changes to the rate control to deal with the fact that statistics from the previous frame are not immediately available. b) was the real blocker here.
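
For reference, the per-row handshake that implies could look like the sketch below, modeled loosely on FFmpeg's report/await-progress mechanism (hypothetical names, untested, not existing libtheora API):

/* One tracker per in-flight frame; initialize with pthread_mutex_init()
 * and pthread_cond_init(), and set rows_done = -1. */
#include <pthread.h>

typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t cond;
  int rows_done;  /* highest fully reconstructed row so far */
} frame_progress;

/* Called by the thread encoding frame N-1 after finishing each row. */
static void progress_report(frame_progress *p, int row) {
  pthread_mutex_lock(&p->lock);
  if (row > p->rows_done) p->rows_done = row;
  pthread_cond_broadcast(&p->cond);
  pthread_mutex_unlock(&p->lock);
}

/* Called by the thread encoding frame N before its motion search touches
 * reference rows up to and including 'row'. */
static void progress_await(frame_progress *p, int row) {
  pthread_mutex_lock(&p->lock);
  while (p->rows_done < row) pthread_cond_wait(&p->cond, &p->lock);
  pthread_mutex_unlock(&p->lock);
}

The waiting frame only has to await as many extra rows as the motion-vector range can reach below the current row, which is why the delay is much smaller than waiting for the whole previous frame.
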
I have read the Theora specification (the March 2011 revision) and I have some more ideas:

1. Each thread deals with the frames from one intra frame up to (but not including) the next intra frame.
2. Each thread deals with 1/n-th of the total duration, and all the outputs are concatenated at the end (see the sketch at the end of this message).
3. Maybe not multithreading as such, but parallel/vector computing: encode one frame divided into small areas, processed with OpenCL or CUDA.

I'm aware these are rather naive approaches, mostly because they need to have enough data available up front. For idea 1, stream encoding would also introduce some latency; and since today's processors can already encode in real time, there is no speedup to gain for streamed video anyway. Perhaps the spare cycles would be better spent on finding better compression.

Idea 2 is totally naive, but if the whole video is available, the speedup should be almost linear.

About idea 3: well, it's a vendor lock-in ;-) but hey, better that than nothing, right? It is a variation of the first concept you described, and CUDA and OpenCL have efficient mechanisms for synchronization, memory sharing and so on; this approach would probably pay off most at higher resolutions. CUDA and/or OpenCL could also carry out concept 2, with the same limitations, unfortunately.
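To make idea 2 concrete, here is a rough, untested sketch of the chunked approach. encode_chunk() and the chunk file names are hypothetical placeholders (not libtheora API), and real code would still have to handle Ogg stream chaining or re-muxing when joining the pieces:

/* Encode NTHREADS contiguous chunks of the input in parallel, one
 * private encoder instance per thread, then join the outputs. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

typedef struct {
  int first_frame;       /* index of the first frame in this chunk */
  int nframes;           /* chunk length; encode_chunk() clamps the last one */
  const char *out_path;  /* per-chunk output file */
} chunk_job;

/* Assumed helper: runs its own th_encode_ctx over one slice of the input,
 * forcing a keyframe at the chunk start, and writes one Ogg file. */
extern void encode_chunk(const chunk_job *job);

static void *worker(void *arg) {
  encode_chunk((const chunk_job *)arg);
  return NULL;
}

int encode_parallel(int total_frames) {
  pthread_t tid[NTHREADS];
  chunk_job job[NTHREADS];
  char path[NTHREADS][32];
  int per = (total_frames + NTHREADS - 1) / NTHREADS;
  for (int i = 0; i < NTHREADS; i++) {
    snprintf(path[i], sizeof path[i], "chunk%d.ogv", i);
    job[i].first_frame = i * per;
    job[i].nframes = per;
    job[i].out_path = path[i];
    pthread_create(&tid[i], NULL, worker, &job[i]);
  }
  for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
  /* Concatenating the chunk files yields a chained Ogg stream, which is
   * legal per the Ogg spec but handled poorly by some players, so a
   * re-muxing pass may be needed in practice. */
  return 0;
}

The cost is one extra forced keyframe per chunk boundary, but with the whole video on disk the scaling really should be close to linear.
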
-- 
Best regards
Mateusz Pabis