[opus] Antw: Re: Antw: [EXT] Opus merging streams

Wed Apr 6 12:02:25 UTC 2022

On 2022-04-06, Sampo Syreeni wrote:

>> For your information, I'm using an ARM M4F with Opus configured like this 
>> (40ms, 16kHz, 16 bitrate, 0 compres).
>
> Just unpack and sum.

To affirm, what you need to do is find a suitable output buffer 
synchronization algorithm, and then parallelize the fuck out of your 
code. In modern embedded and mobile architectures, your performance will 
be dictated by how many cores you utilize, consistently, not by exact 
optimizations in how you utilize your libraries.

Since 40ms at 16kHz is 640 samples, you'd do a 1280 sample circular 
buffer for each input. You'd do a balanced tree of pairwise sums of 
those towards a common output buffer, using intermediate buffers, 
preferably held in private and faster memory space for each core. 
Preferably the first level cache, but probably L2. Unlike the input 
buffers, the secondary buffers are best treated as linear: reset to the 
beginning each round, or otherwise cache aligned. Time base alignment can 
be done with the freely running ring buffers at the first stage, and 
isn't much needed further down the line. There is a cost in latency, but 
you can easily round it down to a cacheline, so something like 32-64 
bytes. That divides evenly: 640 samples of 16-bit PCM are 1280 bytes, so 
either forty or twenty lines.
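
A minimal sketch of the above in C, assuming 16-bit PCM out of the 
decoder. All names here (FRAME, RING, sat_add16, ring_read, tree_sum) 
are made up for illustration and are not part of the Opus API. The 
reduction runs sequentially here; each pair at a given level is 
independent, so in practice each could go to its own core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME 640u           /* 40 ms at 16 kHz */
#define RING  (2u * FRAME)   /* 1280-sample ring per input */

/* Saturating 16-bit add, so loud inputs clip instead of wrapping. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* Copy one frame out of a free-running ring, starting at read index rd,
   into a linear scratch buffer that always starts at offset 0. */
static void ring_read(const int16_t ring[RING], size_t rd, int16_t out[FRAME])
{
    for (size_t i = 0; i < FRAME; i++)
        out[i] = ring[(rd + i) % RING];
}

/* Balanced pairwise reduction over n linear frames, in place.
   The mixed result ends up in frames[0]. */
static void tree_sum(int16_t frames[][FRAME], size_t n)
{
    assert(n > 0);
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            for (size_t k = 0; k < FRAME; k++)
                frames[i][k] = sat_add16(frames[i][k], frames[i + stride][k]);
}
```

Clipping at every level is the simplest choice but not the only one; 
summing into 32-bit intermediates and clipping once at the root would 
avoid early saturation at the cost of wider buffers.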

Do make note that most multiprocessor architectures can do either one of 
two things: 1) NUMA architectures expose private memory, which doesn't 
need any synchronization in-thread, and 2) where uniform, cache coherent 
memory access is available, often there are primitives which allow a 
certain memory location to be added to coherently. Use the first idea to 
add to parts of the addition tree without synchronization, and the second 
to synchronize with the shared memory part of the tree. Just add what you 
have at linearly increasing offsets, and then let the cache sort it out. 
Otherwise try to maintain a scheduling round which writes linearly in 
shared memory and only synchronizes on cache line boundaries.
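
As a rough illustration of the two cases, here is a sketch using C11 
atomics. It assumes the toolchain provides <stdatomic.h> and that the 
shared accumulator is 32 bits wide; add_private and add_shared are 
hypothetical helper names, not anything from a real library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Case 1: both buffers are private to this core, so a plain add suffices. */
static void add_private(int32_t *dst, const int16_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Case 2: fold a private partial sum into the shared accumulator,
   walking linearly increasing offsets. Relaxed ordering only makes each
   add atomic; the reader still needs a barrier or join of its own
   before it consumes the finished block. */
static void add_shared(_Atomic int32_t *shared, const int32_t *partial, size_t n)
{
    assert(shared != NULL && partial != NULL);
    for (size_t i = 0; i < n; i++)
        atomic_fetch_add_explicit(&shared[i], partial[i], memory_order_relaxed);
}
```

Each core would run add_private over its own subtree and call 
add_shared once per block, so the atomic traffic stays proportional to 
the number of cores rather than the number of inputs.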

Now you have at most an 80ms buffer for arbitrary timebase alignment and 
a fixed Opus blockwidth. It's rather a lot, but you can scale down from 
that by factors of two if you're willing to take the hit in efficiency of 
prediction. Your architecture is fully synchronised from the first layer 
down the addition tree, and you can use separate cores to populate the 
separate input buffers, preferably from separate TCP/UDP streams down 
the stack.
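
To put numbers on that scaling, a small sketch of the arithmetic, 
assuming the two-frame ring per input described earlier. frame_samples 
and ring_latency_us are hypothetical helpers; durations are in 
microseconds so that the 2.5ms frame stays an integer.

```c
#include <assert.h>
#include <stdint.h>

/* Samples in one frame of frame_us microseconds at rate_hz. */
static uint32_t frame_samples(uint32_t rate_hz, uint32_t frame_us)
{
    assert(rate_hz > 0 && frame_us > 0);
    return (uint32_t)((uint64_t)rate_hz * frame_us / 1000000u);
}

/* Worst-case alignment latency of a two-frame ring, in microseconds. */
static uint32_t ring_latency_us(uint32_t frame_us)
{
    return 2u * frame_us;
}
```

At 16kHz that works out to:

  frame    samples   ring latency
  40ms     640       80ms
  20ms     320       40ms
  10ms     160       20ms
  5ms      80        10ms
  2.5ms    40        5ms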

That's about as good as you can get, presuming a fixed block width from 
Opus. Which it has; it isn't a zero delay codec after all.
-- 
Sampo Syreeni, aka decoy - decoy at iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2