[opus] Antw: Re: Antw: [EXT] Opus merging streams
decoy at iki.fi
Wed Apr 6 12:02:25 UTC 2022
On 2022-04-06, Sampo Syreeni wrote:
>> For your information, I'm using an ARM M4F with Opus configured like this
>> (40ms, 16kHz, 16 bitrate, 0 compres).
> Just unpack and sum.
To affirm: what you need to do is find a suitable output buffer
synchronization algorithm, and then parallelize the fuck out of your
code. In modern embedded and mobile architectures, your performance will
be dictated by how many cores you utilize, consistently, not by exact
optimizations in how you utilize your libraries.
Since 40ms at 16kHz is 640 samples, you'd do a 1280-sample circular
buffer for each input. You'd do a balanced tree of pairwise sums of
those towards a common output buffer, using intermediate buffers,
preferably held in a private and faster memory space for each core.
Preferably the first-level cache, but probably L2. Unlike the input
buffers, you'd mostly want to treat the secondary buffers as reset to
the beginning, or otherwise cache aligned; timebase alignment can be
done with the freely running ring buffers at the first stage, and isn't
much needed further down the line. There is a cost in latency, but you
can easily round it down to a cache line, so something like 32-64
bytes. That divides 640 16-bit samples (1280 bytes) evenly, into forty
or twenty cache lines.
Do note that most multiprocessor architectures can do one of two
things: 1) NUMA architectures expose private memory, which doesn't
need any synchronization in-thread, and 2) where uniform, cache-coherent
memory access is available, there are often primitives which allow a
certain memory location to be added to coherently. Use the first idea to
add up parts of the addition tree without synchronization, and the
second to synchronize within the shared-memory part of the tree. Just
add in what you have at linearly increasing offsets, and let the cache
sort it out. Otherwise try to maintain a scheduling round which writes
linearly into shared memory and only synchronizes on cache line
boundaries.
Now you have an at most 80ms buffer for arbitrary timebase alignment and
a fixed Opus block width. That's rather a lot, but you can scale down
from it in powers of two if you're willing to take the hit in prediction
efficiency. Your architecture is fully synchronised from the first layer
down the addition tree, and you can use separate cores to populate the
separate input buffers, preferably fed from separate TCP/UDP streams.
That's about as good as you can get, presuming a fixed block width from
Opus. Which it has; it isn't a zero-delay codec, after all.
Sampo Syreeni, aka decoy - decoy at iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2