[CELT-dev] Mixing of compressed streams

Tue Nov 29 03:44:01 PST 2011

Thanks gents for the thoughtful replies.

My application is for very high quality music in an embedded processor
system. CPU cycles are very precious. I am mixing 6 incoming streams of (mono)
audio. Currently I decode then mix, which works fine. However, while reading
through a Microsoft patent (7,460,495) I noticed the inventor's reference to
mixing encoded frames. Since there are significantly fewer bytes of
compressed data than there are unencoded 2-byte samples per frame, I assumed
that mixing the compressed bytes would result in a more efficient mixing
method. Perhaps this is not the case. It would make for an interesting
experiment:)
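
For reference, the "decode then mix" step described above can be sketched as
follows. This is a minimal illustration, not the code used in the actual
system: it assumes each of the streams has already been decoded to 16-bit
mono PCM, and the names saturate16 and mix_mono_frames are hypothetical. It
costs one addition per input sample plus one clamp per output sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit accumulator to the signed 16-bit sample range. */
static int16_t saturate16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/*
 * Sum num_streams decoded mono frames into out, sample by sample.
 * in[s] points to frame_size samples of stream s (assumed layout).
 * A 32-bit accumulator avoids overflow before the final clamp.
 */
static void mix_mono_frames(const int16_t *const in[], size_t num_streams,
                            int16_t *out, size_t frame_size)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < frame_size; i++) {
        int32_t acc = 0;
        for (size_t s = 0; s < num_streams; s++)
            acc += in[s][i];
        out[i] = saturate16(acc);
    }
}
```

Hard clipping is the simplest choice here; a real mixer might instead scale
the inputs or apply a limiter to avoid audible distortion on loud passages.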

Your suggestion of "cut-through" mixing has been used for years. It's been
called "voice activity detection" (StrataCom/Cisco) as well as other names.
You may want to investigate IPR on this before putting too much effort into
it.

Also, your comment that neither CELT nor Opus has been "optimized" is
interesting. I would love to see some additional optimization
done at some point in the future. Again, in battery powered, portable
embedded systems every CPU cycle is precious. Currently the limiting factor
in my design is the encode and decode time of the codec.

Thanks again for your great work! 

MikeH

-----Original Message-----
From: Gregory Maxwell [mailto:gmaxwell at gmail.com] 
Sent: Monday, November 28, 2011 7:45 PM
To: bens at alum.mit.edu
Cc: Mike Hooper; celt-dev at xiph.org
Subject: Re: [CELT-dev] Mixing of compressed streams

On Mon, Nov 28, 2011 at 6:18 PM, Benjamin M. Schwartz
<bmschwar at fas.harvard.edu> wrote:
> Of course it's always possible: just decompress, mix, and recompress!
>
> The usual question is: how much CPU does this cost, and can we save some
> by not fully decoding the streams?
> The answer is:
> 1.  It doesn't cost a lot of CPU.  Opus is designed to be CPU-efficient
> for both encode and decode.
> 2.  The best way to save CPU is probably to optimize the encoder and
> decoder, which presently have very little in the way of performance
> optimizations.
> 2a.  Once the encoder and decoder are highly optimized, then you can
> probably save an additional 10-20% of CPU time by implementing a
> transform-domain mixer, provided that the two streams are in the same
> mode.
>
> The problem with 2a is to convince yourself that the CPU time is really
> more expensive than the engineers' salaries.

Great observation that the best way to speed up mixing is to speed up
the codec. This is very true, especially now, given the relative lack
of effort that has gone into platform-specific performance
optimization, though I wouldn't have thought of it.

One possibility you did not mention is the kind of 'mixing' where you
hard switch based on activity, e.g. automatic half-duplex cut-through
mixing, whoever is talking is 100% of the output (presumably for
everyone but them). You can do this cheaply without actually mixing
(though with some possibility of glitching)... if we provided a
function to give you an activity level from a stream without doing a
full decode.
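
A rough sketch of the selection step in such a cut-through mixer is below.
It assumes some per-stream activity level is already available as an
integer; no such function exists in the library as of this thread, so the
activity[] values, the select_talker name, and the threshold and margin
parameters are all hypothetical. The margin keeps the current talker
selected unless another stream is clearly louder, which is one simple way
to limit rapid switching.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Choose which stream to forward unmixed.
 * activity[s] is a hypothetical activity level for stream s.
 * current is the stream forwarded last time, or -1 if none.
 * Returns the index to forward, or -1 if no stream reaches threshold.
 */
static int select_talker(const int activity[], size_t num_streams,
                         int current, int threshold, int margin)
{
    int best = -1;

    assert(activity != NULL);
    for (size_t s = 0; s < num_streams; s++) {
        if (activity[s] >= threshold &&
            (best < 0 || activity[s] > activity[best]))
            best = (int)s;
    }
    if (best < 0)
        return -1;  /* nobody is active */

    /* Keep the current talker unless the best one beats it by margin. */
    if (current >= 0 && (size_t)current < num_streams &&
        activity[current] >= threshold &&
        activity[best] < activity[current] + margin)
        return current;

    return best;
}
```

The compressed packets of the selected stream would then be passed through
untouched to every listener except the talker, with no decode or re-encode.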

I think this would be less work than many other things; it's mostly
copying and pasting to make a really abbreviated decoder... though it's
not suitable for mixing except in the most constrained environments.