[CELT-dev] Opus for audiobooks etc

Thu Nov 17 12:42:00 PST 2011

On Thu, Nov 17, 2011 at 2:41 PM, Daniel Jensen <jensend at iname.com> wrote:
> The only comment I've seen about use of Opus for audiobooks was jmvalin
> saying in response to someone on his blog that Opus's ability to do
> fullband would be a key advantage here. This seems kind of
> counterintuitive to me- can people even ABX human speech at a 32 or even
> 24kHz sample rate from speech at 48kHz, much less hear a large quality
> difference? A number of audiobooks I've listened to have used 22kHz mp3s
> without being clearly objectionable, and in my personal use I've had
> decent results using the -voice LAME setting (downsamples to 32kHz and
> encodes as 56kbps abr).

22kHz speech isn't "objectionable", but it's trivially ABXable, at least
if the speech was recorded with full bandpass. 32kHz vs 48kHz may
not be ABXable for speech (or even for music for many adults!), but you
get the extra extension for free in Opus.

The low bandpass can be objectionable for the music parts in mixed content.

Keep in mind that communication codecs usually do wideband at 16kHz,
which sounds clearly and obviously worse for speech (although not
objectionable).

Versus MP3, Opus is just a lot more efficient.

[snip]
> Hoene's results showed it losing pretty convincingly to AMR-WB+ (which
> was able to use 4x larger frame sizes) at 32kbps. (How much of this was
> due to the test being stereo, I wonder? Some mono tests seem to have
> given 32kbps Opus rather high marks.)

IIRC he was testing some rather torturous samples with different speakers
running concurrently in different ears, and at rates lower than we'd
recommend for general stereo. I believe the goal was mostly to make sure
the codec didn't blow up or perform too terribly.

(You can do things like pan-potted mono down to lower rates in Opus, but
full stereo needs some more bitrate.)

The encoder is now more aggressive at flattening the audio to mono
at very low rates.

> For audiobook use, I don't know that the SILK modes or anything else
> with that low of a bitrate will be good enough, and when you're storing
> hundreds of hours of speech 64kbps adds up fast. I'd guess the sweet
> spot for audiobooks would be between 20 and 32 kbps, and this seems to
> my unschooled understanding to be a region where Opus's low delay might
> put it at a serious disadvantage.

Well... Disadvantage compared to what?

If you're able to get licenses for USAC under $2 per decoder I'll be
surprised. At those bitrates USAC may well turn out to work better;
hundreds of milliseconds of delay can be helpful. :) Considering the
licensing and the wider use cases (VoIP as well as high-delay stuff), I
hope and expect Opus to be much more widely deployed.

If your comparison points are Vorbis, MP3, Speex (or another
pure-communication codec), or AAC, it should be no contest.

> Other than just being curious in general about what folks have to say
> about audiobook use, I'm curious about one thing in particular-- how
> feasible would it be to use larger frame sizes (e.g. matching SILK
> mode's 60ms maximum) for Opus, especially for the hybrid mode, and what
> would the potential for improved quality be?

Audiobook use was a consideration for us and it was one of the drivers behind
the codec's ability to do seamless mode switching.

Our higher latency modes (>>20ms) are mostly about reducing IP/UDP/RTP
overhead, an issue you won't have for Opus in Ogg.
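
To put rough numbers on that overhead, here is a minimal sketch in Python. It
assumes IPv4 without options (20 bytes), UDP (8 bytes), and a bare RTP header
(12 bytes) on every packet, with one codec frame per packet. The function names
are made up for illustration and nothing here comes from the Opus code itself.

```python
# Per-packet header sizes in bytes. Assumptions: IPv4 without options,
# UDP, and RTP with no CSRC list or header extensions.
IPV4_HEADER = 20
UDP_HEADER = 8
RTP_HEADER = 12


def header_overhead_kbps(frame_ms):
    """Header bitrate in kbit/s when each packet carries one frame of frame_ms."""
    packets_per_second = 1000.0 / frame_ms
    header_bits = (IPV4_HEADER + UDP_HEADER + RTP_HEADER) * 8
    return packets_per_second * header_bits / 1000.0


def overhead_table(frame_sizes_ms=(10, 20, 40, 60)):
    """Map each frame size in ms to its header overhead, rounded to 2 decimals."""
    return {ms: round(header_overhead_kbps(ms), 2) for ms in frame_sizes_ms}


if __name__ == "__main__":
    for ms, kbps in overhead_table().items():
        print(f"{ms:>3} ms frames: {kbps:6.2f} kbit/s of headers")
```

Under those assumptions, 20ms frames cost 16 kbit/s in headers alone, while 60ms
frames cost about 5.3 kbit/s, which is a big deal next to a 20-32 kbit/s
payload. None of that per-packet cost applies to a stored file.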

For your application the improvements to encoder VBR and automatic speech
detection that we're currently working on will probably be relevant. This
use case would probably also benefit from additional look-ahead in the encoder
(and potentially two-pass rate control).