[opus] Opus application_mode==AUDIO, 20ms framing issue?

Mon Jun 13 17:10:05 UTC 2016

Hi Jean-Marc, 

Sorry for the late reply, and thanks for your interest. Quality is good for 10ms/audio and poorer for 20ms/audio; it is equivalent at 10ms and 20ms for mode=voip. PESQ was the tool that alerted me to something of interest, but I hardly trust PESQ at all! It's good for flagging relative differences, of course, but not absolutes. Bitrate here was 28kbps, but I hear the same thing at 32kbps.

Please find attached a zip file with the audio files, converted to .wavs for simpler listening.   

  https://www.dropbox.com/s/bzu4i3dmg5f91tv/20msAudioModeQuestion.zip?dl=0

If there is one single thing to listen to, it would be    

    ar3_20_audio.wav:  loop the "china hit" section starting at t=0.6s and listen for artifacts in the unvoiced speech.  The reference is ar3.wav.

and by comparison

    ar2_10_audio.wav   ( same segment, sounds more like the reference ar3.wav)

Here is a cat of the README.txt.   Thanks very much!

16bit, 16kHz input wav files (ar1, ar2, ar3), content from ~50Hz to near 8kHz.
All .pcm files are 16kHz, 16bit, signed ints, little (intel) endian.

./opus_demo -e voip 16000 1 28000  -framesize 20 ~/ar1.wav ar1_20_voip.bit 
./opus_demo -d 16000 ar1_20_voip.bit ar1_20_voip.pcm
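
The two commands above are one cell of the test matrix (3 files x 2 modes x 4 frame sizes). A minimal sketch of how the full set of command lines could be generated is below. It only builds the strings and does not run anything. It assumes the audio-mode runs used `-e audio` with otherwise identical arguments, and it copies the decoder line as written in this README (some opus_demo builds also expect a channel count after the decoder sampling rate). The function names are illustrative, not part of any Opus tool.

```python
from itertools import product

# Values taken from the README: three inputs, two modes, four frame sizes.
FILES = ["ar1", "ar2", "ar3"]
MODES = ["voip", "audio"]          # "audio" invocation is an assumption
FRAME_MS = [5, 10, 20, 40]
RATE_HZ, CHANNELS, BITRATE = 16000, 1, 28000


def build_commands(name, mode, frame_ms):
    """Return (encode, decode) command strings for one test cell."""
    stem = f"{name}_{frame_ms}_{mode}"
    encode = (
        f"./opus_demo -e {mode} {RATE_HZ} {CHANNELS} {BITRATE} "
        f"-framesize {frame_ms} ~/{name}.wav {stem}.bit"
    )
    # Decoder line mirrors the README as written.
    decode = f"./opus_demo -d {RATE_HZ} {stem}.bit {stem}.pcm"
    return encode, decode


def all_commands():
    """Flat list of every encode/decode command in the matrix."""
    cmds = []
    for name, mode, frame_ms in product(FILES, MODES, FRAME_MS):
        cmds.extend(build_commands(name, mode, frame_ms))
    return cmds


if __name__ == "__main__":
    print("\n".join(all_commands()))
```
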

opus_demo reports version:    libopus 1.1-alpha

Using recent pesq code compiled from src, +16000 option.
( same phenomenon seen with +16000 +wb option)  

                   5ms      10ms     20ms      40ms

ar1_NN_voip       4.314    4.493    4.488     4.488
ar2_NN_voip       4.346    4.442    4.436     4.474
ar3_NN_voip       3.993    4.375    4.414     4.390

ar1_NN_audio      4.292    4.485 -> 4.313     4.313
ar2_NN_audio      4.364    4.460 -> 4.350     4.350
ar3_NN_audio      3.924    4.327 -> 4.218     4.218

Note that this size/type of pesq test is insufficient to draw ANY conclusions.
However, it is useful for drawing attention to relative differences, that
might be interesting for HUMAN LISTENING.

So the question here was, is this pesq drop from 10ms to 20ms framesize, seen in the 
case of mode=AUDIO (but not VOIP)  something REAL?  It warranted listening.
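
To make the size of that drop explicit, here is a small sketch that computes the 20ms minus 10ms PESQ difference per file and mode. The scores are copied from the table above; the dictionary layout and function name are just for illustration.

```python
# PESQ scores from the table, keyed by (file, mode) then frame size in ms.
PESQ = {
    ("ar1", "voip"):  {5: 4.314, 10: 4.493, 20: 4.488, 40: 4.488},
    ("ar2", "voip"):  {5: 4.346, 10: 4.442, 20: 4.436, 40: 4.474},
    ("ar3", "voip"):  {5: 3.993, 10: 4.375, 20: 4.414, 40: 4.390},
    ("ar1", "audio"): {5: 4.292, 10: 4.485, 20: 4.313, 40: 4.313},
    ("ar2", "audio"): {5: 4.364, 10: 4.460, 20: 4.350, 40: 4.350},
    ("ar3", "audio"): {5: 3.924, 10: 4.327, 20: 4.218, 40: 4.218},
}


def delta_10_to_20(scores):
    """PESQ change when going from 10ms to 20ms frames (negative = drop)."""
    return round(scores[20] - scores[10], 3)


deltas = {key: delta_10_to_20(scores) for key, scores in PESQ.items()}

for (name, mode), d in deltas.items():
    print(f"{name}_NN_{mode:<5s}  {d:+.3f}")
```

With these numbers the audio-mode rows drop by roughly 0.11 to 0.17, while the voip rows stay within about 0.04 of their 10ms scores.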

( same results, interleaved mode=VOIP,AUDIO numbers ) 

                   5ms      10ms     20ms      40ms

ar1_NN_voip       4.314    4.493    4.488*     4.488
ar1_NN_audio      4.292    4.485    4.313*     4.313

ar2_NN_voip       4.346    4.442    4.436*     4.474
ar2_NN_audio      4.364    4.460    4.350*     4.350

ar3_NN_voip       3.993    4.375    4.414*     4.390
ar3_NN_audio      3.924    4.327    4.218*     4.218

Same data, interleaved to highlight the fact that the drop is seen for the same sentences, 
from mode=VOIP to mode=AUDIO, at 20ms framesize.  (40ms is the same processing as 20ms, I believe.)

So the question that is implied:
- is there a phenomenon for mode=AUDIO that results in lower scores for 20ms in particular, but not 10ms?

Listening to the processed files (sighted), I have the following subjective opinion:

- Given: sampling rate = 16000,  bitrate = 28000.  (also replicated at 32 kbps)
- the 10ms versions (voip,audio) and the 20ms (voip) version sound "focused" and have high fidelity to the ref.
- the 20ms mode=AUDIO versions sound "hollow", "smeared", "unfocused", especially during unvoiced segments.
- example "china hit" file ar3.pcm, t=0.6s.  Very clear diff between 10ms and 20ms framesize in mode=audio.
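
For anyone who wants to loop just that segment, a minimal sketch for cutting a clip out of one of the raw .pcm files (16kHz, 16-bit, mono, little-endian, as described in the README) and saving it as a .wav is below. The 1.0 s clip length and the file names in the usage comment are placeholders; the message only gives the start time.

```python
import wave

RATE_HZ = 16000          # from the README
BYTES_PER_SAMPLE = 2     # 16-bit signed, mono


def extract_segment(pcm_bytes, start_s, duration_s, rate_hz=RATE_HZ):
    """Return the raw bytes covering [start_s, start_s + duration_s)."""
    start = int(round(start_s * rate_hz)) * BYTES_PER_SAMPLE
    end = start + int(round(duration_s * rate_hz)) * BYTES_PER_SAMPLE
    return pcm_bytes[start:end]


def write_wav(path, pcm_bytes, rate_hz=RATE_HZ):
    """Wrap raw little-endian 16-bit mono PCM in a WAV header."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(BYTES_PER_SAMPLE)
        w.setframerate(rate_hz)
        w.writeframes(pcm_bytes)


def clip_pcm_to_wav(pcm_path, wav_path, start_s, duration_s):
    """Read a raw .pcm file, cut one segment, and save it as .wav."""
    with open(pcm_path, "rb") as f:
        data = f.read()
    write_wav(wav_path, extract_segment(data, start_s, duration_s))


# Hypothetical usage (clip length is a placeholder):
# clip_pcm_to_wav("ar3_20_audio.pcm", "ar3_20_audio_clip.wav", 0.6, 1.0)
```
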

This isn't about pesq scores -- pesq was just the "difference noticed" flag that got me to listen to some files.
I notice this same kind of de-focused sound in the same samples processed using recent opus lib in linux.
I'm not surprised at a delta between mode=voip and mode=audio for a constant framesize.  That's entirely expected.
What I'm curious about is the delta between 10ms and 20ms , for mode=audio.  
