[Speex-dev] 2 questions, frame size and SPEEX_GET_LOOKAHEAD

Andras Kadinger bandit at surfnonstop.com
Tue Oct 31 15:40:58 PST 2006

[At the risk of educating you about something you might already know]

Natural speech in most human languages gradually changes from one 
phoneme to the next.

Concatenating phonemes together from a fixed, prerecorded, inflexible 
set would give rise to abrupt changes between them (both in phoneme 
quality and in pitch), and thus make the resulting speech hard to 
understand and/or uncomfortable to listen to.

Most flexible (unlimited-vocabulary) unit (e.g. phoneme) concatenation 
speech synthesizers therefore use some strategy to blend the pieces of 
speech together, usually in both pitch and phoneme quality. One 
conceptually simple and therefore popular approach is 
storing "diphones" - phoneme transitions: e.g. the second half of "a" 
and the first half of "p" from the hypothetical word "apa". Since 
phonemes usually tend to reach their "most recognizable" state in the 
"middle", cutting and splicing them together around that point should 
minimize the amount of discontinuity.
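To make the splicing idea concrete, here is a minimal sketch of joining two units with a short linear crossfade at the joint. It assumes 16-bit mono PCM; the buffer contents, lengths, and fade length are purely illustrative, and real systems typically also do pitch smoothing, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Crossfade the last `fade` samples of unit `a` with the first `fade`
 * samples of unit `b`, writing the joined result to `out`.
 * Precondition: fade <= na and fade <= nb.
 * Returns the number of samples written: na + nb - fade. */
size_t splice(const short *a, size_t na,
              const short *b, size_t nb,
              size_t fade, short *out)
{
    size_t i, n = 0;
    for (i = 0; i < na - fade; i++)        /* unfaded part of a */
        out[n++] = a[i];
    for (i = 0; i < fade; i++) {           /* linear crossfade region */
        double w = (double)(i + 1) / (double)(fade + 1);
        out[n++] = (short)((1.0 - w) * a[na - fade + i] + w * b[i]);
    }
    for (i = fade; i < nb; i++)            /* unfaded part of b */
        out[n++] = b[i];
    return n;
}
```

With a longer fade the discontinuity at the joint gets smaller, at the cost of smearing the transition; diphone systems sidestep much of this by cutting in the stable middle of each phoneme, as described above.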

Obviously, if you concatenate speech from larger units (words, phrases, 
or even sentences) ensuring acoustical continuity becomes less and less 
of an issue, but you specifically mention phonemes.

So unless you want to use Speex to (re)implement unit storage for a 
speech synthesizer that handles these issues, I suggest you take a look 
at the available literature on speech synthesis.

Wikipedia seems to be a reasonable starting point: 
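As for the lookahead question quoted below: as I understand it, the first `lookahead` samples of the decoded stream are warm-up output (near-zero) that you would skip, and the last frame is zero-padded, so recovering exactly n input samples can require encoding one extra frame. A minimal sketch of that bookkeeping follows; it is arithmetic only, with `frame_size` and `lookahead` as plain parameters (in real code they would come from the speex_encoder_ctl calls named in the comment), and the numbers in the usage note are hypothetical.

```c
#include <assert.h>

/* In real code the parameters would be queried from the encoder:
 *   speex_encoder_ctl(st, SPEEX_GET_FRAME_SIZE, &frame_size);
 *   speex_encoder_ctl(st, SPEEX_GET_LOOKAHEAD, &lookahead);
 * Here they are passed in so the arithmetic stands alone. */

/* Frames to encode so that, after skipping the first `lookahead`
 * decoded samples, all n input samples are available.  The codec delay
 * shifts everything by `lookahead`, and the last frame is zero-padded
 * up to a full frame, hence the round-up. */
int frames_needed(int n, int frame_size, int lookahead)
{
    int total = n + lookahead;
    return (total + frame_size - 1) / frame_size;
}

/* Offset of the first sample in the decoded stream that corresponds
 * to the first input sample; everything before it is warm-up output. */
int decode_skip(int lookahead)
{
    return lookahead;
}
```

For example, with the wideband frame size of 320 and a hypothetical lookahead of 140, a single 320-sample unit needs two encoded frames, and after decoding you would keep samples [140, 460) of the 640 decoded samples.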

Jia Pu wrote:
> Ok, let me first explain why 5 ms matters in my particular 
> application, even if the samples are 0's.
> I am working on a speech synthesis system. The basic idea is 
> concatenating pre-recorded phonemes or words into longer sentences. So 
> any missing or extra samples, even as short as 5-10 ms, cause very 
> noticeable discontinuities.
> I want to use speex to compress/decompress that pre-recorded material. 
> But I'm concerned that extra 0's might be padded at both ends.
> For the zero padding at the last frame, I know how to remove it after 
> decoding. But I am a little confused by the look ahead at the beginning. 
> The sample code in the manual doesn't use look ahead, while the 
> speexenc.c does. I'd like to know what difference it makes.
> Let me plug in some numbers. I am using wide-band mode, the frame size is 
> 320 samples. Say I take the first frame of an audio buffer, i.e. the 
> first 320 samples, and feed them into encoder. Then after decompress, do 
> I get all 320 samples, or a portion of 320 samples with some 0 padding 
> at the very beginning?
> Thank you.
> On Oct 31, 2006, at 12:38 PM, Jean-Marc Valin wrote:
>>> In my application, even 5 ms (110 samples at 22 kHz) matters.
>> 1) If 5 ms matters, I don't recommend Ogg (and I definitely hope you're
>> not running Windows!)
>> 2) 22 kHz is *not* recommended. Use 16 kHz instead
>>> So what
>>> should I do to avoid discarding samples at the beginning?
>> Why are they so precious? They're *zeros* (or nearly).
>>> 1. Turning off look ahead?
>>> 2. Padding 0's at the beginning.
>> Or you can always just play them if they're so precious. Ah, the sound
>> of 5ms worth of zeros...
>>     Jean-Marc
> _______________________________________________
> Speex-dev mailing list
> Speex-dev at xiph.org
> http://lists.xiph.org/mailman/listinfo/speex-dev
