[Vorbis-dev] Question about blocksizes

Tue Jan 17 09:08:45 PST 2006

On Tue, Jan 17, 2006 at 04:48:47PM +0100, Nico Sabbi wrote:
> can someone explain what's the meaning of the two blocksizes in the first
> header of Vorbis, please?
> So far I assumed that they meant that 2^b0 and 2^b1 were the only two 
> blocksizes used during
> the whole encode, but something makes me believe they are not:
> if b0 and b1 are 0xb8 respectively (that I interpreted as 2^11 = 2048
> and 2^8 = 256) I observe 3 different deltas between each couple of 
> consecutive granulepos:
> 1024, 128 and 576.
> Do I have to understand that b0 and b1 indicate 2^(b0-1) and 2^(b1-1) 
> blocks?
> Are they the min and max values used, rather than the only two?
> If no, where does that 576 stem from? It's not even a power of 2.

When decoding Vorbis, data from adjacent audio frames is overlapped and
summed to produce the final output.  Thus, the blocksize doesn't give
the number of new audio samples ready for output (measured by
granulepos), as some of the samples need to be combined with data from
the next frame.

Section 1.3.2.3 of the Vorbis I specification
(http://www.xiph.org/vorbis/doc/Vorbis_I_spec.html) illustrates this
nicely.

In the simple case of equal blocksizes, the last half of each frame
overlaps with the first half of the next, so the granulepos will advance
by half the blocksize.

When two adjacent blocks are of an unequal size, the situation is more
complicated.  The data that can be output lies between the middle of the
previous frame and the middle of the current frame--if you look at the
figure, when reaching the middle of the current frame, the window
applied to the previous frame drops to 0, so this is the point at which
the previous frame stops contributing to the final output samples.  Data
after the midpoint of the current frame needs to be saved to be
overlapped with the next frame of data, and so cannot yet be output.

Since the 3/4 point of the previous frame is always aligned with the 1/4
point of the current frame, if you actually do the math you'll find
(perhaps after staring at the figure for a bit--it took me a little
while to get it at first) that the number of audio samples that can be
output after decoding a frame is
    previous_blocksize/4 + current_blocksize/4

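In your example, where the blocksizes are 2048 and 256, this gives three
possible amounts of audio data that can be produced:
    long  followed by long:   2048/4 + 2048/4 = 1024
    short followed by short:   256/4 +  256/4 =  128
    long/short or short/long: 2048/4 +  256/4 =  576
So the 576 comes from a transition between the two blocksizes.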
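
If it helps to check the arithmetic, here is a minimal Python sketch of
the formula above.  The function name samples_returned and the variable
names are made up for illustration; they are not taken from libvorbis
or the spec.

```python
def samples_returned(previous_blocksize, current_blocksize):
    """Samples that can be output after decoding a frame,
    using previous_blocksize/4 + current_blocksize/4."""
    return previous_blocksize // 4 + current_blocksize // 4


# Blocksizes from the example in this thread (hypothetical variable names).
short_block, long_block = 256, 2048

# Try every (previous, current) pairing and collect the distinct results.
deltas = sorted({
    samples_returned(prev, cur)
    for prev in (short_block, long_block)
    for cur in (short_block, long_block)
})
print(deltas)  # [128, 576, 1024]
```

Both mixed pairings (long then short, short then long) give the same
result, which is why only three distinct deltas show up.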

--Michael Vrable