[flac-dev] Autocorrelation precision insufficient

Thu Jun 24 07:17:25 UTC 2021

Hi all,

Recently I've been investigating various ways to improve FLAC
compression, and now I've stumbled upon quite a small change with
large implications.

Flake, an alternative compressor using the FLAC format, has always
provided better compression than FLAC. I've found out why: Flake uses
doubles (64-bit floating point) for calculating autocorrelation
values, while FLAC uses regular floats (32-bit floating point). The
main obstacle to making the same change in FLAC is that the intrinsics
routines (for SSE and VSX) have to be rewritten. I've done quite a bit
of testing and comparing; see the following two PDFs.

http://www.audiograaf.nl/misc_stuff/double-autoc-with-sse2-intrinsics.pdf
http://www.audiograaf.nl/misc_stuff/double-autoc-with-sse2-intrinsics-per-track.pdf

There are four lines, all going from setting -4 as the rightmost
(fastest) through -5, -6, -7 to -8 as the leftmost (slowest).
- darkblue line is current git
- green line is current git but with SSE intrinsics for
autocorrelation calculation disabled
- lightblue line is calculating autocorrelation in doubles instead of floats
- red line is calculating autocorrelation in doubles but with new SSE2
intrinsics routines

As you can see in the PDFs, the overall gain for setting -4 is large
(0.3 percentage points, or 0.5% relative) with minimal slowdown. The
gain shrinks and the slowdown grows as the setting increases. The
-per-track PDF shows that the gain is highly dependent on the kind of
audio that is being compressed. Tracks with strong tonal components,
like piano music (14 and 15), benefit the most. Orchestral music (2, 6,
10 and 9) and electronic music (4 and 13) benefit to varying degrees.
Music with much more noisy content, like metal (3, 5 and 12), sees
(almost) no benefit. However, in the tracks that do benefit, gains can
be large. Track 15, which is piano music, sees a gain of 2.2 percentage
points (5% relative) for setting -4 and 1 percentage point (2%
relative) for -8.
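
To illustrate how the accumulator width alone can change the
autocorrelation values, here is a minimal C sketch. It is not the
actual libFLAC or Flake code: the function names autoc_float and
autoc_double are made up for this example, windowing is left out, and
it assumes ordinary IEEE 754 float arithmetic without extended
intermediate precision.

```c
#include <stddef.h>

/* Hypothetical helper (not a libFLAC function): autocorrelation of
 * data[0..len-1] for lags 0..max_lag, accumulated in 32-bit floats. */
void autoc_float(const float *data, size_t len, unsigned max_lag,
                 float *autoc)
{
    for (unsigned lag = 0; lag <= max_lag; lag++) {
        float sum = 0.0f;
        for (size_t i = lag; i < len; i++)
            sum += data[i] * data[i - lag];
        autoc[lag] = sum;
    }
}

/* Same computation, but every product and the running sum are kept
 * in 64-bit doubles. The input samples are still floats. */
void autoc_double(const float *data, size_t len, unsigned max_lag,
                  double *autoc)
{
    for (unsigned lag = 0; lag <= max_lag; lag++) {
        double sum = 0.0;
        for (size_t i = lag; i < len; i++)
            sum += (double)data[i] * (double)data[i - lag];
        autoc[lag] = sum;
    }
}
```

With the five samples {4096, 1, 1, 1, 1}, the lag-0 sum starts at
4096*4096 = 16777216 (2^24). In float, each following +1 is rounded
away, so autoc_float returns 16777216 for lag 0, while autoc_double
returns 16777220. Both agree on lag 1 (4099), because that sum still
fits in a float exactly.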

Code is here: https://github.com/ktmf01/flac/tree/autoc-sse2 Before I
send a pull request, I'd like to discuss a choice that has to be made.
I see a few options:
- Don't switch to autoc[] as doubles, keep current speed and ignore
possible compression gain
- Switch to autoc[] as doubles, but keep current intrinsics routines.
This means some platforms (with only SSE but not SSE2 or with VSX)
will get less compression, but won't see a large slowdown.
- Switch to autoc[] as doubles, but remove current SSE and disable VSX
intrinsics for someone to update them later (I don't have any POWER8
or POWER9 hardware to test). This means all platforms will get the
same compression, but some (with only SSE but not SSE2 or with VSX)
will see a large slowdown.

Thanks in advance for your replies and comments on this.

Kind regards,

Martijn van Beurden