[Flac-dev] More Altivec/PPC Stuff...

Sat Oct 30 06:40:33 PDT 2004

Sorry that it has been a while since the last altivec patch.  I have
noticed something interesting, and so it remains unfinished...

On the ppc, even with the altivec optimizations, almost a quarter of
the time is spent in FLAC__stream_encoder_process().  I finally
discovered that it is because of all the integer to float conversions.
Aside from being exceptionally slow on the G4, they will cause a ton of
load/store rejects on the 970, making matters even worse there.

Since the single precision float conversion is much more efficient in
altivec, I have hacked the FLAC__lpc_compute_autocorrelation_altivec()
function to take the integer signal as input, so the real (float) signal
is never computed at all.  Is this ok?  It doesn't seem to affect
anything else, though I admit it is ugly...

Anyways, the overall improvement is about 5% at -8, and 15% at
defaults.  In both cases, with this hack, the altivec version is now
about 45% faster.

What's left of a default encode is shown below. :)  It seems that most
of the remaining time is consumed by the rice coding...

	25.7%	25.7%	flac	FLAC__bitbuffer_write_raw_uint32
	11.0%	11.0%	flac	FLAC__bitbuffer_write_rice_signed
	10.8%	10.8%	flac	FLAC__MD5Accumulate	
	6.9%	6.9%	flac	set_partitioned_rice_	
	6.9%	6.9%	flac	FLAC__stream_encoder_process
	6.8%	6.8%	flac	find_best_partition_order_	
	5.6%	5.6%	flac	FLAC__MD5Transform	
	4.8%	4.8%	flac	FLAC__fixed_compute_best_predictor_altivec
	4.2%	4.2%	flac	format_input	
	2.9%	2.9%	flac	FLAC__lpc_compute_autocorrelation_altivec
	2.0%	2.0%	flac	FLAC__fixed_compute_residual_altivec
	1.9%	1.9%	flac	FLAC__crc16	
	1.8%	1.8%	mach_kernel	ml_set_interrupts_enabled
	1.5%	1.5%	flac	FLAC__lpc_compute_residual_from_qlp_coefficients_altivec
	1.2%	1.2%	flac	FLAC__lpc_compute_residual_from_qlp_coefficients_16bit_altivec

For fun, I wrote a fast signed rice implementation, though I have yet
to adapt it to the bitbuffer.
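
For readers unfamiliar with signed rice coding, the sketch below shows
the basic operation in plain C.  It is not the fast implementation
mentioned above, and the bit_writer type and every function name here
are hypothetical stand-ins rather than the FLAC bitbuffer API.  It
assumes the usual layout: the signed value is folded to unsigned, the
quotient is written in unary as zero bits terminated by a one, and the
low k bits follow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal MSB-first bit writer; not FLAC's bitbuffer. */
typedef struct {
    uint8_t *buf;    /* output buffer, must be zero-initialised */
    size_t   cap;    /* capacity of buf in bytes */
    size_t   bitpos; /* number of bits written so far */
} bit_writer;

static void put_bit(bit_writer *w, unsigned bit)
{
    assert(w->bitpos / 8 < w->cap);
    if (bit)
        w->buf[w->bitpos / 8] |= (uint8_t)(0x80u >> (w->bitpos % 8));
    w->bitpos++;
}

/* Fold signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4. */
static uint32_t fold_signed(int32_t v)
{
    return ((uint32_t)v << 1) ^ (v < 0 ? 0xFFFFFFFFu : 0u);
}

/* Write one value with rice parameter k: unary quotient (q zero bits
 * and a terminating one), then the k low-order bits, MSB first. */
static void write_rice_signed(bit_writer *w, int32_t v, unsigned k)
{
    assert(k < 32);
    uint32_t u = fold_signed(v);
    uint32_t q = u >> k;

    for (uint32_t i = 0; i < q; i++)
        put_bit(w, 0);
    put_bit(w, 1);
    for (unsigned b = k; b-- > 0; )
        put_bit(w, (u >> b) & 1u);
}
```

Writing one bit at a time like this is exactly the slow path; a fast
version would assemble the whole codeword in a register and emit it
in as few stores as possible.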

Also, for those interested, I came across a very nice arithmetic
coding implementation at:

	http://www.cipr.rpi.edu/~said/FastAC.html

With a very crude adaptive model, it comes fairly close to the
partitioned rice scheme, though I'm betting it would be considerably
faster, and a lot simpler.  Perhaps it is worth some more
investigation; it really is elegant compared to the others I've seen.
(Unfortunately, it is written in the hideous language that is C++, but
thankfully uses a fairly reasonable subset of it.)

Chris