[Flac-dev] A couple of points about flac 1.1.1 on ppc/linux/altivec
Chris Csanady
cc at 137.org
Sat Jan 29 19:17:06 PST 2005
I originally did some altivec assembly, but it seems C altivec can be
nearly optimal using carefully constructed loops, and the occasional gcc
extension (labels as values). Considering the various ABI issues, VRsave,
and gratuitous gnu/apple differences, I have since re-implemented
everything in C.
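For readers who have not met the "labels as values" extension: the sketch
below is a generic, portable illustration of it and is not taken from the
AltiVec code. It assumes GCC or a compatible compiler such as Clang; the
function name sum_ints is made up for this example. A table of label
addresses lets the code jump straight into the middle of an unrolled tail,
which is the kind of control flow that is otherwise awkward to express in C.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n ints. The leftover n % 4 elements are handled by jumping into
 * an unrolled tail through a table of label addresses.
 * `&&label` and `goto *ptr` are GCC extensions, not standard C. */
static int sum_ints(const int *v, size_t n)
{
    static void *const tail[4] = { &&rem0, &&rem1, &&rem2, &&rem3 };
    int sum = 0;
    size_t i;

    assert(v != NULL || n == 0);

    /* Main loop: four elements per iteration. */
    for (i = 0; i + 4 <= n; i += 4)
        sum += v[i] + v[i + 1] + v[i + 2] + v[i + 3];

    /* Tail: dispatch on the number of elements left (0..3). */
    v += i;
    goto *tail[n - i];
rem3: sum += v[2];
rem2: sum += v[1];
rem1: sum += v[0];
rem0:
    return sum;
}
```

A switch statement with fall-through would do the same job here; the label
table only pays off when the dispatch sits on a hot path.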
For comparison, I'm appending a 16-bit C restore function. The setup and
the handling of unaligned data are not pretty, but the core algorithms are
(I hope) somewhat clear and reasonably small. I have not done much
comparison between the C and asm functions, but I believe that the C
function is very nearly optimal in most cases.
Chris
On 2005/01/29, at 8:40, Brady Patterson wrote:
>
> On Thu, 27 Jan 2005, John Steele Scott wrote:
>> That looks fine to me as well. However, the best solution is something
>> which Luca suggested a few months ago, which is to use the functions
>> defined in altivec.h. These are C functions which map directly to Altivec
>> machine instructions. I am willing to help out, but I don't find the
>> current lpc_asm.s very easy to follow, and my time is quite limited (my
>> last patch to a free software project took almost three months to get
>> into decent shape!).
>
> Is this still my code? IIRC I commented it extensively, but the structure
> is certainly non-intuitive.
>
> I'll take a look at it. At the time, I thought I wanted control logic that
> was impossible in C, but that may not be the case. It didn't occur to me
> that Linux and Apple would use different assemblers; elsewhere Apple uses
> the GNU tools. I'm also a bit surprised that people are using flac on an
> Altivecful Linux/PPC system (but I did attempt for such a system to fall
> back to the non-altivec C code). End digression.
>
> Can you point me to a good reference on altivec.h?
>
> --
> Brady Patterson (brady at spaceship.com)
> RLRR LRLL RLLR LRRL RRLR LLRL
>
> _______________________________________________
> Flac-dev mailing list
> Flac-dev at xiph.org
> http://lists.xiph.org/mailman/listinfo/flac-dev
void FLAC__lpc_restore_signal_16bit_altivec(const FLAC__int32 residual[],
    unsigned data_len, const FLAC__int32 qlp_coeff[], unsigned order,
    int lp_quantization, FLAC__int32 data[])
{
    int i, j, *r, *end = (int *)residual + data_len, FLAC__align16 qc[16];
    intptr_t do0;
    vu8 p;
    vs16 qF8, q70, hF8, h70, t;
    vs32 r03, s, zero = vec_splat_s32(0);
    vu32 lpq;

    FLAC__ASSERT(order > 0);
    FLAC__ASSERT(VecRelAligned(data, residual));

    if (order < 2 || order > 16) {
        FLAC__lpc_restore_signal(residual, data_len, qlp_coeff, order,
            lp_quantization, data);
        return;
    }

    /* Load lp_quantization into all elements of lpq */
    VecLoad4(lpq, (unsigned int *)&lp_quantization);

    /* qc[] = qlp_coeff[] reversed, aligned, and padded with enough
     * zeros to complete the vector.
     */
    j = order; i = 16; r = (int *)qlp_coeff;
    do {
        qc[--i] = *(r++);
    } while (--j);
    while (i & 3)
        qc[--i] = 0;

    /* This switch loads the necessary qlp coefficients and data history
     * into the q* and h* vectors. They are arranged like so:
     *   qF8 = qlp[15] .. qlp[8],     q70 = qlp[7] .. qlp[0]
     *   hF8 = data[-16] .. data[-9], h70 = data[-8] .. data[-1]
     * Loading the data is complicated by the fact that it may not be
     * vector aligned. First, the loads are implicitly rounded down one
     * vector. Then, the packed vectors need to be shifted so that the
     * actual data is aligned at the right. That is the purpose of p here.
     */
    p = vec_lvsr(0, (short *)((-(intptr_t)data & 15) >> 1));
    r03 = s = zero;
    switch ((order + 3) & ~3) {
    case 16:
        r03 = vec_ld(0, qc);
        s = vec_ld(-49, data);
        /* fall through */
    case 12:
        qF8 = vec_pack(r03, vec_ld(16, qc));
        t = vec_pack( s, vec_ld(-33, data));
        hF8 = vec_perm( t, t, p);
        /* fall through */
    case 8:
        r03 = vec_ld(32, qc);
        s = vec_ld(-17, data);
        /* fall through */
    case 4:
        q70 = vec_pack(r03, vec_ld(48, qc));
        h70 = vec_pack( s, vec_ld(-1, data));
        h70 = vec_perm( t, h70, p);
    }

    /* p is used to shift the history vector to the left one element, and
     * to insert the recently calculated data element s. Keep in mind,
     * restore*() only computes one data element at a time: the vec_sums()
     * leaves the sum in the high word, and the remaining calculation of s
     * is entirely serial.
     */
    p = (vu8)AVV( 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,30,31);
    do0 = (intptr_t)data - (intptr_t)residual - 16; /* -16 for preincrement */
    r = (int *)residual;
    r03 = vec_ld(0, residual);

    if (order > 8) {
#define restore16(r) \
        s = vec_sums(vec_msum(q70, h70, vec_msum(qF8, hF8, zero)), zero); \
        s = vec_add(r, vec_sra(s, lpq)); \
        hF8 = vec_sld(hF8, h70, 2); h70 = vec_perm(h70, (vs16)s, p);

        do {
            restore16(vec_perm(r03, r03, vec_lvsl(0, ++r)));
        } while (!VecAligned(r));
        vec_st(vec_unpackl(h70), 0, data);

        while (r < end) {
            r03 = vec_ld(0, r);
            r += 4;
            restore16(vec_splat(r03, 0));
            restore16(vec_splat(r03, 1));
            restore16(vec_splat(r03, 2));
            restore16(vec_splat(r03, 3));
            vec_st(vec_unpackl(h70), do0, r);
        }
#undef restore16
    } else {
#define restore8(r) \
        s = vec_sums(vec_msum(q70, h70, zero), zero); \
        s = vec_add(r, vec_sra(s, lpq)); \
        h70 = vec_perm(h70, (vs16)s, p);

        do {
            restore8(vec_perm(r03, r03, vec_lvsl(0, ++r)));
        } while (!VecAligned(r));
        vec_st(vec_unpackl(h70), 0, data);

        while (r < end) {
            r03 = vec_ld(0, r);
            r += 4;
            restore8(vec_splat(r03, 0));
            restore8(vec_splat(r03, 1));
            restore8(vec_splat(r03, 2));
            restore8(vec_splat(r03, 3));
            vec_st(vec_unpackl(h70), do0, r);
        }
#undef restore8
    }
}
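
As a reading aid for the function above, here is a portable scalar sketch
of the same arithmetic. It is not the posted code and uses no AltiVec: the
names restore_plain and restore_windowed are invented for this example.
restore_plain is the usual prediction loop; restore_windowed mimics the
layout described in the comments (coefficients reversed and zero-padded,
history kept with data[-1] at the right, shifted left by one element after
each sample). As simplifications, the sketch always uses one 16-element
window rather than the 8/16 split, and it assumes coefficients and samples
fit in 16 bits and that >> on a negative int is an arithmetic shift, as it
is with GCC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Plain form: data[i] = residual[i] + (sum_j qlp[j] * data[i-j-1] >> shift).
 * The caller must provide `order` valid history samples before data[0]. */
static void restore_plain(const int32_t *residual, unsigned len,
                          const int32_t *qlp, unsigned order,
                          int shift, int32_t *data)
{
    for (unsigned i = 0; i < len; i++) {
        int32_t sum = 0;
        for (unsigned j = 0; j < order; j++)
            sum += qlp[j] * data[(int)i - (int)j - 1];
        data[i] = residual[i] + (sum >> shift);
    }
}

/* Windowed form: q[] holds the coefficients reversed (q[15] = qlp[0]) and
 * padded with zeros; h[] holds the history with h[15] = data[-1]. */
static void restore_windowed(const int32_t *residual, unsigned len,
                             const int32_t *qlp, unsigned order,
                             int shift, int32_t *data)
{
    int16_t q[16] = {0}, h[16] = {0};

    assert(order >= 1 && order <= 16);
    for (unsigned j = 0; j < order; j++) {
        q[15 - j] = (int16_t)qlp[j];
        h[15 - j] = (int16_t)data[-1 - (int)j];
    }

    for (unsigned i = 0; i < len; i++) {
        int32_t sum = 0;
        /* One multiply-sum over the whole window; the zero padding in
         * q[] cancels the unused history slots. */
        for (unsigned k = 0; k < 16; k++)
            sum += (int32_t)q[k] * h[k];
        data[i] = residual[i] + (sum >> shift);

        /* Shift the history left by one element and append the new
         * sample on the right. */
        memmove(h, h + 1, 15 * sizeof h[0]);
        h[15] = (int16_t)data[i];
    }
}
```

Each output sample depends on the one just computed, so only the
multiply-sum can be done in parallel; the shift, add, and history update
stay serial, as the comment in the posted function notes.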