[Flac-dev] A couple of points about flac 1.1.1 on ppc/linux/altivec

Sat Jan 29 19:17:06 PST 2005

I originally did some altivec assembly, but it seems C altivec can be
nearly optimal using carefully constructed loops, and the occasional gcc
extension (labels as values).  Considering the various ABI issues,
VRsave, and gratuitous gnu/apple differences, I have since re-implemented
everything in C.
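
For readers who have not met the "labels as values" extension: the function posted below does not use it, so the following is only a minimal sketch of the general technique, not code from the flac patch. The function name `sum_computed_goto` and the unrolled-sum task are assumptions chosen for illustration. `&&label` and `goto *ptr` are GCC extensions (also accepted by clang), not standard C.

```c
#include <assert.h>

/* Sum n ints with a 4x-unrolled loop.  A table of label addresses lets
 * the first pass enter the loop body at the point that handles n % 4
 * leftover elements, so no separate remainder loop is needed. */
static int sum_computed_goto(const int *v, int n)
{
    /* GCC extension: &&label yields the address of a label. */
    static void *const entry[4] = { &&rem0, &&rem1, &&rem2, &&rem3 };
    const int *end = v + n;
    int sum = 0;

    assert(n >= 0);
    goto *entry[n & 3];          /* GCC extension: computed goto */
    for (;;) {
rem0:   if (v == end)
            break;
        sum += *v++;
rem3:   sum += *v++;
rem2:   sum += *v++;
rem1:   sum += *v++;
    }
    return sum;
}
```

The same dispatch could be written as a `switch` in portable C; the extension mainly gives the programmer direct control over the jump table.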

For comparison, I'm appending a 16 bit C restore function; though the
setup and unaligned-access logic are not pretty, the core algorithms are
(I hope) somewhat clear and reasonably small.  I have not done much
comparison between the C and asm functions, but I believe that the C
function is very nearly optimal in most cases.

Chris

On 2005/01/29, at 8:40, Brady Patterson wrote:

>
> On Thu, 27 Jan 2005, John Steele Scott wrote:
>> That looks fine to me as well. However, the best solution is something
>> which Luca suggested a few months ago, which is to use the functions
>> defined in altivec.h. These are C functions which map directly to
>> Altivec machine instructions. I am willing to help out, but I don't
>> find the current lpc_asm.s very easy to follow, and my time is quite
>> limited (my last patch to a free software project took almost three
>> months to get into decent shape!).
>
> Is this still my code? IIRC I commented it extensively, but the
> structure is certainly non-intuitive.
>
> I'll take a look at it. At the time, I thought I wanted control logic
> that was impossible in C, but that may not be the case. It didn't occur
> to me that Linux and Apple would use different assemblers; elsewhere
> Apple uses the GNU tools. I'm also a bit surprised that people are using
> flac on an Altivecful Linux/PPC system (but I did attempt for such a
> system to fall back to the non-altivec C code). End digression.
>
> Can you point me to a good reference on altivec.h?
>
> --
> Brady Patterson (brady at spaceship.com)
> RLRR LRLL RLLR LRRL RRLR LLRL

void FLAC__lpc_restore_signal_16bit_altivec(const FLAC__int32 residual[],
        unsigned data_len, const FLAC__int32 qlp_coeff[], unsigned order,
        int lp_quantization, FLAC__int32 data[])
{
     int i, j, *r, *end = (int *)residual + data_len, FLAC__align16 qc[16];
     intptr_t do0;
     vu8 p;
     vs16 qF8, q70, hF8, h70, t;
     vs32 r03, s, zero = vec_splat_s32(0);
     vu32 lpq;

     FLAC__ASSERT(order > 0);
     FLAC__ASSERT(VecRelAligned(data, residual));

     if (order < 2 || order > 16) {
         FLAC__lpc_restore_signal(residual, data_len, qlp_coeff, order,
                 lp_quantization, data);
         return;
     }

     /* Load lp_quantization into all elements of lpq
      */
     VecLoad4(lpq, (unsigned int *)&lp_quantization);

     /* qc[] = qlp_coeff[] reversed, aligned, and padded with enough
      * zeros to complete the vector.
      */
     j = order; i = 16; r = (int *)qlp_coeff;
     do {
         qc[--i] = *(r++);
     } while (--j);
     while (i & 3)
         qc[--i] = 0;

     /* This switch loads the necessary qlp coefficients and data history
      * into the q* and h* vectors.  They are arranged like so:
      *     qF8 = qlp[15] - qlp[8],     q70 = qlp[7] - qlp[0]
      *     hF8 = data[-16] - data[-9], h70 = data[-8] - data[-1]
      * Loading the data is complicated by the fact that it may not be
      * vector aligned.  First, the loads are implicitly rounded down to a
      * 16-byte boundary.  Then, the packed vectors need to be shifted so
      * that the actual data is aligned at the right.  That is the purpose
      * of p here.
      */
     p = vec_lvsr(0, (short *)((-(intptr_t)data & 15) >> 1));
     r03 = s = zero;
     switch ((order + 3) & ~3) {
     case 16:
         r03 = vec_ld(0, qc);
         s   = vec_ld(-49, data);
     case 12:
         qF8 = vec_pack(r03, vec_ld(16, qc));
         t   = vec_pack(  s, vec_ld(-33, data));
         hF8 = vec_perm(  t, t, p);
     case 8:
         r03 = vec_ld(32, qc);
         s   = vec_ld(-17, data);
     case 4:
         q70 = vec_pack(r03, vec_ld(48, qc));
         h70 = vec_pack(  s, vec_ld(-1, data));
         h70 = vec_perm(  t, h70, p);
     }

     /* p is used to shift the history vector to the left one element, and
      * to insert the recently calculated data element s.  Keep in mind,
      * restore*() only computes one data element at a time: the vec_sums()
      * leaves the sum in the high word, and the remaining calculation of s
      * is entirely serial.
      */
     p = (vu8)AVV( 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,30,31);

     /* -16 for preincrement */
     do0 = (intptr_t)data - (intptr_t)residual - 16;
     r = (int *)residual;
     r03 = vec_ld(0, residual);

     if (order > 8) {
#define restore16(r)                                                       \
         s = vec_sums(vec_msum(q70, h70, vec_msum(qF8, hF8, zero)), zero); \
         s = vec_add(r, vec_sra(s, lpq));                                  \
         hF8 = vec_sld(hF8, h70, 2); h70 = vec_perm(h70, (vs16)s, p);
         do {
             restore16(vec_perm(r03, r03, vec_lvsl(0, ++r)));
         } while (!VecAligned(r));
         vec_st(vec_unpackl(h70), 0, data);
         while (r < end) {
             r03 = vec_ld(0, r);
             r += 4;
             restore16(vec_splat(r03, 0));
             restore16(vec_splat(r03, 1));
             restore16(vec_splat(r03, 2));
             restore16(vec_splat(r03, 3));
             vec_st(vec_unpackl(h70), do0, r);
         }
#undef restore16
     } else {
#define restore8(r)                                                        \
         s = vec_sums(vec_msum(q70, h70, zero), zero);                     \
         s = vec_add(r, vec_sra(s, lpq));                                  \
         h70 = vec_perm(h70, (vs16)s, p);
         do {
             restore8(vec_perm(r03, r03, vec_lvsl(0, ++r)));
         } while (!VecAligned(r));
         vec_st(vec_unpackl(h70), 0, data);
         while (r < end) {
             r03 = vec_ld(0, r);
             r += 4;
             restore8(vec_splat(r03, 0));
             restore8(vec_splat(r03, 1));
             restore8(vec_splat(r03, 2));
             restore8(vec_splat(r03, 3));
             vec_st(vec_unpackl(h70), do0, r);
         }
#undef restore8
     }
}
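
As a reading aid for the function above, here is a portable scalar sketch of the same computation. It is not part of the posted code. The helper names `restore_scalar` and `restore_model16` are hypothetical, and the model assumes every sample and coefficient fits in 16 bits and that `>>` on a negative `int` is an arithmetic shift. `restore_scalar` is the plain recurrence; `restore_model16` mimics the data layout described in the comments (coefficients reversed and zero-padded, history shifted left one lane per sample), with two 16-element arrays standing in for the qF8:q70 and hF8:h70 register pairs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Plain scalar restore: each sample is its residual plus the quantized
 * prediction from the previous `order` samples.  data[-order..-1] must
 * already hold the warm-up samples. */
static void restore_scalar(const int32_t *residual, int data_len,
                           const int32_t *qlp_coeff, int order,
                           int lp_quantization, int32_t *data)
{
    for (int i = 0; i < data_len; i++) {
        int32_t sum = 0;
        for (int j = 0; j < order; j++)
            sum += qlp_coeff[j] * data[i - j - 1];
        data[i] = residual[i] + (sum >> lp_quantization);
    }
}

/* Scalar model of the vector loop: 16 lanes of 16 bits each. */
static void restore_model16(const int32_t *residual, int data_len,
                            const int32_t *qlp_coeff, int order,
                            int lp_quantization, int32_t *data)
{
    int16_t q[16] = {0}, h[16] = {0};

    assert(order >= 1 && order <= 16);
    /* Reverse the coefficients so that qlp_coeff[0] lines up with the
     * newest sample in the last lane; unused lanes stay zero. */
    for (int j = 0; j < order; j++) {
        q[15 - j] = (int16_t)qlp_coeff[j];
        h[15 - j] = (int16_t)data[-1 - j];
    }
    for (int i = 0; i < data_len; i++) {
        int32_t sum = 0;
        for (int k = 0; k < 16; k++)          /* role of vec_msum + vec_sums */
            sum += (int32_t)q[k] * h[k];
        int32_t s = residual[i] + (sum >> lp_quantization);
        memmove(h, h + 1, 15 * sizeof h[0]);  /* shift history left one lane */
        h[15] = (int16_t)s;                   /* insert the new sample */
        data[i] = h[15];                      /* store the 16-bit lane, widened */
    }
}
```

For inputs within the 16-bit range the two helpers should produce identical output, which makes the second one a convenient cross-check when experimenting with the vector version.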