[tremor] [PATCH] significantly reduce Tremor ROM size requirement

Thu Sep 5 12:17:45 PDT 2002

On Thu, 5 Sep 2002, Monty wrote:

> If static data is really that tight, the largest tables can simply be
> left out, rendering Vorbis files that use those tables unable to
> decode.  Is it best to know that ahead of time, or to run into it
> unexpectedly?  Generally it's good to know ahead of time what you can,
> but be able to handle the unexpected event (while minimizing
> unexpected events).

Fair enough.

> > But at least for now you could consider this patchlet which shouldn't be 
> > controversial:
> > 
> > diff -urN orig/Tremor/floor0.c Tremor/floor0.c
> > --- orig/Tremor/floor0.c	Mon Sep  2 23:15:19 2002
> > +++ Tremor/floor0.c	Wed Sep  4 16:32:07 2002
> > @@ -117,21 +118,21 @@
> >    }
> >  }
> >  
> > -static int MLOOP_1[64]={
> > +static unsigned char MLOOP_1[64]={
> >     0,10,11,11, 12,12,12,12, 13,13,13,13, 13,13,13,13,
> >    14,14,14,14, 14,14,14,14, 14,14,14,14, 14,14,14,14,
> >    15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
> >    15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
> >  };
> >  
> > -static int MLOOP_2[64]={
> > +static unsigned char MLOOP_2[64]={
> >    0,4,5,5, 6,6,6,6, 7,7,7,7, 7,7,7,7,
> >    8,8,8,8, 8,8,8,8, 8,8,8,8, 8,8,8,8,
> >    9,9,9,9, 9,9,9,9, 9,9,9,9, 9,9,9,9,
> >    9,9,9,9, 9,9,9,9, 9,9,9,9, 9,9,9,9,
> >  };
> >  
> > -static int MLOOP_3[8]={0,1,2,2,3,3,3,3};
> > +static unsigned char MLOOP_3[8]={0,1,2,2,3,3,3,3};
> 
> This is static data in the heaviest-weight tight loop in all of
> Vorbis; the loop accounts for 50% of CPU usage for beta-1 and beta-2
> files.  Going int->char affects GCC-ARM's memory addressing strategy
> dramatically; in gcc < 3.0, it generally affects it negatively.  Is
> there a reason to do this aside from saving 100 bytes?  Do you have
> performance figures from a few processors/compilers to justify it?

Here's a snapshot of the generated assembly difference on gcc-2.95.3:
(-) lines with int
(+) lines with unsigned char

 .L208:
        orr     r3, r4, lr
        ldr     r0, .L239+16
-       mov     r2, r3, lsr #25
-       ldr     ip, [r0, r2, asl #2]
+       ldrb    ip, [r0, r3, lsr #25]   @ zero_extendqisi2
        cmp     ip, #0
        bne     .L214
        ldr     r1, .L239+20
-       mov     r2, r3, lsr #19
-       ldr     ip, [r1, r2, asl #2]
+       ldrb    ip, [r1, r3, lsr #19]   @ zero_extendqisi2
        cmp     ip, #0
-       bne     .L214
-       mov     r2, r3, lsr #16
-       ldr     r3, .L239+20
-       ldr     ip, [r3, r2, asl #2]
+       ldreq   r2, .L239+12
+       ldreqb  ip, [r2, r3, lsr #16]   @ zero_extendqisi2
 .L214:
        ldr     r3, [fp, #-76]
        cmp     r3, #0
        beq     .L216
        [...]

GCC-ARM's memory addressing strategy is certainly affected, but
positively in my opinion.  With a byte array the shifted index is used
directly as the load offset, while the int array needs a separate mov
before the index is scaled by 4.  Not only does gcc emit 4 fewer
instructions in this particular case, but you also get much better cache
usage with the byte array.  And if I remember correctly, an ldrb has the
same cycle count as an ldr.  I just checked with gcc-3.2 and the same
pattern exists there.  It's almost always easier to index a byte array
than an array of any larger element type.  In what case did you observe
a negative impact?
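
For illustration, here is a minimal C sketch of the kind of chained table
lookup that would produce the three loads in the assembly above (shifts of
25, 19 and 16).  The table contents are taken from the patch; the helper
names mloop_shift, tables_bytes_as_char and tables_bytes_as_int, the const
qualifier, and the assumption that the input stays below 2^31 are not from
the original source and are only there to make the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tables from the patch, in their byte-sized form.
   The const qualifier is an addition of this sketch. */
static const unsigned char MLOOP_1[64] = {
   0,10,11,11, 12,12,12,12, 13,13,13,13, 13,13,13,13,
  14,14,14,14, 14,14,14,14, 14,14,14,14, 14,14,14,14,
  15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
  15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
};

static const unsigned char MLOOP_2[64] = {
  0,4,5,5, 6,6,6,6, 7,7,7,7, 7,7,7,7,
  8,8,8,8, 8,8,8,8, 8,8,8,8, 8,8,8,8,
  9,9,9,9, 9,9,9,9, 9,9,9,9, 9,9,9,9,
  9,9,9,9, 9,9,9,9, 9,9,9,9, 9,9,9,9,
};

static const unsigned char MLOOP_3[8] = {0,1,2,2,3,3,3,3};

/* Hypothetical helper: tries the coarsest table first and falls through
   to the finer ones only when the previous lookup returned 0.
   Assumes v < 2^31 so that v >> 25 stays inside the 64-entry table. */
static unsigned mloop_shift(uint32_t v)
{
  unsigned shift;

  assert(v < (UINT32_C(1) << 31));

  shift = MLOOP_1[v >> 25];
  if (shift == 0) {
    /* Only index 0 of MLOOP_1 is zero, so here v < 2^25. */
    shift = MLOOP_2[v >> 19];
    if (shift == 0) {
      /* Only index 0 of MLOOP_2 is zero, so here v < 2^19. */
      shift = MLOOP_3[v >> 16];
    }
  }
  return shift;
}

/* Static data occupied by the three tables as unsigned char. */
static size_t tables_bytes_as_char(void)
{
  return sizeof MLOOP_1 + sizeof MLOOP_2 + sizeof MLOOP_3;
}

/* Static data the same 136 entries would occupy as int. */
static size_t tables_bytes_as_int(void)
{
  return (64 + 64 + 8) * sizeof(int);
}
```

With a 4-byte int, the int version of the tables takes four times the
static data of the byte version, which is also where the cache-usage
argument comes from.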

Nicolas
