[tremor] [PATCH] significantly reduce Tremor ROM size requirement

Monty xiphmont at xiph.org
Thu Sep 5 10:32:44 PDT 2002



On Thu, Sep 05, 2002 at 12:50:07PM -0400, Nicolas Pitre wrote:

> Well, say 30% vs 27% CPU usage on some ARM board I have here.  I'm expecting
> this difference can be reduced with proper compiler flags though.

Ah.  Depending on the profiling tools, 3% may even be in the range of
statistical error.  It certainly is for gprof.

> > This doesn't solve the size issue, it just pushes it to somewhere
> > else.  It also increases startup time and requires trigonometry
> > approximations I prefer to avoid.
> 
> Why so?  AFAICS the accuracy didn't change.

Approximations, when done well, wouldn't affect output much, no.

> Well if your CPU isn't fast enough to build a couple tables with lookups and 
> interpolation you won't sustain a real time decoding of each audio frames 
> requiring many more cycles.

Fallacy.  Init is already heavyweight; we need to be moving
proverbial straws off the camel's back, not adding more.

> Please get real.  Did you _really_ look at actual numbers?

I'll explain in more detail why the patch doesn't buy anything. 

One reason the static tables are so large is the 4096/8192 block
sizes.  When dynamically allocating only the two sizes you need,
naturally you don't have any extra data lying around temporarily
unused.  However, these block sizes *are part of the spec* and will
occur.  Either you have the space to hold the 4096/8192 tables or you
don't, and it doesn't matter whether that space is static data or on
the heap.

With your patch, if you don't have space for the large tables, you
won't be able to dynamically allocate them and decode will fail.
Without your patch, if the static data segment is too large and the
codec won't fit, decode will fail.  Either way, decode fails.  No
change. 
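
To make the peak-footprint point concrete, a rough sketch follows;
the N/2-entries-of-int sizing is an assumption made up for
illustration, not Tremor's actual table layout.  Whatever the
allocation strategy, the largest block the stream actually uses sets
the memory you must have:

  /* Illustration only: per-block table cost is assumed here to be N/2
   * int-sized entries, which is NOT Tremor's real layout.  The point
   * is that the peak is set by the largest block size in use, whether
   * the table is a static array or a malloc'd buffer. */
  #include <stdio.h>

  int main(void){
    const int blocks[] = {256, 2048, 8192};   /* hypothetical stream */
    size_t peak = 0;
    for (int i = 0; i < 3; i++){
      size_t bytes = (size_t)blocks[i] / 2 * sizeof(int);
      if (bytes > peak) peak = bytes;
    }
    /* ~16kB under these assumptions, dominated entirely by the 8192 block */
    printf("peak table footprint: %lu bytes\n", (unsigned long)peak);
    return 0;
  }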

If static data is really that tight, the largest tables can simply be
left out, rendering Vorbis files that use those tables unable to
decode.  Is it better to know that ahead of time, or to run into it
unexpectedly?  Generally it's good to know ahead of time what you
can, but be able to handle the unexpected event (while minimizing
unexpected events).
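
As a minimal sketch of what "leaving the largest tables out" could
look like (the macro and function names below are hypothetical, not
anything in Tremor today), the decision becomes a predictable setup
error rather than a surprise mid-stream:

  /* Hypothetical build-time cap on supported block sizes: a stream
   * that needs a larger block fails cleanly at setup instead of
   * failing unexpectedly during decode. */
  #ifndef MAX_SUPPORTED_BLOCK
  #define MAX_SUPPORTED_BLOCK 2048          /* hypothetical cap */
  #endif

  static int mdct_tables_available(int n){
    if (n > MAX_SUPPORTED_BLOCK)
      return -1;               /* caller reports "unsupported block size" */
    /* ... select the statically built table for this n ... */
    return 0;
  }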

Thus, in the big picture your patch is not an improvement and does not
affect the global memory requirement.  It does not increase CPU
performance.  Eliminating 32kB of core usage in the common case is
meaningless when you haven't reduced the maximum requirement that
must be accounted for.  Nor is the patch a step toward algorithmic
optimization of the MDCT, which is what this MDCT actually needs.

If you can, say, halve the size of the table for any given block and
demonstrate that the change does not affect S/N ("I'm sure it's OK" is
not good enough), then that is a patch I'm interested in.  However,
the patch that will make me drop what I'm doing instantly is the one
that substantially eliminates the tables without approximations.
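
For the simplest case of what "halve the size of the table" could
mean, here is a generic sketch that exploits symmetry; whether any
given Tremor table actually carries this redundancy, and what the
change does to S/N in fixed point, is exactly what would have to be
demonstrated.  The names and sizes are made up:

  /* Generic illustration: if a table t of length N is symmetric
   * (t[i] == t[N-1-i]), only the first N/2 entries need storage and
   * reads of the upper half mirror into the lower half. */
  #define N 4096
  static const int half_table[N/2] = { 0 /* ...first half elided... */ };

  static int table_read(int i){
    return (i < N/2) ? half_table[i] : half_table[N - 1 - i];
  }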

> > The proper thing to do here is eliminate the need for most of that
> > table entirely, not move it to dynamic allocation.  Currently,
> > Tremor's MDCT table contains some easy-to-eliminate redundancy. Some
> > harder work would eliminate the table almost entirely.
> 
> Well it's easy to keep only one window table and interpolate inside it 
> according to the block size too.

There's some of that, yes, but we can likely eliminate interpolation
as well.  The real fix here is not incremental bit-twiddling.  The
real fix here is improving the math itself.
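
For reference, the single-master-table interpolation Nicolas
describes might look roughly like this; the master length, table
name, and integer arithmetic are assumptions, and the (i+0.5) sample
centering of the real Vorbis window is glossed over, which is part of
why this is an approximation in the first place:

  /* Sketch of reading an n-point window out of one master table by
   * linear interpolation.  MASTER_N and master_window are
   * illustrative assumptions, not Tremor code. */
  #define MASTER_N 4096
  extern const int master_window[MASTER_N];

  static int window_at(int n, int i){
    long num  = (long)i * MASTER_N;   /* position = i * MASTER_N / n  */
    long pos  = num / n;              /* integer sample in the master */
    long frac = num % n;              /* fractional remainder, 0..n-1 */
    int  a = master_window[pos];
    int  b = master_window[pos + 1 < MASTER_N ? pos + 1 : MASTER_N - 1];
    return a + (int)(((long long)(b - a) * frac) / n);
  }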

> But at least for now you could consider this patchlet which shouldn't be 
> controversial:
> 
> diff -urN orig/Tremor/floor0.c Tremor/floor0.c
> --- orig/Tremor/floor0.c	Mon Sep  2 23:15:19 2002
> +++ Tremor/floor0.c	Wed Sep  4 16:32:07 2002
> @@ -117,21 +118,21 @@
>    }
>  }
>  
> -static int MLOOP_1[64]={
> +static unsigned char MLOOP_1[64]={
>     0,10,11,11, 12,12,12,12, 13,13,13,13, 13,13,13,13,
>    14,14,14,14, 14,14,14,14, 14,14,14,14, 14,14,14,14,
>    15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
>    15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
>  };
>  
> -static int MLOOP_2[64]={
> +static unsigned char MLOOP_2[64]={
>    0,4,5,5, 6,6,6,6, 7,7,7,7, 7,7,7,7,
>    8,8,8,8, 8,8,8,8, 8,8,8,8, 8,8,8,8,
>    9,9,9,9, 9,9,9,9, 9,9,9,9, 9,9,9,9,
>    9,9,9,9, 9,9,9,9, 9,9,9,9, 9,9,9,9,
>  };
>  
> -static int MLOOP_3[8]={0,1,2,2,3,3,3,3};
> +static unsigned char MLOOP_3[8]={0,1,2,2,3,3,3,3};

This is static data in the heaviest-weight tight loop in all of
Vorbis; the loop accounts for 50% of CPU usage for beta-1 and beta-2
files.  Going int->char affects GCC-ARM's memory addressing strategy
dramatically; in gcc < 3.0, it generally affects it negatively.  Is
there a reason to do this aside from saving 100 bytes?  Do you have
performance figures from a few processors/compilers to justify it?
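
For context on why the element type matters: the MLOOP tables
implement a cascaded "find the normalization shift" lookup on a
32-bit magnitude.  The sketch below is not the literal Tremor loop
(the function name and masks are added for illustration), but it
shows the shape of the hot path where every table load counts:

  /* Not the literal Tremor code: cascaded lookups over progressively
   * lower bit ranges of a 32-bit magnitude, using the MLOOP_1/2/3
   * tables from the patch above, yield a right-shift that
   * renormalizes the value to 16 bits.  Several such loads per
   * inner-loop iteration is why the element type (and the load
   * instruction the compiler emits for it) is performance-sensitive. */
  static int norm_shift(unsigned long v){
    int shift;
    if ((shift = MLOOP_1[(v >> 25) & 0x3f])) return shift;  /* bits 30..25 */
    if ((shift = MLOOP_2[(v >> 19) & 0x3f])) return shift;  /* bits 24..19 */
    return MLOOP_3[(v >> 16) & 0x7];                        /* bits 18..16 */
  }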

Monty