[vorbis-dev] faster mdct's

Tue Jun 3 13:12:55 PDT 2003

Michael Smith wrote:
> We can't use any of your code as it stands anyway (at least the one
> you attached, I haven't checked FFTW generally) because it's GPLed.
> Obvioulsy relicensing FFTW to xiph.org's bsd-like license isn't an
> option, but I'm not sure about what the licensing status of the
> generated code (as, for example, the code you gave a link to:
> jdj.mit.edu/~stevenj/mdct_128nr.c) is. Would it be usable for us?

>From MIT's perspective, the generated code ("codelets") is not
copyrighted.  (One can put a compilation copyright on a collection of
generated codelets, I think, but if someone regenerates one then they are
unencumbered.)  So, you can do whatever you want with mdct_128nr.c.

Regarding the other code that I attached, for arbitrary sizes, it's
possible that we could arrange a waiver to allow that particular file to
be licensed under a BSD license.  The version I attached links to the rest
of FFTW, of course, but all it really needs for a single size like 2048 is
a couple of codelets; it could easily be hacked to call those (generated,
non-copyrighted) codelets by hand.  Also, as I mentioned, if you really
care about performance for n=2048, the right thing is to modify the
generator to spit out special-purpose recursive/looping MDCT code
directly, so as to avoid the pre-/post-processing passes.  This would
avoid the copyright issue, since you'd be working almost entirely with
generated code.  However, I don't have time to do this work with the
generator right now, myself.

John Ripley wrote:
> Michael Smith wrote:
> > MDCT performance is insignificant on encode, but takes a
> > substantial (not sure what percentage, but it's non-trivial)
> > amount of time on decode. However, decode is not really a
> > performance problem on 'desktop-class' cpus - it only
> > really matters a lot for embedded use
>
> Still, it does have an application if the code generator could be
> modified to use integer-only instructions on other architectures (e.g
> ARM, or a DSP like 56k). Saving 50% in the MDCT would translate to
> having something like (ball park) 20% more battery life on portables,
> for example. My only concern there is the size of code generated -
> portables tend to have little memory available both for code and data.

We've talked about modifying the generator to spit out integer code;
basically this means adding some extra information to the symbolic nodes
in the generator to keep track of the decimal-point shift, and shouldn't
be terribly difficult.  (At least to do straightforward shifting, like the
your current integer MDCT, from what I recall when I looked at it.  You
can save a few more bits if you make some assumptions about the statistics
of the signal, but I guess it's not worth it if you have 32 bits to play
with.)

You're right that the code-size tradeoff may well be different on an ARM
or whatever runs these portables.  To handle n=2048, however, one would
ideally want to modify the generator to support the recursive case as I
mentioned above, so you could do this for n=256 as well in order to use
smaller codelets.

I'm a little too hosed this month to attempt an integer transplant on the
generator right now, but maybe my co-author Matteo is interested (he wrote
the generator, anyway, so he can hack it much more easily).

(Not that being hosed has prevented me from goofing off with FFTW before. =)

> Block sizes 256 and 2048 are the most common but not the only ones
> encountered, so technically you would need an MDCT for all sizes from
> 64 to 8192. It would be interesting if you could avoid this by
> generating the code on the fly, but then the code generator would have
> to be smaller than the total of all MDCTs :)

The code generator is probably too slow for this, especially on an
embedded machine.  =) My thinking was that you could use
generated/optimized code for the important common cases, and stick with
your generic code for other cases.  Of course, once the generator is
modified to handle n > 256 efficiently (by generating the appropriate
recursive/looped cases), it's not a problem to generate all 8 sizes from
64 to 8192.

> So, how hard would it be to get your code generator integer-only

Not too difficult; no more than a few days, I would expect.

> and non-x86? It would certainly save a hell of a lot of hand optimisation.

It's already non-x86, although we've talked about generating assembly
instead of C.  (It turns out that for FFT-like algorithms one can find an
asymptotically optimal, and pretty darn good in practice, schedule
regardless of the number of registers, so the C code is pretty portable.  
Then, of course, you have to trick the compiler into not screwing it up,
which is becoming increasingly difficult.  ^_^ )

Steven

PS. Please cc me; I can read the mailing-list archives online, but it's
easier to respond to email.

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-dev-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.