[theora-dev] VC2005 MMX patch.

Wed Dec 26 10:56:37 PST 2007

Nils Pipenbrinck wrote:
> Regarding x86-64: The MSVC helpfile tells me that the MMX intrinsics
> with 64 bit operands won't be supported for the 64 bit compiler, so
> additional work is required anyways  (for whatever reason they decided
> to do that).

Microsoft seems to be going out of its way to convince everyone that it
is totally unready for 64 bits. I don't know if this is just
institutional incompetence or something else, but when compared to the
current state of 64-bit Linux it is fairly ridiculous.

> I only did logical changes in the files that are for visual studio. The
> few changes in the other files were required to get a theora building
> with the USE_ASM macro.

I meant stuff like generating constants with 2 mov's and a punpckldq
instead of a pcmpeq, psllw, psrlw, or your unbiased average trick, or
manually unrolling oc_frag_recon_intra_mmx, or...
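
For readers comparing the two ways of building a constant, here is a rough scalar model in C of what each sequence leaves in a 64-bit MMX register. It is only a sketch: the `model_*` helpers and the `const_via_*` functions are hypothetical names, the packed value 8 and the shift counts are arbitrary illustrations rather than values taken from either patch, and reading "2 mov's" as a mov into a general register followed by a movd is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of psllw: shift each of the four 16-bit lanes left by n. */
static uint64_t model_psllw(uint64_t v, unsigned n)
{
    uint64_t r = 0;
    int i;
    assert(n < 16); /* this model only covers in-range shift counts */
    for (i = 0; i < 4; i++) {
        uint16_t w = (uint16_t)(v >> (16 * i));
        r |= (uint64_t)(uint16_t)(w << n) << (16 * i);
    }
    return r;
}

/* Scalar model of psrlw: logical right shift of each 16-bit lane by n. */
static uint64_t model_psrlw(uint64_t v, unsigned n)
{
    uint64_t r = 0;
    int i;
    assert(n < 16);
    for (i = 0; i < 4; i++) {
        uint16_t w = (uint16_t)(v >> (16 * i));
        r |= (uint64_t)(uint16_t)(w >> n) << (16 * i);
    }
    return r;
}

/* Scalar model of punpckldq: low dword of dst stays low, low dword of
   src becomes the high dword. */
static uint64_t model_punpckldq(uint64_t dst, uint64_t src)
{
    return (dst & 0xFFFFFFFFu) | (src << 32);
}

/* pcmpeq reg,reg gives all ones; two word shifts isolate one bit per
   lane. No memory or general-purpose register is involved. */
static uint64_t const_via_shifts(void)
{
    uint64_t mm = ~(uint64_t)0;      /* pcmpeqw mm,mm          */
    mm = model_psllw(mm, 15);        /* 0x8000 in every lane   */
    return model_psrlw(mm, 12);      /* 0x0008 in every lane   */
}

/* Immediate into a general register, movd into the MMX register, then
   punpckldq to duplicate the low dword. */
static uint64_t const_via_mov(void)
{
    uint32_t eax = 0x00080008u;      /* mov eax, imm32         */
    uint64_t mm = eax;               /* movd mm, eax           */
    return model_punpckldq(mm, mm);  /* punpckldq mm, mm       */
}
```

Both functions return the same packed value, so the choice between the sequences is about instruction count, register pressure and readability rather than the result.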

The point is that with a first-time contributor, maintainability of the
code is paramount, and that is much easier if I can match your code
instruction by instruction with the existing code. If it makes
more sense to change the gcc version to match yours in places, I'm
willing to do that, but I'd like to know why.

Also, is it really worth it to do so many loads early (and serially,
since, e.g., the Core2 only has a single memory read port but 3 (well, 2
and a half) arithmetic ports), instead of eating the latency up front
and trying to keep the instruction mix more diverse? A similar question
applies to the serial stores later.

> MSVC neither inlines any function that contains raw assembler nor
> allows macro expansion inside assembler. The IDCT source from GCC

That's insane. Can you at least do something along the lines of
#define FOO _asm{ \
 ... \
}
and build up sequences in separate asm blocks? Or does it insert lots of
garbage housekeeping instructions between them because it doesn't bother
to track which registers are actually used in a block and tries to
preserve everything?

> Simply because the code above is for a single byte. It extends well to a
> full machine-word as well, and in this case you need the AND to prevent
> the LSB of byte1 to shift into MSB of byte0. This only makes sense if
> you process more than one byte at a time of course.

Sorry... I keep forgetting that MMX has next to no byte-level
instructions, so there's no psrlb. I should learn to read.
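
To make the need for the AND concrete, here is a minimal sketch in C of a per-byte right shift done on a whole 32-bit word. The function names are made up for illustration and `uint32_t` stands in for `ogg_uint32_t`.

```c
#include <stdint.h>

/* Shift each of the four packed bytes right by one bit. Bit 0 of every
   byte is cleared first, so nothing can cross into the byte below. */
static uint32_t shr1_bytes_masked(uint32_t x)
{
    return (x & 0xFEFEFEFEu) >> 1;
}

/* The same shift without the mask: bit 0 of each byte leaks into
   bit 7 of its lower neighbour. */
static uint32_t shr1_bytes_naive(uint32_t x)
{
    return x >> 1;
}
```

With the input 0x00000100 (byte1 = 1, byte0 = 0) the masked version returns 0, while the naive version returns 0x00000080, i.e. the LSB of byte1 has become the MSB of byte0.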

> The 32 bit version of the above function should make it clear (hint
> hint: this would be an easy-to-do improvement for oc_frag_recon_inter2_c)
> 
> ogg_uint32_t pavgub4 (ogg_uint32_t a, ogg_uint32_t b)
> {
>  return (a & b) + (((a ^ b) & 0xfefefefe) >> 1);
> }
> 
> I'll measure how it performs against the "add and shift" version
> from the gcc sources, just to be sure that it is faster.

Also, there's an ogg_uint64_t type, so there's no reason to stop at a
32-bit version.
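
A possible 64-bit variant, as a sketch only: `uint64_t` from `<stdint.h>` stands in for `ogg_uint64_t`, and the names `pavgub8` and `pavgub8_ref` are hypothetical. It relies on the identity a + b = 2*(a & b) + (a ^ b), so each byte gets the average rounded down (the MMX/SSE pavgb instruction rounds up instead).

```c
#include <stdint.h>

/* Average eight packed unsigned bytes at once, rounding down.
   The mask clears bit 0 of every byte of (a ^ b) so the shift cannot
   carry a bit into the neighbouring byte. */
static uint64_t pavgub8(uint64_t a, uint64_t b)
{
    return (a & b) + (((a ^ b) & UINT64_C(0xFEFEFEFEFEFEFEFE)) >> 1);
}

/* Byte-by-byte reference used only to check the packed version. */
static uint64_t pavgub8_ref(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    int i;
    for (i = 0; i < 8; i++) {
        unsigned x = (unsigned)(a >> (8 * i)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * i)) & 0xFFu;
        r |= (uint64_t)((x + y) >> 1) << (8 * i);
    }
    return r;
}
```

Whether this actually beats the 32-bit version on a 32-bit target would still need to be measured, since the compiler has to split each 64-bit operation in two there.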

Feel free to optimize the C code as well (in separate patches)... I'm
sure there's plenty of low-hanging fruit. I've done pretty much all of
the algorithmic optimizations I had planned, but little low-level stuff
at all, besides the obvious "avoid unpredictable branches" (and even
those need to be benchmarked to be sure they aren't killing us on things
like the OLPC). There may also be some benefit in re-organizing
structures to fit into cache lines better, which I was planning to take
a look at some day, but haven't.