[theora-dev] VC2005 MMX patch.

Wed Dec 26 11:56:36 PST 2007

Hi Timothy,
> I meant stuff like generating constants with 2 mov's and a punpckldq
> instead of a pcmpeq, psllw, psrlw, or your unbiased average trick, or
> manually unrolling oc_frag_recon_intra_mmx, or...
>   

Oh! That stuff - I just did it along the way while writing the functions. I 
had a profiler running side by side while programming and couldn't 
resist optimizing the functions as I typed.

The add and shift version of frag_recon_inter2 is a tad faster btw. I 
don't know why but in practice it does make a difference. I'll change 
the code back to the GCC version.
> The point is that with a first-time contributor, maintainability of the
> code is paramount above all else, and that is much easier if I can match
> instruction by instruction your code with the existing code. 
Absolutely understandable. Will keep that in mind for the future :-)

> Also, is it really worth it to do so many loads early (and serially,
> since, e.g., the Core2 only has a single memory read port but 3 (well, 2
> and a half) arithmetic ports), instead of eating the latency up front
> and trying to keep the instruction mix more diverse? A similar question
> applies to the serial stores later.
>   
Imho you can't load early enough, since most loads will wait on the cache 
to deliver data. If you issue the load later, you will wait longer at 
the next instruction that uses the register. If, otoh, the loads hit 
the cache, nothing is lost by grouping them.

Some architectures (modern ARM and MIPS) love stores in groups. The CPU 
somehow detects cases where stores like this fill an entire cache line, 
so it can avoid first filling the cache line with what was previously 
in main memory. That saves bandwidth.

Doing it this way worked very well for me in the past. I have no idea if 
the Core2 likes it though.

> That's insane. Can you at least do something along the lines of
> #define FOO _asm{ \
>  ... \
> }
>   
If you do something like this:

#define mytest(A,B) _asm   { \
  mov eax, (A)               \
  mov ebx, (B)               \
}

MSVC will see the two mov instructions on one line (the preprocessor 
splices the backslash-continued lines together) and complain. That's 
stupid, ain't it? If you know a way to somehow add artificial newlines 
into macros like these, expansion would be possible and I'll do a 
readable IDCT :-)

> and build up sequences in separate asm blocks? Or does it insert lots of
> garbage housekeeping instructions between them because it doesn't bother
> to track which registers are actually used in a block and tries to
> preserve everything.
>   
It does this, but only on a function level. This is a good thing because 
the inline-assembly code can be written to be independent of the calling 
conventions.

> Also, there's an ogg_uint64_t type, so there's no reason to stop at a
> 32-bit version.
>   
True, but neither GCC nor MSVC has good support for 64 bit integers. All 
kinds of simple operations end up being subroutine calls into the RTL 
(try to shift a 64 bit integer left by one on a 32 bit machine and 
you'll see what I mean). It's sad, since such code would be perfect for 
writing fast 32 and 64 bit routines in a compatible way.
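
To make the shift example concrete, here is a rough sketch of what a 
64 bit shift left by one boils down to on a 32 bit machine. The names 
(u64_pair, shl1_pair, shl1_native) are made up for illustration and are 
not taken from the Theora sources:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a 64 bit value kept as two 32 bit halves,
   which is roughly what a 32 bit target has to work with. */
typedef struct { uint32_t lo; uint32_t hi; } u64_pair;

/* Shift left by one: the top bit of the low half carries into the
   high half. */
static u64_pair shl1_pair(u64_pair v) {
  u64_pair r;
  r.hi = (v.hi << 1) | (v.lo >> 31);
  r.lo = v.lo << 1;
  return r;
}

/* The same operation on a native 64 bit integer, for comparison. */
static uint64_t shl1_native(uint64_t v) {
  return v << 1;
}

/* Quick self-check: both versions must agree for a given input. */
static void check_shl1(uint32_t hi, uint32_t lo) {
  u64_pair p, r;
  uint64_t n;
  p.lo = lo;
  p.hi = hi;
  r = shl1_pair(p);
  n = shl1_native(((uint64_t)hi << 32) | lo);
  assert(r.lo == (uint32_t)n);
  assert(r.hi == (uint32_t)(n >> 32));
}
```

Two shifts and an OR are all that is needed, which is why a call into 
the runtime library for this feels so wasteful.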

I'll take a look at the C code. Regarding the structure reorganization:

I've found out that the main bottleneck at the moment is the way theora 
stores the frames internally. It would be a huge undertaking to change 
this, but if the yuv-planes were stored in a way that 8x8 blocks are 
linear in memory, the number of cache misses would go down dramatically.
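
Just to illustrate the idea, a minimal sketch of the address 
calculation for both layouts. The function names are made up and the 
block-linear layout is only an assumption of how it could look, not 
anything that exists in theora today:

```c
#include <assert.h>
#include <stddef.h>

/* Offset of pixel (x, y) in a conventional raster plane: the eight
   rows of one 8x8 block are a full stride apart. */
static size_t raster_offset(size_t x, size_t y, size_t stride) {
  return y * stride + x;
}

/* Offset of pixel (x, y) when each 8x8 block occupies 64 consecutive
   bytes and blocks are stored left to right, top to bottom.
   Assumes the plane width is a multiple of 8. */
static size_t block_linear_offset(size_t x, size_t y, size_t width) {
  size_t blocks_per_row;
  assert(width % 8 == 0 && x < width);
  blocks_per_row = width / 8;
  return ((y / 8) * blocks_per_row + (x / 8)) * 64  /* start of block */
       + (y % 8) * 8 + (x % 8);                     /* pixel in block */
}
```

With 8 bit samples a whole block is 64 contiguous bytes, so reading it 
touches one or two cache lines instead of up to eight separate ones.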

Nearly all memory accesses through the source2 pointer in 
frag_recon_inter2 miss the cache. That's the reason why this function 
shows up so high in the profiles.

Btw - any reason not to use restrict pointers? All compilers I've worked 
with over the years (quite a few) support it nowadays. I know it's a 
C99 feature, but it can simply be defined away without causing any 
problems. This could give another nice speed improvement, since the 
compiler does not need to guess about aliasing anymore.
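
Something along these lines is what I have in mind. OC_RESTRICT and 
avg_pixels are hypothetical names for this sketch, and the function is 
just a plain averaging loop, not the real frag_recon_inter2:

```c
#include <assert.h>
#include <stddef.h>

/* Use the C99 keyword where available, the compiler extension on GCC
   and MSVC otherwise, and define it away everywhere else. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
# define OC_RESTRICT restrict
#elif defined(__GNUC__) || defined(_MSC_VER)
# define OC_RESTRICT __restrict
#else
# define OC_RESTRICT
#endif

/* Average two source rows into dst. The qualifiers promise the compiler
   that dst does not overlap src1 or src2, so it is free to keep loaded
   values in registers and reorder loads and stores. */
static void avg_pixels(unsigned char *OC_RESTRICT dst,
                       const unsigned char *OC_RESTRICT src1,
                       const unsigned char *OC_RESTRICT src2,
                       size_t n) {
  size_t i;
  assert(dst != NULL && src1 != NULL && src2 != NULL);
  for (i = 0; i < n; i++) {
    dst[i] = (unsigned char)((src1[i] + src2[i]) >> 1);
  }
}
```

On a compiler that knows neither spelling the macro expands to nothing 
and the code still builds, just without the aliasing hint.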

  Nils