[theora-dev] SSE2 assembly support

Kay Tiong Khoo kkhoo at rotateright.com
Thu Feb 11 00:58:13 PST 2010


Thanks for all the info and advice.
I profiled example_encoder with a statistical sampler while it encoded the deadline_cif.y4m media file. Below are the top 10 functions sorted by "Self" samples. "Self" counts samples taken in the symbol itself; "Total" also includes samples taken in its children.

OS: CentOS release 5.4 (Final) 2.6.18-164.el5
Processor: 4 x 2.40GHz Intel Core 2

      Self      Total Symbol 
     22.7%      22.7% oc_analyze_mb_mode_luma 
     16.0%      16.0% oc_enc_frag_satd2_thresh_mmxext
     13.0%      13.0% oc_enc_frag_satd_thresh_mmxext 
     12.7%      12.7% oc_enc_tokenize_ac 
      5.7%      22.3% oc_enc_block_transform_quantize 
      5.0%       5.0% oc_analyze_mb_mode_chroma 
      4.0%      95.4% oc_enc_analyze_inter 
      2.7%       7.0% oc_mcenc_search_frame 
      2.6%       2.6% oc_enc_fdct8x8_mmx 
      1.7%      33.4% oc_cost_inter  

The encoder was compiled with:

CFLAGS="-Wall -Wno-parentheses -g -O3 -fforce-addr -fno-omit-frame-pointer -finline-functions -funroll-loops"

The profile agrees with Timothy's assessment. The optimized MMX/MMXEXT functions account for ~30% of the samples (16.0% + 13.0% + 2.6% = 31.6% among the top 10 above), so the room for improvement from converting them to SSE2 is limited. I will try some opportunistic optimizations before starting on the conversion work.

Kay Khoo
RotateRight, LLC

On Feb 11, 2010, at 7:30 AM, Timothy B. Terriberry wrote:

> There is some room for SSE2 optimizations (I just committed some earlier
> today), but right now the slowest functions in the encoder are all in C.
>  A few of these could benefit from SIMD, but algorithmic optimizations
> will be both easier and give bigger performance improvements. Many of
> the existing SIMD functions operate on 8x8 blocks, and so MMX is
> generally enough to extract the maximum amount of parallelism.
> Restructuring things to operate on larger blocks when possible is a good
> idea, but a lot more work.
> Finally, I am not generally a fan of intrinsics because a) their
> portability is overrated and b) last I checked, compilers generate
> horrible code from them. The current inline asm already works for 32-bit
> and 64-bit platforms, except on Windows, but that is MSVC's fault.
