[theora-dev] SSE2 assembly support
Kay Tiong Khoo
kkhoo at rotateright.com
Thu Feb 11 00:58:13 PST 2010
Thanks for all the info and advice.
I profiled example_encoder with a statistical sampler while it encoded the deadline_cif.y4m media file. Below are the top 10 functions sorted by "Self" samples. "Total" samples are those that occur in the symbol or its children.
OS: CentOS release 5.4 (Final) 2.6.18-164.el5
Processor: 4 x 2.40GHz Intel Core 2
Self    Total   Symbol
22.7%   22.7%   oc_analyze_mb_mode_luma
16.0%   16.0%   oc_enc_frag_satd2_thresh_mmxext
13.0%   13.0%   oc_enc_frag_satd_thresh_mmxext
12.7%   12.7%   oc_enc_tokenize_ac
 5.7%   22.3%   oc_enc_block_transform_quantize
 5.0%    5.0%   oc_analyze_mb_mode_chroma
 4.0%   95.4%   oc_enc_analyze_inter
 2.7%    7.0%   oc_mcenc_search_frame
 2.6%    2.6%   oc_enc_fdct8x8_mmx
 1.7%   33.4%   oc_cost_inter
The encoder was compiled with:
CFLAGS="-Wall -Wno-parentheses -g -O3 -fforce-addr -fno-omit-frame-pointer -finline-functions -funroll-loops"
The profile concurs with Timothy's assessment. The three optimized MMX/MMXEXT functions in the top 10 account for roughly 32% of the Self samples (16.0% + 13.0% + 2.6%), so the room for improvement by conversion to SSE2 is limited. I will try some opportunistic optimizations before starting on the conversion work.
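As a rough sanity check on that limit, here is a minimal sketch that applies Amdahl's law to the Self percentages of the three MMX/MMXEXT functions listed above. The function names and the kernel speedup factors are hypothetical and chosen only for illustration; none of the factors are measurements, and MMX functions outside the top 10 are not counted.

```c
#include <assert.h>
#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction p of the run time
 * is made s times faster and the rest is left unchanged. */
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

/* Combined Self share of oc_enc_frag_satd2_thresh_mmxext,
 * oc_enc_frag_satd_thresh_mmxext and oc_enc_fdct8x8_mmx,
 * taken from the profile table. */
double mmx_self_share(void) {
    return 0.160 + 0.130 + 0.026;
}

/* Print the estimate for a few assumed kernel speedups.
 * The factors are hypothetical, not measured. */
void print_simd_estimates(void) {
    static const double factors[] = {1.5, 2.0, 4.0};
    const double p = mmx_self_share();
    size_t i;
    for (i = 0; i < sizeof(factors) / sizeof(factors[0]); i++) {
        printf("kernels %.1fx faster -> encoder %.2fx faster\n",
               factors[i], amdahl_speedup(p, factors[i]));
    }
    /* Upper bound if the three kernels took no time at all. */
    printf("upper bound: %.2fx\n", 1.0 / (1.0 - p));
}
```

Calling print_simd_estimates() from a main() shows the shape of the trade-off: under these assumptions, doubling the speed of those three kernels gives about a 1.19x faster encode overall, and the bound is about 1.46x even if they cost nothing.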
On Feb 11, 2010, at 7:30 AM, Timothy B. Terriberry wrote:
> There is some room for SSE2 optimizations (I just committed some earlier
> today), but right now the slowest functions in the encoder are all in C.
> A few of these could benefit from SIMD, but algorithmic optimizations
> will be both easier and give bigger performance improvements. Many of
> the existing SIMD functions operate on 8x8 blocks, and so MMX is
> generally enough to extract the maximum amount of parallelism.
> Restructuring things to operate on larger blocks when possible is a good
> idea, but a lot more work.
> Finally, I am not generally a fan of intrinsics because a) their
> portability is overrated and b) last I checked, compilers generate
> horrible code from them. The current inline asm already works for 32-bit
> and 64-bit platforms, except on Windows, but that is MSVC's fault.