[theora-dev] Proposal for replacing asm code with intrinsics
vadim.sukhomlinov at intel.com
Tue Oct 13 06:14:50 PDT 2009
I'm new to Theora and would like to propose several performance optimizations using the newer instruction sets in x86 CPUs (SSE2 through SSE4.2).
Several source files in \x86 and \x86_vc were written using inline assembler. However, this causes several maintenance problems:
1) Need to sync gcc & msvc versions
2) Only 32-bit environments are supported
3) No support for instruction sets newer than MMX
My proposal is to replace all functions written in assembly with compiler intrinsics, each of which compiles into 1-2 assembly instructions and which are much easier to maintain.
For example, _mm_sad_epu8(__m128i, __m128i) compiles into a PSADBW instruction with compiler-allocated registers.
And code like:
Can be re-written into
__m64 mm0, mm4, mm5, mm6, mm7; // of course using meaningful names
mm0 = _mm_add_pi16(mm0, _mm_sad_pu8(mm4, mm5));
The compiler will replace the variables with actual registers and take care of allocating and scheduling them.
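To make the idea concrete, here is a minimal sketch of an 8x8 sum of absolute differences written twice: once in plain C and once with SSE2 intrinsics. The function names (sad8x8_c, sad8x8_sse2) and their signatures are made up for this illustration and are not taken from the Theora sources. The intrinsic version loads each 8-byte row into the low half of an __m128i, so it also builds for 64-bit targets.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdlib.h>

/* Plain C reference. Hypothetical name, not a libtheora function. */
unsigned sad8x8_c(const unsigned char *src, const unsigned char *ref,
                  int stride) {
  unsigned sad = 0;
  assert(src != NULL && ref != NULL);
  for (int y = 0; y < 8; y++) {
    for (int x = 0; x < 8; x++) sad += (unsigned)abs(src[x] - ref[x]);
    src += stride;
    ref += stride;
  }
  return sad;
}

/* Same loop with intrinsics. Hypothetical name, not a libtheora function. */
unsigned sad8x8_sse2(const unsigned char *src, const unsigned char *ref,
                     int stride) {
  __m128i sad = _mm_setzero_si128();
  assert(src != NULL && ref != NULL);
  for (int y = 0; y < 8; y++) {
    /* Load one 8-byte row into the low 64 bits; the high 64 bits are zero. */
    __m128i src_row = _mm_loadl_epi64((const __m128i *)src);
    __m128i ref_row = _mm_loadl_epi64((const __m128i *)ref);
    /* PSADBW: the row's sum of absolute differences lands in the low word. */
    sad = _mm_add_epi16(sad, _mm_sad_epu8(src_row, ref_row));
    src += stride;
    ref += stride;
  }
  /* At most 8*8*255 = 16320, so the 16-bit accumulator cannot overflow. */
  return (unsigned)_mm_cvtsi128_si32(sad);
}
```

Both functions should return the same value for any pair of 8x8 blocks, which makes the C version a convenient reference when testing the intrinsic one.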
So, benefits are:
1) Code that is easier to read and understand, and that can use the same variable names as the generic C version
2) A single source for gcc, msvc and the Intel compiler (all of them support the same syntax)
3) Easier migration to SSE2 (which handles 128 bits at a time vs. 64 with MMX) by replacing __m64 with __m128i
4) 64-bit code generation support
5) The compiler can reschedule instructions based on the target CPU to deliver better performance without manual tuning. In the past I ran several tests where I took high-quality, manually optimized assembly and replaced it with intrinsics, which resulted in 3-5% better performance when using the Intel compiler. Anyway, I don't expect any performance issues with it.
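As a rough illustration of point 3, the sketch below processes 16 pixels per row in one 128-bit register. The name sad16xh_sse2 and its parameters are assumptions for this example only. One detail that goes beyond swapping the type: on 128-bit operands PSADBW produces two partial sums, one per 64-bit half, so they have to be added together at the end.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* SAD over a block 16 pixels wide and `rows` pixels high.
   Hypothetical helper, not part of libtheora. */
unsigned sad16xh_sse2(const unsigned char *src, const unsigned char *ref,
                      int stride, int rows) {
  __m128i acc = _mm_setzero_si128();
  assert(rows >= 0);
  for (int y = 0; y < rows; y++) {
    /* Unaligned loads, since nothing here guarantees 16-byte alignment. */
    __m128i src_row = _mm_loadu_si128((const __m128i *)src);
    __m128i ref_row = _mm_loadu_si128((const __m128i *)ref);
    /* Each 64-bit half receives the SAD of its own 8 bytes. */
    acc = _mm_add_epi32(acc, _mm_sad_epu8(src_row, ref_row));
    src += stride;
    ref += stride;
  }
  /* Fold the high half's partial sum onto the low half. */
  acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
  return (unsigned)_mm_cvtsi128_si32(acc);
}
```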
It will require some changes to the project structure and makefiles, and I'm not sure if this is OK. I also don't know how to coordinate work on Theora with the other developers. Could you please help me here?
Thanks in advance,
From: theora-dev-bounces at xiph.org [mailto:theora-dev-bounces at xiph.org] On Behalf Of theora-dev-request at xiph.org
Sent: Thursday, October 08, 2009 11:00 PM
To: theora-dev at xiph.org
Subject: theora-dev Digest, Vol 65, Issue 2
1. Possible inefficiency in encode.c (Chris Cooksey)
2. Re: Possible inefficiency in encode.c (Timothy B. Terriberry)
Date: Wed, 07 Oct 2009 17:40:43 -0400
From: Chris Cooksey <chriscooksey at gmail.com>
Subject: [theora-dev] Possible inefficiency in encode.c
To: <theora-dev at xiph.org>
Message-ID: <C6F2831B.28AF4%chriscooksey at gmail.com>
Content-Type: text/plain; charset="US-ASCII"
I am very new to Theora, having just started working through the code a few
I am working on a requantization tool to reduce bit rates, hopefully on the
fly, for some video conferencing work.
As I was working through the encoding phase I noticed this line in encode.c:
It's around line 804, but I am working with 1.1b3 sources so it may have
moved a bit.
Anyway, I am thinking that this line might be an adequate substitute:
Because the tokens are now stored in separate per plane arrays instead of
all strung together in one big array like they used to be. I presume the
point of doing that was to eliminate the need for dct_token_offs altogether.
I see dct_token_offs being used in a couple of other places too.
I could be wrong of course. Please don't beat this neophyte up if I am :-)
Date: Wed, 07 Oct 2009 23:37:18 -0400
From: "Timothy B. Terriberry" <tterribe at email.unc.edu>
Subject: Re: [theora-dev] Possible inefficiency in encode.c
To: theora-dev at xiph.org
Message-ID: <4ACD5E6E.8040705 at email.unc.edu>
Content-Type: text/plain; charset=ISO-8859-1
Chris Cooksey wrote:
> Because the tokens are now stored in separate per plane arrays instead of
> all strung together in one big array like they used to be. I presume the
> point of doing that was to eliminate the need for dct_token_offs altogether.
The actual point was so that the token lists could be filled in a
different order than the one in which they will appear in the bitstream.
However, one of the consequences of this is that EOB runs cannot span
lists, even though the bitstream allows it.
This is fixed up after tokenization, before packing the tokens into the
packet, in oc_enc_tokenize_finish(). What this means is that sometimes
the first token in the list must be skipped, because it was an EOB run
that has actually been merged with the last token in a different list.
dct_token_offs marks which lists need to skip such a token (i.e.,
it's always either 0 or 1).
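The fix-up described above can be modelled roughly as follows. This is a simplified sketch, not the code in oc_enc_tokenize_finish(): the types sketch_token and sketch_token_list, the function merge_eob_runs and the skip_first array (standing in for dct_token_offs) are all invented for illustration, and the sketch ignores any upper limit on the length of an EOB run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token model: either an EOB run of some length or anything else. */
typedef struct {
  int is_eob_run;
  int run_length; /* only meaningful when is_eob_run is nonzero */
} sketch_token;

typedef struct {
  sketch_token *tokens;
  size_t count;
} sketch_token_list;

/* Merge an EOB run at the start of a list into an EOB run that ends the
   preceding emitted tokens, and record in skip_first[i] (0 or 1) whether
   list i must skip its first token when packing. */
void merge_eob_runs(sketch_token_list *lists, size_t nlists,
                    unsigned char *skip_first) {
  sketch_token *last = NULL; /* last token that will actually be emitted */
  assert(lists != NULL && skip_first != NULL);
  for (size_t i = 0; i < nlists; i++) {
    sketch_token_list *list = &lists[i];
    skip_first[i] = 0;
    if (list->count == 0) continue;
    if (last != NULL && last->is_eob_run && list->tokens[0].is_eob_run) {
      last->run_length += list->tokens[0].run_length;
      skip_first[i] = 1;
    }
    /* If the whole list was absorbed, `last` keeps pointing at the merged run. */
    if (list->count > skip_first[i]) last = &list->tokens[list->count - 1];
  }
}
```

A packer would then start reading list i at index skip_first[i] instead of 0.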
It would actually probably be faster to keep things in a single
contiguous array, with offsets to the individual lists, just because it
would remove an extra indirection that C compilers generally do a poor
job of optimizing. We did this in the decoder, and it did provide a
small speed-up. I just never got around to doing it in the encoder.
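The two layouts being compared might look like the sketch below. The struct and function names and the value of NUM_LISTS are assumptions made for illustration, not the encoder's or decoder's actual data structures. In the contiguous layout, list i occupies the half-open range [offs[i], offs[i+1]) of a single array.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_LISTS = 3 }; /* hypothetical list count for the example */

/* Layout 1: a separately allocated array per list. */
typedef struct {
  const unsigned char *lists[NUM_LISTS];
  size_t counts[NUM_LISTS];
} separate_tokens;

/* Layout 2: one contiguous array plus offsets to the individual lists. */
typedef struct {
  const unsigned char *tokens;
  size_t offs[NUM_LISTS + 1];
} flat_tokens;

unsigned char separate_get(const separate_tokens *t, int list, size_t i) {
  assert(i < t->counts[list]);
  /* Follows a per-list pointer before reading the token. */
  return t->lists[list][i];
}

unsigned char flat_get(const flat_tokens *t, int list, size_t i) {
  assert(t->offs[list] + i < t->offs[list + 1]);
  /* Indexes one base array; the offset is plain integer arithmetic. */
  return t->tokens[t->offs[list] + i];
}
```

How much the flat layout helps in practice depends on the surrounding code and the compiler, so it would need to be measured.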