[opus] [PATCH] 02-Add CELT filter optimizations

Timothy B. Terriberry tterribe at xiph.org
Wed May 22 09:27:03 PDT 2013

Aurélien Zanelli wrote:
> Could you describe it more precisely ? Also, IRC log could help me
> understand, but i don't know if you have it or where i can download it.

The channel has no archived logs that I know of, but here's the relevant 

11:52:32 <+jmspeex> And I agree we need to figure out how to do asm for 
Opus. Any thoughts?
11:52:43 <+jmspeex> I mean beyond just moving stuff to asm/

12:01:48 <+derf> jmspeex: Yes, I do, but let's talk after my 9 AM meeting.
12:04:47 <+jmspeex> ok

12:35:26 <+derf> jmspeex: Anyway, I think we should follow the general 
path we used for libtheora.
12:35:56 <+derf> For ARM, instead of using inline asm, we should use 
RVCT-syntax .s files and convert to gas syntax with a Perl script.
12:36:12 <+derf> And do runtime CPU detection for NEON at a minimum.

12:38:27 <+derf> Haven't worked through the best way to do the latter 
without adding function pointers to the state structs.
12:38:29 <+derf> Perhaps gmaxwell has ideas.
12:39:54 <+derf> It might also be worth looking at something like the 
libvpx approach which uses scripts to generate the headers, etc., for 
RTCD, since right now adding a function requires touching half a dozen 

13:04:23 <+jmspeex> derf: What's the advantage over inline asm?
13:04:57 <+derf> jmspeex: Works on !gcc.
13:05:19 <+derf> Unlike x86, ARM has a well-defined-enough ABI to make 
this practical.
13:05:32 <+jmspeex> OK, so the syntax is generic enough? What about the 
calling convension?
13:05:41 <+derf> That's included in ABI.
13:06:31 <+jmspeex> I mean aren't there multiple calling conventions 
allowed on a certain arch?
13:07:25 <+derf> No, calling conventions are consistent.
13:07:41 <+derf> I mean, there is thumb mode and ARM mode, but that 
doesn't present any complications.
13:08:20 <+jmspeex> I'm talking about fastcall, cdecl, ...
13:08:30 <+derf> No, ARM doesn't have any of that crap.
13:08:55 <+jmspeex> Only a single convention allowed?
13:09:04 <+derf> I've never run into another.
13:09:34 <+jmspeex> OK
13:09:50 <+derf> Some day we'll have to deal with AArch64, but let's 
cross that bridge when we get to it.
13:09:51 <+jmspeex> That was actually one of the main reasons I was 
sticking to inline assembly
13:10:06 <+derf> I know. That's why libtheora uses inline asm for x86.
13:10:09 <+derf> But not for ARM.
13:10:11 <+jmspeex> i.e. letting the compiler sort out the calling 
13:10:18 <+derf> x86 is a fucking mess.
13:10:36 <+jmspeex> derf: What about PIC?
13:10:54 <+derf> Nothing special about PIC in ARM.

13:11:13 < Ilari> What other x86 calling conventions there are besides 
fastcall and cdecl?
13:11:30 <+derf> stdcall
13:11:38 <+derf> (which is, of course non-standard)
13:12:12 < Ilari> Oh yeah, the Win32 shit...
13:12:14 <+derf> There's also issues of whether or not things get 
underscores prefixed to symbol names, 32/64-bit issues, stack alignment, 
13:13:04 <+jmspeex> derf: You mean no reserved register like how x86 
reserves ebx for PIC?
13:13:21 <+derf> jmspeex: No, the IP is just another register like any 
13:14:08 <+derf> There are a few instructions that can't _write_ to it, 
but you're free to use it in indexing calculations as much as you like.
13:14:49 <+gmaxwell> derf: pandaboard is sitting on my desk. Will have 
coverage again soon— I need to find sutable power (and also have a 
static ip at home again to stick it on)
13:15:09 <+derf> gmaxwell: Can't just use dyndns or something?

13:19:34 <+gmaxwell> no no, I have that now, as of friday. Just need to 
find a sutiable psu. Misuse of "and" above.
13:20:10 <+derf> Ah, okay, great.
13:21:53 < Ilari> Does that address have colons in it? :->

13:36:33 <+jmspeex> derf: Last thing, how do you select between the 
assembly and the C version?
13:37:12 <+derf> jmspeex: Well, in libtheora there's a struct of 
function pointers that you fill in when you create an encoder/decoder.
13:37:58 <+jmspeex> I mean for the compile-time assembly
13:37:59 <+derf> But I know how gmaxwell feels about function pointers 
in the codec state, so maybe he's got a better idea.
13:38:11 <+derf> Oh, see configure.ac.
13:38:21 <+derf> ds wrote us some nice m4 macros.

13:38:30 <+jmspeex> so you work on a file by file basis?

13:38:39 <+derf> Huh?
13:39:04 <+jmspeex> can you replace a single function in a C file?
13:39:28 <+derf> Um, configure.ac just sets some #defines.
13:39:42 <+jmspeex> Ah OK
13:41:11 <+jmspeex> As for function pointers, why in the state and not 
in a global struct for the entire library?

13:41:30 <+derf> Because threads.
13:41:47 <+jmspeex> what about threads?
13:42:01 <+derf> What do you mean what about them?
13:42:08 <+jmspeex> you want to run threads on different architectures 
at the same time?
13:42:09 <+derf> Who initializes that global struct?
13:42:27 <+jmspeex> opus_lib_detect_cpu()
13:42:38 <+derf> And if I don't call it?
13:42:43 <+jmspeex> if you don't do that, you get the default values
13:42:56 <+jmspeex> i.e. the settings that work on any arch
13:42:58 <+derf> So in other words, all currently deployed software is 
going to suck.
13:43:19 <+jmspeex> until they add the call, yes.
13:43:33 <+jmspeex> It's still going to work
13:43:50 <+jmspeex> This is an audio codec, not a video codec
13:43:59 <+derf> Seems pretty shitty to me.
13:44:11 <+jmspeex> Otherwise, there's always another easier solution
13:44:26 <+jmspeex> you have multiple global structs (const)
13:44:40 <+jmspeex> then you just have an int in the state that selects 
which one to use
13:44:55 <+derf> Yeah, that is something along the lines of what I was 
13:45:10 <+derf> But not much more secure if that int is out of bounds.
13:45:54 <+jmspeex> It's probably possible to check
13:46:29 <+derf> Sure, but checking on every call is slow.
13:46:48 <+jmspeex> At the very least we can check at API entry points
13:47:17 < ron_> there is pthread_once(), but that's not without 
complications either.

13:47:33 <+gmaxwell> I'd generally rather enum and switch the dispatch, 
but unless there are only a two or three options for a platform that 
becomes a mess... if a function pointer in the codec state makes it 
cleaner than so be it— mostly my opinion is that they should never be 
used when easily avoided.. but it really needs a canary at least.
13:47:40 <+derf> jmspeex: Doesn't help if the buffer overflow occurs in 
the middle of a call, which of course is where it will occur.
13:47:51 <+jmspeex> ron_: I'm not adding a dependency on pthread
13:48:18 < ron_> jmspeex: it can be inside _REENTRANT, and optional.
13:48:20 <+gmaxwell> derf: well, some other part of the caller can stop 
the encoder state... say ... libogg.
13:49:09 <+jmspeex> derf: Then ptr_struct[st->arch&ARCH_MASK]
13:49:55 <+jmspeex> enough to defeat the overflow?
13:50:35 <+derf> jmspeex: It was the best I could come up with.
13:51:33 <+gmaxwell> sounds fine to me.
13:52:38 <+jmspeex> now, the other annoying thing is that I think we'll 
need one struct for celt and a different one for SILK
13:53:17 <+derf> We certainly need separate ints.
13:53:24 <+derf> Since they can't see each other's states.
13:53:53 <+jmspeex> Also I don't want to include silk stuff from celt 
because of opus custom
13:53:57 <+gmaxwell> Why not one struct per function, with one slot in 
each struct for the subarches?
13:55:00 <+jmspeex> gmaxwell: how would we select the subarch. It needs 
to be indexed
13:55:22 <+derf> jmspeex: With the index, like before.
13:55:33 <+derf> This is not a terrible idea.
13:55:55 <+jmspeex> Then I don't understand what he meant
13:56:11 <+derf> Well, he meant "array per function" instead of struct.
13:56:18 <+gmaxwell> ya
13:56:29 <+jmspeex> So there's no struct at all?
13:56:34 <+derf> Exactly.
13:56:36 <+gmaxwell> OP_fir_foo[st->arch&OP_ARCHMASK]
13:57:15 <+jmspeex> sounds like more code and easier to mess up
13:57:30 <+gmaxwell> Hm? it ends up being a macro in any case.
13:58:26 <+jmspeex> I mean if you add a subarch or add a function, you 
need to touch more places, with less warning if you screw something up
13:58:40 < ron_> is it really better if st->arch points to something 
that will explode than if it just points into space?
13:59:18 <+gmaxwell> ron_: yes, its really better if it points to 
something that will throw an illegal option than if its overwritten with 
a call into libc that starts a shell.
13:59:29 <+gmaxwell> s/option/operation/ :)

14:00:40 < ron_> but you need to overwrite the memory following the 
array to do that.  and if you can do that, can't you just overwrite one 
of the functions anyhow?
14:00:46 <+jmspeex> I'm still wondering how useful auto-detection is for 
an audio codec
14:01:23 <+jmspeex> Especially considering that the two most important 
aspects cannot be auto-detected: 1) Fixed vs float and 2) DSP extensions
14:01:38 < ron_> NEON isn't guaranteed to be there, so if you want it it 
needs runtime detection.

14:03:14 <+gmaxwell> likewise, SSE isn't guaranteed to be there on 
x86_32 (think atom based tablets/smartphones)
14:03:28 <+jmspeex> ron_: I mean as opposed to having a generic build on 
(e.g.) smartphones and a very specific set of optimizations for 
individual devices
14:04:19 < ron_> SSE we can actually do separate builds for and put them 
in magic dirs where ld.so will find them if it's available.  but derf 
wanted to choke me the last time I suggested that :)
14:04:29 < ron_> it is what we do for speex though.
14:04:45 <+jmspeex> I mean some x86 might want fixed-point and there's 
no way to make that a run-time setting (not sanely at least)
14:08:11 <+derf> jmspeex: My target is Firefox.
14:08:20 <+derf> There is one Firefox binary for all ARMv7 devices.
14:08:25 -!- sputnik_too [~sputnik at funtoo/user/sputnik-too] has joined #opus
14:08:26 <+derf> Some of them have NEON, some of them don't.
14:08:49 <+derf> So runtime CPU detection is mandatory if you want this 
to land in Firefox.
14:08:59 < ron_> that's exactly the same situation as we'll have for 
debian's armhf port too.
14:09:10 <+derf> And for most Android apps.
14:09:45 <+jmspeex> derf: My question is whether it really matters on a 
14:09:52 <+derf> It does.
14:09:56 <+jmspeex> OK then
14:10:03 <+derf> I'm not sure why you think it doesn't.
14:10:14 <+jmspeex> because audio is much cheaper than video
14:10:38 <+derf> Right now for WebRTC on B2G, video is using 60% of the 
CPU and audio is using 80%.
14:10:50 <+derf> Now, that's because the video is 15 fps QCIF.
14:10:51 <+derf> But still.
14:11:31 <+derf> Why do you think I've been putting energy into this?
14:11:47 <+jmspeex> Well, it's good to know now
14:12:18 <+derf> If that guy on the ML is right that we can cut 
complexity down to 1/3 the current complexity, that will help a lot.
14:12:32 <+derf> Well, current before these patches started.
14:13:45 <+jmspeex> where did he claim that?
14:15:11 <+derf> http://lists.xiph.org/pipermail/opus/2013-May/002066.html
14:47:23 < Ilari> What is "MCPS" (mentioned in that post)?
14:47:47 < Ilari> Millions of Computations Per Second?
14:47:57 < Prodicus> million cycles per second
14:51:55 <+derf> "MHz"
14:52:01 <+derf> (or: what Prodicus said)
15:03:42 < Prodicus> a little while back there were some rumblings here 
about tagging 1.0.3 in the near term; IIRC somebody brought up a blocker 
but I can't remember what. What's the plan?
15:05:24 <+derf> The things on my list are the MSVC project files and 
auditing ALLOC_STACK.
15:29:35 <+gmaxwell> re that list post: when I read it I thought it said 
that they were ignoring float on A9. Am I reading that wrong?  Uh. Thats 
unwise... the float is faster on that chip for me.

15:40:51 <+derf> gmaxwell: Well, not once they've optimized fixed-point!
15:41:20 <+derf> But FWIW, we use fixed-point on all ARM builds, regardless.
15:41:40 <+derf> Because, as jmspeex said, you can't switch at runtime, 
and while fixed-point might be moderately slower on some chips, float is 
disastrously slower on others.
15:41:46 <+gmaxwell> that thought was why I did say for me. :)  But ... 
bleh. arm stinks.
15:41:59 <+derf> So I'm happy to ignore float as well.

More information about the opus mailing list