[opus] [PATCH] 02-Add CELT filter optimizations
Timothy B. Terriberry
tterribe at xiph.org
Wed May 22 09:27:03 PDT 2013
Aurélien Zanelli wrote:
> Could you describe it more precisely ? Also, IRC log could help me
> understand, but i don't know if you have it or where i can download it.
The channel has no archived logs that I know of, but here's the relevant
conversation:
11:52:32 <+jmspeex> And I agree we need to figure out how to do asm for
Opus. Any thoughts?
11:52:43 <+jmspeex> I mean beyond just moving stuff to asm/
12:01:48 <+derf> jmspeex: Yes, I do, but let's talk after my 9 AM meeting.
12:04:47 <+jmspeex> ok
12:35:26 <+derf> jmspeex: Anyway, I think we should follow the general
path we used for libtheora.
12:35:56 <+derf> For ARM, instead of using inline asm, we should use
RVCT-syntax .s files and convert to gas syntax with a Perl script.
12:36:12 <+derf> And do runtime CPU detection for NEON at a minimum.
12:38:27 <+derf> Haven't worked through the best way to do the latter
without adding function pointers to the state structs.
12:38:29 <+derf> Perhaps gmaxwell has ideas.
12:39:54 <+derf> It might also be worth looking at something like the
libvpx approach which uses scripts to generate the headers, etc., for
RTCD, since right now adding a function requires touching half a dozen
places.
13:04:23 <+jmspeex> derf: What's the advantage over inline asm?
13:04:57 <+derf> jmspeex: Works on !gcc.
13:05:19 <+derf> Unlike x86, ARM has a well-defined-enough ABI to make
this practical.
13:05:32 <+jmspeex> OK, so the syntax is generic enough? What about the
calling convension?
13:05:41 <+derf> That's included in ABI.
13:06:31 <+jmspeex> I mean aren't there multiple calling conventions
allowed on a certain arch?
13:07:25 <+derf> No, calling conventions are consistent.
13:07:41 <+derf> I mean, there is thumb mode and ARM mode, but that
doesn't present any complications.
13:08:20 <+jmspeex> I'm talking about fastcall, cdecl, ...
13:08:30 <+derf> No, ARM doesn't have any of that crap.
13:08:55 <+jmspeex> Only a single convention allowed?
13:09:04 <+derf> I've never run into another.
13:09:34 <+jmspeex> OK
13:09:50 <+derf> Some day we'll have to deal with AArch64, but let's
cross that bridge when we get to it.
13:09:51 <+jmspeex> That was actually one of the main reasons I was
sticking to inline assembly
13:10:06 <+derf> I know. That's why libtheora uses inline asm for x86.
13:10:09 <+derf> But not for ARM.
13:10:11 <+jmspeex> i.e. letting the compiler sort out the calling
convention
13:10:18 <+derf> x86 is a fucking mess.
13:10:36 <+jmspeex> derf: What about PIC?
13:10:54 <+derf> Nothing special about PIC in ARM.
13:11:13 < Ilari> What other x86 calling conventions there are besides
fastcall and cdecl?
13:11:30 <+derf> stdcall
13:11:38 <+derf> (which is, of course non-standard)
13:12:12 < Ilari> Oh yeah, the Win32 shit...
13:12:14 <+derf> There's also issues of whether or not things get
underscores prefixed to symbol names, 32/64-bit issues, stack alignment,
etc.
13:13:04 <+jmspeex> derf: You mean no reserved register like how x86
reserves ebx for PIC?
13:13:21 <+derf> jmspeex: No, the IP is just another register like any
other.
13:14:08 <+derf> There are a few instructions that can't _write_ to it,
but you're free to use it in indexing calculations as much as you like.
13:14:49 <+gmaxwell> derf: pandaboard is sitting on my desk. Will have
coverage again soon— I need to find sutable power (and also have a
static ip at home again to stick it on)
13:15:09 <+derf> gmaxwell: Can't just use dyndns or something?
13:19:34 <+gmaxwell> no no, I have that now, as of friday. Just need to
find a sutiable psu. Misuse of "and" above.
13:20:10 <+derf> Ah, okay, great.
13:21:53 < Ilari> Does that address have colons in it? :->
13:36:33 <+jmspeex> derf: Last thing, how do you select between the
assembly and the C version?
13:37:12 <+derf> jmspeex: Well, in libtheora there's a struct of
function pointers that you fill in when you create an encoder/decoder.
13:37:58 <+jmspeex> I mean for the compile-time assembly
13:37:59 <+derf> But I know how gmaxwell feels about function pointers
in the codec state, so maybe he's got a better idea.
13:38:11 <+derf> Oh, see configure.ac.
13:38:21 <+derf> ds wrote us some nice m4 macros.
13:38:30 <+jmspeex> so you work on a file by file basis?
13:38:39 <+derf> Huh?
13:39:04 <+jmspeex> can you replace a single function in a C file?
13:39:28 <+derf> Um, configure.ac just sets some #defines.
13:39:42 <+jmspeex> Ah OK
13:41:11 <+jmspeex> As for function pointers, why in the state and not
in a global struct for the entire library?
13:41:30 <+derf> Because threads.
13:41:47 <+jmspeex> what about threads?
13:42:01 <+derf> What do you mean what about them?
13:42:08 <+jmspeex> you want to run threads on different architectures
at the same time?
13:42:09 <+derf> Who initializes that global struct?
13:42:27 <+jmspeex> opus_lib_detect_cpu()
13:42:38 <+derf> And if I don't call it?
13:42:43 <+jmspeex> if you don't do that, you get the default values
13:42:56 <+jmspeex> i.e. the settings that work on any arch
13:42:58 <+derf> So in other words, all currently deployed software is
going to suck.
13:43:19 <+jmspeex> until they add the call, yes.
13:43:33 <+jmspeex> It's still going to work
13:43:50 <+jmspeex> This is an audio codec, not a video codec
13:43:59 <+derf> Seems pretty shitty to me.
13:44:11 <+jmspeex> Otherwise, there's always another easier solution
13:44:26 <+jmspeex> you have multiple global structs (const)
13:44:40 <+jmspeex> then you just have an int in the state that selects
which one to use
13:44:55 <+derf> Yeah, that is something along the lines of what I was
thinking.
13:45:10 <+derf> But not much more secure if that int is out of bounds.
13:45:54 <+jmspeex> It's probably possible to check
13:46:29 <+derf> Sure, but checking on every call is slow.
13:46:48 <+jmspeex> At the very least we can check at API entry points
13:47:17 < ron_> there is pthread_once(), but that's not without
complications either.
13:47:33 <+gmaxwell> I'd generally rather enum and switch the dispatch,
but unless there are only a two or three options for a platform that
becomes a mess... if a function pointer in the codec state makes it
cleaner than so be it— mostly my opinion is that they should never be
used when easily avoided.. but it really needs a canary at least.
13:47:40 <+derf> jmspeex: Doesn't help if the buffer overflow occurs in
the middle of a call, which of course is where it will occur.
13:47:51 <+jmspeex> ron_: I'm not adding a dependency on pthread
13:48:18 < ron_> jmspeex: it can be inside _REENTRANT, and optional.
13:48:20 <+gmaxwell> derf: well, some other part of the caller can stop
the encoder state... say ... libogg.
13:49:09 <+jmspeex> derf: Then ptr_struct[st->arch&ARCH_MASK]
13:49:55 <+jmspeex> enough to defeat the overflow?
13:50:35 <+derf> jmspeex: It was the best I could come up with.
13:51:33 <+gmaxwell> sounds fine to me.
13:52:38 <+jmspeex> now, the other annoying thing is that I think we'll
need one struct for celt and a different one for SILK
13:53:17 <+derf> We certainly need separate ints.
13:53:24 <+derf> Since they can't see each other's states.
13:53:53 <+jmspeex> Also I don't want to include silk stuff from celt
because of opus custom
13:53:57 <+gmaxwell> Why not one struct per function, with one slot in
each struct for the subarches?
13:55:00 <+jmspeex> gmaxwell: how would we select the subarch. It needs
to be indexed
13:55:22 <+derf> jmspeex: With the index, like before.
13:55:33 <+derf> This is not a terrible idea.
13:55:55 <+jmspeex> Then I don't understand what he meant
13:56:11 <+derf> Well, he meant "array per function" instead of struct.
13:56:18 <+gmaxwell> ya
13:56:29 <+jmspeex> So there's no struct at all?
13:56:34 <+derf> Exactly.
13:56:36 <+gmaxwell> OP_fir_foo[st->arch&OP_ARCHMASK]
13:57:15 <+jmspeex> sounds like more code and easier to mess up
13:57:30 <+gmaxwell> Hm? it ends up being a macro in any case.
13:58:26 <+jmspeex> I mean if you add a subarch or add a function, you
need to touch more places, with less warning if you screw something up
13:58:40 < ron_> is it really better if st->arch points to something
that will explode than if it just points into space?
13:59:18 <+gmaxwell> ron_: yes, its really better if it points to
something that will throw an illegal option than if its overwritten with
a call into libc that starts a shell.
13:59:29 <+gmaxwell> s/option/operation/ :)
14:00:40 < ron_> but you need to overwrite the memory following the
array to do that. and if you can do that, can't you just overwrite one
of the functions anyhow?
14:00:46 <+jmspeex> I'm still wondering how useful auto-detection is for
an audio codec
14:01:23 <+jmspeex> Especially considering that the two most important
aspects cannot be auto-detected: 1) Fixed vs float and 2) DSP extensions
14:01:38 < ron_> NEON isn't guaranteed to be there, so if you want it it
needs runtime detection.
14:03:14 <+gmaxwell> likewise, SSE isn't guaranteed to be there on
x86_32 (think atom based tablets/smartphones)
14:03:28 <+jmspeex> ron_: I mean as opposed to having a generic build on
(e.g.) smartphones and a very specific set of optimizations for
individual devices
14:04:19 < ron_> SSE we can actually do separate builds for and put them
in magic dirs where ld.so will find them if it's available. but derf
wanted to choke me the last time I suggested that :)
14:04:29 < ron_> it is what we do for speex though.
14:04:45 <+jmspeex> I mean some x86 might want fixed-point and there's
no way to make that a run-time setting (not sanely at least)
14:08:11 <+derf> jmspeex: My target is Firefox.
14:08:20 <+derf> There is one Firefox binary for all ARMv7 devices.
14:08:25 -!- sputnik_too [~sputnik at funtoo/user/sputnik-too] has joined #opus
14:08:26 <+derf> Some of them have NEON, some of them don't.
14:08:49 <+derf> So runtime CPU detection is mandatory if you want this
to land in Firefox.
14:08:59 < ron_> that's exactly the same situation as we'll have for
debian's armhf port too.
14:09:10 <+derf> And for most Android apps.
14:09:45 <+jmspeex> derf: My question is whether it really matters on a
smartphone
14:09:52 <+derf> It does.
14:09:56 <+jmspeex> OK then
14:10:03 <+derf> I'm not sure why you think it doesn't.
14:10:14 <+jmspeex> because audio is much cheaper than video
14:10:38 <+derf> Right now for WebRTC on B2G, video is using 60% of the
CPU and audio is using 80%.
14:10:50 <+derf> Now, that's because the video is 15 fps QCIF.
14:10:51 <+derf> But still.
14:11:31 <+derf> Why do you think I've been putting energy into this?
14:11:47 <+jmspeex> Well, it's good to know now
14:12:18 <+derf> If that guy on the ML is right that we can cut
complexity down to 1/3 the current complexity, that will help a lot.
14:12:32 <+derf> Well, current before these patches started.
14:13:45 <+jmspeex> where did he claim that?
14:15:11 <+derf> http://lists.xiph.org/pipermail/opus/2013-May/002066.html
14:47:23 < Ilari> What is "MCPS" (mentioned in that post)?
14:47:47 < Ilari> Millions of Computations Per Second?
14:47:57 < Prodicus> million cycles per second
14:51:55 <+derf> "MHz"
14:52:01 <+derf> (or: what Prodicus said)
15:03:42 < Prodicus> a little while back there were some rumblings here
about tagging 1.0.3 in the near term; IIRC somebody brought up a blocker
but I can't remember what. What's the plan?
15:05:24 <+derf> The things on my list are the MSVC project files and
auditing ALLOC_STACK.
15:29:35 <+gmaxwell> re that list post: when I read it I thought it said
that they were ignoring float on A9. Am I reading that wrong? Uh. Thats
unwise... the float is faster on that chip for me.
15:40:51 <+derf> gmaxwell: Well, not once they've optimized fixed-point!
15:41:20 <+derf> But FWIW, we use fixed-point on all ARM builds, regardless.
15:41:40 <+derf> Because, as jmspeex said, you can't switch at runtime,
and while fixed-point might be moderately slower on some chips, float is
disastrously slower on others.
15:41:46 <+gmaxwell> that thought was why I did say for me. :) But ...
bleh. arm stinks.
15:41:59 <+derf> So I'm happy to ignore float as well.
More information about the opus
mailing list