[opus] Ask for suggestions about optimizing opus on STM32F407
Forrest Zhang
forrest at 263.net
Sun Feb 4 12:06:36 UTC 2018
Hello Thomas and Amit,
The problem has been solved! I really appreciate your help!
Previously I got poor performance on the STM32F407ZG because Opus was running from the external (FSMC) RAM.
With internal RAM it is about 6 to 7 times faster.
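For anyone who wants to try the same change, here is a minimal sketch of one way to keep codec state out of external RAM: a small static arena that hands out aligned chunks. This is only an illustration, not necessarily how my build is set up. The names `arena_store`, `arena_alloc`, `arena_reset` and the 64 KB budget are hypothetical, and the actual placement of the array in internal SRAM depends on the linker script or scatter file, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_BYTES (64u * 1024u)   /* hypothetical budget, tune per build */

/* Backing store. On target, the linker script or scatter file (or a
   toolchain-specific section attribute) would have to place this array
   in internal SRAM; that placement is deliberately not shown here. */
static uint64_t arena_store[ARENA_BYTES / sizeof(uint64_t)];
static size_t arena_used;

/* Hand out 8-byte-aligned chunks; returns NULL when the arena is full. */
void *arena_alloc(size_t size)
{
    size_t rounded = (size + 7u) & ~(size_t)7u;   /* round up to 8 bytes */

    assert(arena_used <= ARENA_BYTES);
    if (rounded < size || rounded > ARENA_BYTES - arena_used)
        return NULL;                              /* overflow or no room */

    void *p = (uint8_t *)arena_store + arena_used;
    arena_used += rounded;
    return p;
}

/* Release everything at once, e.g. between test runs. */
void arena_reset(void)
{
    arena_used = 0;
}
```

With libopus, a chunk of `opus_encoder_get_size(channels)` bytes from such an arena could be passed to `opus_encoder_init()` instead of letting `opus_encoder_create()` allocate from a heap that lives in external RAM. The stack location matters as well, because a USE_ALLOCA build takes its scratch memory from the stack.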
In general, SILK encoding needs more CPU. With the fixed-point build, encoding plus decoding 48kHz stereo audio runs at about 1.68x real time with CELT, but only at about 0.73x real time with SILK.
I also ran a performance test with an Opus audio file in an Ogg container (48kHz sampling, stereo, 48kbps, 4.2 seconds): decode it first, then encode it again immediately.
* Decode: 1468 ms
* Encode: 992 ms
* Total: 2461 ms
That is about 1.7x real time (4200/2461), or about 59% CPU usage.
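The arithmetic behind these two figures can be sketched as follows. The function names are made up for illustration, and the values are kept in integers (hundredths and percent, rounded to nearest) so that the sketch also works on a target without an FPU.

```c
#include <assert.h>
#include <stdint.h>

/* Speed relative to real time, in hundredths (170 means 1.70x),
   rounded to nearest. audio_ms is the duration of the audio,
   cost_ms the measured processing time. */
uint32_t realtime_factor_x100(uint32_t audio_ms, uint32_t cost_ms)
{
    assert(cost_ms > 0);
    return (audio_ms * 100u + cost_ms / 2u) / cost_ms;
}

/* CPU load in percent needed to keep up with real time,
   rounded to nearest. */
uint32_t cpu_usage_percent(uint32_t audio_ms, uint32_t cost_ms)
{
    assert(audio_ms > 0);
    return (cost_ms * 100u + audio_ms / 2u) / audio_ms;
}
```

For the file test, `realtime_factor_x100(4200, 2461)` gives 171 (1.71x) and `cpu_usage_percent(4200, 2461)` gives 59.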
I have included the detailed test results below as a reference for other developers.
* Opus Performance on STM32F407ZG
* ===============================
* 1000ms PCM samples (1150Hz cosine wave, amplitude 0x6000)
* Frame size: 20ms
* Case A: CELT, external memory, bitrate = 2x sampling, FIXED_POINT, DISABLE_FLOAT_API
* Case B: CELT, internal memory, bitrate = 2x sampling, FIXED_POINT, DISABLE_FLOAT_API
* Case C: CELT, internal memory, bitrate = 1x sampling, FIXED_POINT, DISABLE_FLOAT_API
* Case D: CELT, internal memory, bitrate = 2x sampling, FIXED_POINT, FLOAT_API
* Case E: CELT, internal memory, bitrate = 2x sampling, FLOAT, FLOAT_API
* Case F: SILK, internal memory, bitrate = 2x sampling, FLOAT, FLOAT_API
* Case G: SILK, internal memory, bitrate = 2x sampling, FIXED_POINT, FLOAT_API
* Result: encode time + decode time = total cost time (in milliseconds)
* Rate * Chan  A: External, 2x     B: Internal, 2x     C: Internal, 1x     D: FLOAT API, 2x    E: FLOAT, 2x        F: SILK, FLOAT      G: SILK, FIXED
* ===========  ==================  ==================  ==================  ==================  ==================  ==================  ==================
* 48kHz * 2:   2123 + 1533 = 3698   346 +  234 =  587   305 +  216 =  528   352 +  236 =  595   534 +  392 =  937  7817 +  398 = 8367  1013 +  345 = 1374
* 48kHz * 1:   1292 +  907 = 2225   213 +  144 =  361   170 +  121 =  295   214 +  145 =  363   338 +  240 =  584  3922 +  230 = 4215   525 +  196 =  729
* 24kHz * 2:   1862 + 1427 = 3325   298 +  207 =  511   239 +  176 =  402   301 +  209 =  516   443 +  306 =  758  7381 +  288 = 7743   942 +  301 = 1257
* 24kHz * 1:   1058 +  708 = 1860   169 +  119 =  291   141 +  104 =  248   172 +  117 =  293   240 +  175 =  420  3843 +  160 = 4063   479 +  156 =  642
* 16kHz * 2:   1701 + 1372 = 3105   270 +  194 =  469   210 +  164 =  378   269 +  199 =  473   396 +  267 =  670  7384 +  119 = 7646   683 +   93 =  785
* 16kHz * 1:    907 +  708 = 1633   142 +  104 =  249   116 +   99 =  217   144 +  103 =  250   180 +  139 =  323  3651 +   57 = 3766   335 +   42 =  381
* 12kHz * 2:   1509 + 1240 = 2778   225 +  169 =  399   197 +  158 =  359   227 +  171 =  402   235 +  180 =  419  2919 +   53 = 3008   299 +   31 =  333
* 12kHz * 1:    857 +  681 = 1555   136 +   97 =  236   117 +   84 =  203   137 +   96 =  236   159 +  128 =  290  2818 +   44 = 2899   290 +   30 =  323
*  8kHz * 2:   1371 + 1168 = 2567   198 +  157 =  359   191 +  156 =  351   200 +  158 =  362   181 +  148 =  333  2173 +   35 = 2237   246 +   28 =  276
*  8kHz * 1:    761 +  628 = 1404   111 +   92 =  205   106 +   89 =  197   120 +   84 =  206   123 +  100 =  226  2123 +   31 = 2182   255 +   24 =  281
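As a cross-check of the "6 to 7 times" figure, the following sketch computes the speed-up of Case B (internal memory) over Case A (external memory) from the total times in the table. The array and function names are only illustrative; the numbers are copied from the Case A and Case B columns, top row first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROWS 10

/* Total encode + decode time in ms per table row (48kHz*2 ... 8kHz*1). */
static const uint16_t case_a_ms[ROWS] =
    {3698, 2225, 3325, 1860, 3105, 1633, 2778, 1555, 2567, 1404};
static const uint16_t case_b_ms[ROWS] =
    { 587,  361,  511,  291,  469,  249,  399,  236,  359,  205};

/* Speed-up of Case B over Case A for one row, in hundredths
   (630 means 6.30x), rounded to nearest. */
uint32_t speedup_x100(size_t row)
{
    assert(row < ROWS);
    return ((uint32_t)case_a_ms[row] * 100u + case_b_ms[row] / 2u)
           / case_b_ms[row];
}

/* Smallest and largest speed-up over all rows, in hundredths. */
void speedup_range_x100(uint32_t *lo, uint32_t *hi)
{
    *lo = UINT32_MAX;
    *hi = 0;
    for (size_t row = 0; row < ROWS; row++) {
        uint32_t s = speedup_x100(row);
        if (s < *lo) *lo = s;
        if (s > *hi) *hi = s;
    }
}
```

Over the ten rows the speed-up ranges from 6.16x (48kHz mono) to 7.15x (8kHz stereo).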
Sincerely
Forrest
On 2018/1/15 12:31, Forrest Zhang wrote:
> Hello Thomas and Amit,
>
> Thanks for your notice and the detailed decode performance report.
>
> I describe the details of my encode/decode test on STM32F407ZG.
>
> A. opus version: latest 1.2.1 (TI: opus 1.1.2)
> B. KEIL 5.23 (TI: ARM compiler tool chain 5.2.7)
> C. Set up the encoder as below (fs is the sampling frequency):
>    enc = opus_encoder_create(fs, chans, OPUS_APPLICATION_AUDIO, &opus_err);
>    opus_encoder_ctl(enc, OPUS_SET_BITRATE(fs * 2));
>    opus_encoder_ctl(enc, OPUS_SET_BANDWIDTH(OPUS_AUTO));
>    opus_encoder_ctl(enc, OPUS_SET_VBR(1));
>    opus_encoder_ctl(enc, OPUS_SET_VBR_CONSTRAINT(0));
>    opus_encoder_ctl(enc, OPUS_SET_COMPLEXITY(0));
>    opus_encoder_ctl(enc, OPUS_SET_INBAND_FEC(0));
>    opus_encoder_ctl(enc, OPUS_SET_FORCE_CHANNELS(OPUS_AUTO));
>    opus_encoder_ctl(enc, OPUS_SET_DTX(0));
>    opus_encoder_ctl(enc, OPUS_SET_PACKET_LOSS_PERC(0));
>
>    opus_encoder_ctl(enc, OPUS_GET_LOOKAHEAD(&lookahead));
>    opus_encoder_ctl(enc, OPUS_SET_LSB_DEPTH(16));
>    opus_encoder_ctl(enc, OPUS_SET_EXPERT_FRAME_DURATION(OPUS_FRAMESIZE_20_MS));
>    /* CELT is faster than SILK? */
>    opus_encoder_ctl(enc, OPUS_SET_FORCE_MODE(MODE_CELT_ONLY));
> D. Generate 20ms of PCM sample data (cosine wave with amplitude 0x6000 and
> frequency about 1150 Hz).
> E. Encode the PCM data and decode it immediately, measuring the CPU time used.
> F. Repeat until the test duration is reached (1000ms or 10000ms).
> G. Summary of the STM32F407 test results:
>    Mode   Sample  Chan  Freq.  Duration   Encode + Decode =   Total
>    FLOAT  48kHz   2     1150     1000ms   2735ms + 3367ms =  6102ms
>
>    FIXED  48kHz   2     1150     1000ms   2112ms + 1543ms =  3698ms
>    FIXED  48kHz   1     1150     1000ms   1312ms +  911ms =  2249ms
>    FIXED  24kHz   1     1150     1000ms   1067ms +  783ms =  1872ms
>    FIXED  16kHz   1     1150     1000ms    922ms +  711ms =  1651ms
>    FIXED  12kHz   1     1150     1000ms   1296ms +  193ms =  1507ms
>    FIXED   8kHz   2     1150     1000ms   1014ms +  147ms =  1181ms
>    FIXED   8kHz   1     1150     1000ms   1086ms +  135ms =  1241ms
>    FIXED   8kHz   1     1150    10000ms  11206ms + 1318ms = 12544ms
> H. Build Options
> FLOAT: OPUS_BUILD,USE_ALLOCA,CUSTOM_SUPPORT
> FIXED: OPUS_BUILD,USE_ALLOCA,CUSTOM_SUPPORT,FIXED_POINT,DISABLE_FLOAT_API
>
> Note: the target bit rate is twice the sampling frequency. That is, the bit
> rate is 96kbps when the sampling frequency is 48kHz.
>
> The CPU usage is about 91% (911ms/1000ms) when decoding 48kHz/mono/96kbps, but
> encoding requires more CPU (about 131%, 1312ms/1000ms).
>
> I will try lower bit rate and update the result later.
>
> Sincerely
> Forrest
>
> On Sunday, January 14, 2018 9:05:44 AM CST Thomas Böhm wrote:
>> Hello Forrest,
>> some years ago I developed a network media player based on an
>> STM32F407ZGT6 (168MHz clock) and opus 1.1.
>> I used just the fixed point code and did no particular optimization on
>> the opus code itself because the performance was already quite good, see
>> figures below.
>> The figures are for real time playback with different frame sizes and
>> various constant bit rates.
>> I didn't play that much with encoding, but I'm convinced that the 32F407
>> is powerful enough to do the job, if you use all its capabilities.
>>
>> The most important thing is to use the hardware features of the processor,
>> such as the DMA controller or the CRC calculation unit (if you deal with
>> Ogg), to offload the CPU.
>>
>> SILK narrow band, a) mono b) stereo:
>>
>> SILK medium band, a) mono b) stereo:
>>
>> Hybrid wide band, a) mono b) stereo:
>>
>> Hybrid super wide band, a) mono b) stereo:
>>
>> Hybrid full band, a) mono b) stereo:
>>
>>
>> CELT full band mono:
>>
>> CELT full band stereo:
>>
>> Regards,
>> Thomas
>>
>> Am 06.01.2018 um 10:02 schrieb forrest:
>>> Dear Developers,
>>>
>>>
>>> I made an Opus 1.2.1 codec build for the STM32F407 (fixed-point, with the
>>> float API disabled).
>>>
>>> It seems too slow for a VoIP application.
>>>
>>>
>>> Case 1:
>>>
>>> 48KHz Sampling rate, Stereo, VBR, frame size: 20ms, Bit-rates: 96kbps
>>>
>>> Encode cost: 2.11x real time
>>>
>>> Decode cost: 1.54x real time
>>>
>>> Encode + Decode: 3.65x
>>>
>>>
>>> Case 2:
>>>
>>> 8KHz Sampling rate, Mono, VBR, frame size: 20ms, Bit-rates: 16kbps
>>>
>>> Encode cost: 1.08x real time
>>>
>>> Decode cost: 0.14x real time
>>>
>>> Encode + Decode: 1.24x
>>>
>>>
>>> Are there any available optimizations or suggestions for Cortex-M4?
>>>
>>>
>>> As far as I know, the TI TM4C129x is also based on the Cortex-M4:
>>>
>>> http://www.ti.com/tool/TIDM-TM4C129POEAUDIO
>>>
>>>
>>> The performance of Opus on it is good enough for mono at a 48kHz sampling rate.
>>>
>>> CPU usage is only about 70% at 120MHz when encoding and decoding at the same time.
>>>
>>>
>>> Sincerely
>>>
>>> Forrest
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> opus mailing list
>>> opus at xiph.org
>>> http://lists.xiph.org/mailman/listinfo/opus
>
>
>