How do I optimise a filter loop for Cortex-M3?

Question

I need to alter this code so that it performs the same basic function but is better optimised. I think the filter loop is the main part that can be changed, as it seems to contain too many instructions, but I don't know where to start. I am working with the Cortex-M3 and Thumb-2.

I have tried modifying the filter loop so that it adds the previous number stored in a register and divides the result by 8, but I don't know how to actually implement that.

; Perform in-place filtering of data supplied in memory
; the filter to be applied is a non-recursive filter of the form
; y[0] = x[-2]/8 + x[-1]/8 + x[0]/4 + x[1]/8 + x[2]/8

  ; set up the exception addresses
  THUMB
  AREA RESET, CODE, READONLY
  EXPORT __Vectors
  EXPORT Reset_Handler
__Vectors 
  DCD 0x00180000     ; top of the stack 
  DCD Reset_Handler  ; reset vector - where the program starts

num_words EQU (end_source-source)/4  ; number of input values
filter_length EQU 5  ; number of filter taps (values)

  AREA 2a_Code, CODE, READONLY
Reset_Handler
  ENTRY
  ; set up the filter parameters
  LDR r0,=source        ; point to the start of the area of memory holding inputs
  MOV r1,#num_words     ; get the number of input values
  MOV r2,#filter_length ; get the number of filter taps
  LDR r3,=dest          ; point to the start of the area of memory holding outputs

  ; find out how many times the filter needs to be applied
  SUBS r4,r1,r2   ; find the number of applications of the filter needed, less 1
  BMI exit        ; give up if there is insufficient data for any filtering

  ; apply the filter  
filter_loop
  LDMIA r0,{r5-r9}     ; get the next 5 data values to be filtered
  ADD r5,r5,r9         ; sum x[-2] with x[2]
  ADD r6,r6,r8         ; sum x[-1] with x[1]
  ADD r9,r5,r6         ; sum x[-2]+x[2] with x[-1]+x[1]
  ADD r7,r7,r9,LSR #1  ; sum x[0] with (x[-2]+x[2]+x[-1]+x[1])/2
  MOV r7,r7,LSR #2     ; form (x[0] + (x[-2]+x[-1]+x[1]+x[2])/2)/4
  STR r7,[r3],#4       ; save calculated filtered value, move to next output data item
  ADD r0,r0,#4         ; move to start of next 5 input data values
  SUBS r4,r4,#1        ; move on to next set of 5 inputs 
  BPL filter_loop      ; continue until last set of 5 inputs reached

  ; execute an endless loop once done 
exit    
  B exit

  AREA 2a_ROData, DATA, READONLY
source  ; some saw tooth data to filter - should blunt the sharp edges
  DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
  DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
  DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
  DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
end_source

  AREA 2a_RWData, DATA, READWRITE
dest  ; copy to this area of memory
  SPACE end_source-source
end_dest
  END

I expect there is a more efficient way to write this code, whether that reduces the overall code size or the number of cycles the loop takes to execute, as long as it does the same thing. Any help would be appreciated.

Recommended answer

For code size, try to use only registers r0..r7, which are the ones available to the short 16-bit encodings.

Also, the flag-setting version of an instruction often has a 16-bit encoding where the non-flag-setting version needs 32 bits, e.g.

  • adds r0, #4 is 16-bit vs. 32-bit add r0, #4
  • movs r7,r7,LSR #2 is 16-bit vs. 32-bit MOV r7,r7,LSR #2
  • movs r2,#filter_length is 16-bit vs. 32-bit MOV r2,#filter_length. (The 16-bit encoding only takes an 8-bit immediate, 0..255; larger values need a 32-bit encoding. The assembler in my build below still emitted a 32-bit movs.w for #88, so check the disassembly.)
  • stmia r3!, {r5} (with write-back) is 16-bit vs. 32-bit str r7, [r3], #4 with post-increment.

See the Thumb code-size section of my answer to your earlier question, "How do I reduce execution time and number of cycles for a factorial loop? And/or code-size?". Look at the disassembly of your code, find the 32-bit instructions, check why each one is 32-bit, and look for a way to make it 16-bit. This is super-basic Thumb optimization that you can always do.

r1 and r2 aren't even used inside your loop, and r4 = r1-r2 is an assemble-time constant that you're computing at run time with 3 instructions. That's obviously insane vs. movs r4, #num_words - filter_length.

If those are supposed to be inputs that aren't known at assemble time in your real code (maybe the same function is sometimes used on different inputs?), then reuse the registers that are "dead" once the loop counter has been calculated. It's kind of clunky that you accept the pointers in r0 and r3: you then have r2 and r4-r7 free if you use r1 for the loop counter, or r1-r2 and r5-r7 free if you use r4.

I chose to use r1 for the loop counter. This is the disassembly of my version (arm-none-eabi-gcc -g -c -mthumb -mcpu=cortex-m3 arm-filter.S && arm-none-eabi-objdump -drwC arm-filter.o):

@@ Saving code size without any other changes

00000000 <function>:
   0:   480a            ldr     r0, [pc, #40]   ; (2c <exit+0x4>)
   2:   f05f 0158       movs.w  r1, #88 ; 0x58
   6:   2205            movs    r2, #5
   8:   4b09            ldr     r3, [pc, #36]   ; (30 <exit+0x8>)
   a:   1a89            subs    r1, r1, r2
   c:   d40c            bmi.n   28 <exit>

0000000e <filter_loop>:
   e:   e890 00f4       ldmia.w r0, {r2, r4, r5, r6, r7}
  12:   443a            add     r2, r7
  14:   4434            add     r4, r6
  16:   4414            add     r4, r2
  18:   eb15 0554       adds.w  r5, r5, r4, lsr #1
  1c:   08ad            lsrs    r5, r5, #2
  1e:   c320            stmia   r3!, {r5}
  20:   3004            adds    r0, #4
  22:   3901            subs    r1, #1
  24:   d5f3            bpl.n   e <filter_loop>

00000026 <exit>:
  26:   e7fe            b.n     26 <exit>

Cortex-M3 doesn't have NEON, but there is data reuse between outputs. With unrolling, we can definitely reuse the load results, and some of the "inner" add results. Maybe use a sliding window: subtract the word that's no longer part of the total and add in the new one.
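
To make the reuse concrete, here is a rough C reference model rather than Cortex-M3 code. It is only a sketch of one possible pairing, not something taken from the disassembly above: the function name filter_reuse and the variables q0..q3, mid, prev and next are made up for illustration. Each input word is read once, and each pair sum x[k]+x[k+1] is formed once as the right-hand pair of one window and used again three outputs later as the left-hand pair. The arithmetic is otherwise the same as the original loop, so the outputs should match bit for bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model: writes n-4 outputs to y using the same
 * arithmetic as the original loop, (x[0] + (x[-2]+x[-1]+x[1]+x[2])/2) / 4,
 * but reads each input word only once and reuses each pair sum. */
void filter_reuse(const uint32_t *x, size_t n, uint32_t *y)
{
    if (n < 5)
        return;                      /* not enough data for a single window */

    uint32_t q0 = x[0] + x[1];       /* left pair of window i */
    uint32_t q1 = x[1] + x[2];       /* will be the left pair of window i+1 */
    uint32_t q2 = x[2] + x[3];       /* will be the left pair of window i+2 */
    uint32_t mid = x[2];             /* centre tap of window i */
    uint32_t prev = x[3];            /* most recently loaded element */

    for (size_t i = 0; i + 5 <= n; i++) {
        uint32_t next = x[i + 4];    /* the only load per output */
        uint32_t q3 = prev + next;   /* right pair of window i */

        y[i] = (mid + ((q0 + q3) >> 1)) >> 2;

        /* Slide everything along by one element. In unrolled assembly these
         * copies can be avoided by renaming registers between iterations. */
        q0 = q1;
        q1 = q2;
        q2 = q3;
        mid = prev;
        prev = next;
    }
}
```

Per output this needs one load and three additions, compared with five loads and four additions in the original loop. Whether that turns into fewer cycles on a real Cortex-M3 depends on how the register copies are eliminated, so treat it as a starting point to translate and measure.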

But with the middle element being "special", we have two 2-element windows on either side, unless we have enough spare bits at the top to add x[0] twice and then right-shift by 3 without overflowing. Then you don't even need to unroll: just load 1 element, adjust the sliding window, recalculate the middle, and store 1 element.
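
The "add x[0] twice, then shift by 3" form gives the same result as the original two-step shift as long as nothing overflows: with S the sum of the four outer elements, 2*x[0] + S halved is exactly x[0] + S/2 (rounded down), and halving then dividing by 4 with truncation is the same as dividing by 8 with truncation. Below is a minimal C reference model of that sliding-window form, assuming every input is below 2^29 so the 32-bit total cannot wrap. The name filter_sliding is made up for illustration; this is a sketch for checking results on a host machine, not Cortex-M3 code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of the sliding-window form.
 * Assumption: every input is below 2^29, so sum + x[i+2] cannot wrap. */
void filter_sliding(const uint32_t *x, size_t n, uint32_t *y)
{
    if (n < 5)
        return;                      /* not enough data for a single window */

    /* Running total of the current 5-element window. */
    uint32_t sum = x[0] + x[1] + x[2] + x[3] + x[4];

    for (size_t i = 0; ; i++) {
        /* Count the centre tap a second time, then divide by 8. */
        y[i] = (sum + x[i + 2]) >> 3;

        if (i + 5 >= n)
            break;                   /* that was the last full window */

        /* Slide: add the entering word, drop the leaving word. */
        sum += x[i + 5] - x[i];
    }
}
```

The model re-reads x[i] and x[i+2] from memory to keep it short; an assembly version could keep those values in registers instead. For the first window of the sawtooth data (0,10,20,30,40) both forms give 15.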

(My first version of this answer was based on a misunderstanding of the code. I might update it with a speed optimization later, but for now I'm editing to remove the wrong stuff.)
