Erratic cycle counts on ARM Cortex-M0

Problem description

[Lots of text incoming, because I want to describe my problem in as much detail as I can.]

I'm in the process of optimizing hand-written ARM assembly code for a Cortex-M0. The board I'm using is the STMicro STM32F0Discovery, which has an STM32F051R8 controller. The controller is running at 48 MHz.

Unfortunately, I'm getting some pretty strange cycle counts when doing optimizations.

For example, adding a single nop into a loop in my code should add 2 cycles in total (looped 2 times). However, doing so adds around 1800 extra cycles. Now, when I add in an extra nop (so 2 nops in total), the cycle count does increase by the expected 4 cycles.

I get similar strange results for the example piece of code below. The example code shows, for the top excerpt: c = 25 * a + 5 * b. The bottom excerpt is c = 5 * (5 * a + b). So, the bottom one should be faster, since it requires 1 less mov. However, changing this:

movs r4, #25
muls r3, r4, r3
add  r2, r3 

ldrb r3, [r6, #RoundStep]
movs r4, #5
muls r3, r4, r3

add  r2, r3

into this:

movs r4, #5
muls r3, r4, r3 

ldrb r5, [r6, #RoundStep]
add r3, r5
muls r3, r4, r3

add  r2, r3

does not increase the speed by the expected 1 cycle, instead, it decreases the speed by more or less 1000 cycles...
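
As a sanity check on that expectation, here is a minimal C sketch (not part of the original code) that confirms the two forms compute the same value and tallies nominal cycle counts for both excerpts. The per-instruction costs are assumptions based on the usual Cortex-M0 timing figures (1 cycle for movs and add, 2 for ldrb, 1 for muls with the single-cycle multiplier option); they ignore flash wait states and any other memory-system effects. All function and constant names are hypothetical.

```c
#include <stdint.h>

/* Top excerpt: c = 25 * a + 5 * b (32-bit wraparound, like muls/add). */
uint32_t form_expanded(uint32_t a, uint32_t b) {
    return 25u * a + 5u * b;
}

/* Bottom excerpt: c = 5 * (5 * a + b). */
uint32_t form_factored(uint32_t a, uint32_t b) {
    return 5u * (5u * a + b);
}

/* Assumed nominal costs in core cycles (illustrative constants only). */
enum { CYC_MOVS = 1, CYC_ADD = 1, CYC_MULS = 1, CYC_LDRB = 2 };

/* Top excerpt: movs, muls, add, ldrb, movs, muls, add. */
int cycles_expanded(void) {
    return CYC_MOVS + CYC_MULS + CYC_ADD + CYC_LDRB
         + CYC_MOVS + CYC_MULS + CYC_ADD;
}

/* Bottom excerpt: movs, muls, ldrb, add, muls, add. */
int cycles_factored(void) {
    return CYC_MOVS + CYC_MULS + CYC_LDRB + CYC_ADD
         + CYC_MULS + CYC_ADD;
}
```

Under these assumptions the tally is 8 versus 7 cycles, which matches the expected 1-cycle saving, so instruction counts alone do not account for a difference of around 1000 cycles.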

To count the cycles, I'm using the SysTick counter, counting down from its max value, and increasing an overflow counter on overflow interrupt. The code that I'm using for this is more or less the same as this excerpt from the ARM website, but rewritten for the Cortex-M0 that I'm using. My code is sufficiently fast that an overflow interrupt never happens during measurements.
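
The arithmetic behind this kind of measurement can be sketched in C as follows. This is an illustration, not the code from the question: it assumes the reload value is the 24-bit maximum (0xFFFFFF), that `start` and `end` are readings of the SysTick current-value register (`SysTick->VAL` in CMSIS), and that `overflows` is the number of wraps counted by the interrupt handler between the two readings. The function name is hypothetical.

```c
#include <stdint.h>

/* SysTick is a 24-bit down-counter. With the reload register at its
   maximum (0xFFFFFF), one full wrap spans 2^24 core clock cycles. */
#define SYSTICK_PERIOD 0x1000000ull

/* Elapsed core cycles between two readings of the down-counter.
   The counter counts down, so the raw difference is start - end;
   every wrap recorded in between contributes one full period. */
uint64_t elapsed_cycles(uint32_t start, uint32_t end, uint32_t overflows) {
    return (uint64_t)overflows * SYSTICK_PERIOD + start - end;
}
```

When no overflow occurs during the measurement, as in the question, the result reduces to `start - end`.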

Now, I was starting to think that the counter was giving me wrong values, so I also wrote some code for a TI Stellaris LaunchPad I had lying around. This is a Cortex-M4F running at 80 MHz. The code measures the number of cycles a certain pin is held high. Of course, the clock of the M0 and that of the M4F aren't running in sync, so the reported cycle counts vary a bit, which I "fix" by taking a very low weighted exponential average of the measured cycle counts (avg = 0.995 * avg + 0.005 * curCycles) and repeating the measurement 10000 times.
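
The smoothing step described above can be sketched in C like this. It is a minimal illustration of the stated formula, assuming the raw measurements are already available as an array; the function name and the `initial` seed parameter are hypothetical and not taken from the original code.

```c
#include <stddef.h>

/* Exponentially weighted average with the weights given in the text:
   avg = 0.995 * avg + 0.005 * curCycles, applied once per sample. */
double smooth_cycles(const double *samples, size_t n, double initial) {
    double avg = initial;
    for (size_t i = 0; i < n; ++i) {
        avg = 0.995 * avg + 0.005 * samples[i];
    }
    return avg;
}
```

Because each sample carries a weight of only 0.005, jitter from the unsynchronized clocks is heavily damped. The influence of the seed value decays as 0.995^n, so after 10000 samples it is negligible.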

The time measured by the M4F is the same as measured by the M0, so "unfortunately" it seems the SysTick counter is working just fine in the M0.

At first I thought these extra delays were caused by pipeline stalls, but on one hand the M0 seems to be too simple for that, and on the other I can't find any detailed info on the M0's pipeline, so I can't verify.

So, my question is: what is going on here? Why does adding a single nop make my function take an extra 1000 cycles/loop, while adding two nops only increases the cycle count by 2? How come removing instructions makes my code execute slower?

Recommended answer

The mul instruction can take multiple cycles in the ALU pipe. Your transformation of c = 25 * a + 5 * b into c = 5 * (5 * a + b) requires one less mov. However, the load-store stage of the pipeline overlaps with the ALU. These are often separate stages, and with a ldrb instruction you can get mov instructions for free. Also, depending on the values, the muls may execute faster; specifically, top bytes being zero often result in a shorter multiply cycle. There are far fewer data dependencies in the first version; instruction n has no registers in common with n+1. This is a basic requirement to allow pipelining.

Compare,

ldrb r5, [r6, #RoundStep]  ; 2 cycles
add r3, r5                 ; must block for r5 to load (1 cycle)

ldrb r3, [r6, #RoundStep]  ; 2 cycles
movs r4, #5                ; may run in parallel with above.

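So even though you may add up the instruction counts and end up with less code, a larger alternative may turn out to run faster due to pipelining or instruction scheduling.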

The 2nd version may be faster if you can relocate the ldrb towards the beginning of the routine.
