Erratic cycle counts on ARM Cortex-M0

Question

[Lots of text incoming, since I want to describe my problem in as much detail as possible.]

I'm in the process of optimizing hand-written ARM assembly code for a Cortex-M0. The board I'm using is the STMicro STM32F0Discovery, which has an STM32F051R8 controller. The controller is running at 48 MHz.

Unfortunately, I'm getting some pretty strange cycle counts when doing optimizations.

For example, adding a single nop to a loop in my code should add 2 cycles in total (the loop runs 2 times). However, doing so adds around 1800 extra cycles. When I add a second nop (so 2 nops in total), the cycle count does increase by the expected 4 cycles.

I get similarly strange results for the example code below. The top excerpt computes c = 25 * a + 5 * b. The bottom excerpt computes c = 5 * (5 * a + b). So the bottom one should be faster, since it needs one fewer mov. However, changing this:

movs r4, #25
muls r3, r4, r3
add  r2, r3 

ldrb r3, [r6, #RoundStep]
movs r4, #5
muls r3, r4, r3

add  r2, r3

into this:

movs r4, #5
muls r3, r4, r3 

ldrb r5, [r6, #RoundStep]
add r3, r5
muls r3, r4, r3

add  r2, r3

does not increase the speed by the expected 1 cycle; instead, it slows the code down by roughly 1000 cycles...
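As a sanity check on the arithmetic and on the expected one-cycle saving, here is a small C sketch. It assumes both operands fit in a byte, and it uses the nominal Cortex-M0 timings (1 cycle for movs, add, and muls with the single-cycle multiplier option, 2 cycles for ldrb) while ignoring flash wait states. The function names and the cost table are hypothetical, for illustration only; they are not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Top excerpt: c = 25 * a + 5 * b */
uint32_t top_form(uint8_t a, uint8_t b) { return 25u * a + 5u * b; }

/* Bottom excerpt: c = 5 * (5 * a + b) */
uint32_t bottom_form(uint8_t a, uint8_t b) { return 5u * (5u * a + b); }

/* Nominal per-instruction costs. ASSUMPTION: single-cycle multiplier,
 * zero flash wait states, no other stalls. */
enum { CYC_MOVS = 1, CYC_MULS = 1, CYC_ADD = 1, CYC_LDRB = 2 };

/* movs, muls, add, ldrb, movs, muls, add */
int top_cycles(void)
{
    return CYC_MOVS + CYC_MULS + CYC_ADD + CYC_LDRB
         + CYC_MOVS + CYC_MULS + CYC_ADD;
}

/* movs, muls, ldrb, add, muls, add */
int bottom_cycles(void)
{
    return CYC_MOVS + CYC_MULS + CYC_LDRB + CYC_ADD
         + CYC_MULS + CYC_ADD;
}

/* Aborts via assert if the two forms ever disagree for byte operands. */
void check_forms_agree(void)
{
    for (unsigned a = 0; a < 256; ++a)
        for (unsigned b = 0; b < 256; ++b)
            assert(top_form((uint8_t)a, (uint8_t)b) ==
                   bottom_form((uint8_t)a, (uint8_t)b));
}
```

Under these assumptions the nominal tally differs by exactly one cycle between the two excerpts, which is where the expected 1-cycle gain comes from. The model says nothing about the roughly 1000-cycle difference that was actually measured.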

To count the cycles, I'm using the SysTick counter, counting down from its maximum value and incrementing an overflow counter in the overflow interrupt. The code I'm using for this is more or less the same as this excerpt from the ARM website, but rewritten for the Cortex-M0 that I'm using. My code is fast enough that an overflow interrupt never happens during a measurement.
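The bookkeeping behind that kind of measurement can be sketched in plain C. This is only the arithmetic, not the actual measurement code: it assumes a 24-bit down-counter reloaded with its maximum value, and the names `SYSTICK_RELOAD` and `elapsed_cycles` are hypothetical. It does not touch any hardware register, so it can be checked on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: the counter is reloaded with the 24-bit maximum. */
#define SYSTICK_RELOAD 0x00FFFFFFu

/* Cycles elapsed between two readings of a down-counter.
 * start     : counter value read before the measured code
 * end       : counter value read after the measured code
 * overflows : number of wrap-arounds (overflow interrupts) in between */
uint64_t elapsed_cycles(uint32_t start, uint32_t end, uint32_t overflows)
{
    assert(start <= SYSTICK_RELOAD && end <= SYSTICK_RELOAD);
    /* Without a wrap-around, a down-counter can only have decreased. */
    assert(overflows > 0 || start >= end);

    /* One full period is RELOAD + 1 ticks (RELOAD, ..., 1, 0). */
    const uint64_t period = (uint64_t)SYSTICK_RELOAD + 1u;
    return (uint64_t)overflows * period + start - end;
}
```

On the target, `start` and `end` would be two reads of the SysTick current-value register. The sketch does not subtract the cost of the reads themselves, so a real measurement would still include a small constant overhead.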

Now, I was starting to think that the counter was giving me wrong values, so I also wrote some code for a TI Stellaris LaunchPad I had lying around. This is a Cortex-M4F running at 80 MHz. The code measures the number of cycles a certain pin is held high. Of course, the clock of the M0 and that of the M4F aren't running in sync, so the reported cycle counts vary a bit, which I "fix" by taking an exponential moving average of the measured cycle counts with a very low weight on each new sample (avg = 0.995 * avg + 0.005 * curCycles) and repeating the measurement 10000 times.
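The averaging step can be written as a couple of small C functions. This is a sketch of the formula quoted above, not the code that ran on the LaunchPad; the function names are hypothetical, and seeding the average with the first sample is an assumption, since the question does not say how the average was initialised.

```c
#include <assert.h>

/* One step of the exponential moving average:
 * avg = (1 - alpha) * avg + alpha * sample
 * With alpha = 0.005 this is avg = 0.995 * avg + 0.005 * curCycles. */
double ema_update(double avg, double sample, double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    return (1.0 - alpha) * avg + alpha * sample;
}

/* Runs the average over n samples.
 * ASSUMPTION: the average is seeded with the first sample. */
double ema_run(const double *samples, int n, double alpha)
{
    assert(n > 0);
    double avg = samples[0];
    for (int i = 1; i < n; ++i)
        avg = ema_update(avg, samples[i], alpha);
    return avg;
}
```

With alpha = 0.005 and 10000 repetitions, the weight left on the starting value is 0.995^10000, which is vanishingly small, so the result is dominated by the measurements rather than by the seed.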

The time measured by the M4F is the same as that measured by the M0, so "unfortunately" it seems the SysTick counter is working just fine on the M0.

At first I thought these extra delays were caused by pipeline stalls, but on the one hand the M0 seems too simple for that, and on the other I can't find any detailed information on the M0's pipeline, so I can't verify it.

So, my question is: what is going on here? Why does adding a single nop make my function take an extra 1000 cycles/loop, while two nops only increase the cycle count by 2? How come removing instructions makes my code execute slower?

Recommended answer

The mul instruction can take multiple cycles in the ALU pipe. Your transformation of c = 25 * a + 5 * b into c = 5 * (5 * a + b) requires one fewer mov. However, the load-store stage of the pipeline overlaps with the ALU. These are often separate stages, and with a ldrb instruction you can get mov instructions for free. Also, depending on the values, the muls may execute faster; specifically, the top bytes being zero often results in a shorter multiply cycle. There are far fewer data dependencies in the first version: the instruction after the ldrb has no registers in common with it (instruction n shares no registers with instruction n+1). This is a basic requirement to allow pipelining.

Compare,

ldrb r5, [r6, #RoundStep]  ; 2 cycles
add r3, r5                 ; must block for r5 to load (1 cycle)

with,

ldrb r3, [r6, #RoundStep]  ; 2 cycles
movs r4, #5                ; may run in parallel with above.

So even though you may add up the instruction count and have less code, it can turn out that a larger alternative runs faster due to pipelining or instruction scheduling.

The 2nd version may be faster if you can relocate the ldrb towards the beginning of the routine.
