How do I reduce execution time and number of cycles for a factorial loop? And/or code-size?


Question

Basically I'm having a hard time getting the execution time any lower than it is, as well as reducing the amount of clock cycles and memory size. Does anyone have any idea on how I can do this? The code works fine I just want to change it a bit.

I wrote working code, but I don't want to mess it up, and I don't know what changes to make.

; Calculation of a factorial value using a simple loop

; set up the exception addresses
THUMB
AREA RESET, CODE, READONLY
EXPORT  __Vectors
EXPORT Reset_Handler
__Vectors 
DCD 0x00180000     ; top of the stack 
DCD Reset_Handler  ; reset vector - where the program starts

AREA 2a_Code, CODE, READONLY
Reset_Handler
ENTRY
start   
MOV r1,#0    ; count the number of multiplications performed 
MOV r2,#3    ; the final value in the factorial calculation
MOV r3,#1    ; the factorial result will be stored here

; loop r2 times forming the product  
fact
ADD r1,r1,#1  ; find the next multiplicand
MUL r3,r1,r3  ; form the next product - note that MUL r3,r3,r1 gives unpredictable output
CMP r1,r2     ; check if the final value has been reached
BMI fact      ; continue if all products have not been formed

exit    ; stay in an endless loop 
B exit
END

The current results are: Memory Size: 0x00000024, Clock Cycles: 22, Total Execution Time: 1.1 microseconds.

We are working with the Cortex M3

I just need any of these to be reduced, the changes to the code can be minor as long as it produces different results.

Solution

Often code-size and performance are a tradeoff. Unrolling a loop often helps performance (for large inputs at least), but requires extra logic outside the loop to handle the cleanup and so on.


Most of this answer was assuming a higher-performance CPU like Cortex-A9 or Cortex-A53 where software pipelining to create instruction-level parallelism would be helpful. Cortex M3 is scalar and has a single-cycle multiply instruction, making it much simpler to optimize for.

(The original question didn't specify a core, and I was expecting that even low-end CPUs would have multi-cycle mul latency. I only found Cortex-M3 numbers after writing it.)

Your code will probably bottleneck on the latency of integer multiply. Unlike add, where the result will be ready the next cycle, mul is complex and takes multiple cycles to produce a result.

(Except on some very slowly-clocked chips, like apparently Cortex-M3 has a 1-cycle mul instruction. But Cortex-M0/M0+/M23 are available with a choice of 1 cycle or 32 cycle performance for that instruction! Slow iterative = smaller silicon.)


The multiply execution unit itself is often pipelined so multiple independent multiplies can be in flight at once, but your factorial loop needs each multiply result as an input to the next iteration. (Only for higher-performance cores, not Cortex-M series. The 32-cycle multiply on slow Cortex-M chips is iterative and presumably not pipelined, so another multiply couldn't start while it's running, and there'd be no benefit to exposing any instruction-level parallelism beyond reducing loop overhead.)

Notice that multiplication is associative: 1 * 2 * 3 = 3 * 2 * 1, so we can count down from n, as @ensc's answer points out. Or (1*2) * (3*4) = 1*2*3*4.

We could instead do 1 * 2 * ... * (n/2) in parallel with n/2+1 * n/2+2 * n/2+3 * ... * n, interleaving work on those two dependency chains. Or we could interleave 1 * 3 * 5 * ... * n with 2 * 4 * 6 * ... n-1, in a loop that did n -= 2 and calculates n+1 from that. (Then at the end, you multiply those 2 products).

This is obviously going to require more code-size, but could help performance a lot.
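As a concrete (if hedged) illustration, here's the odd/even interleave sketched in Python rather than asm; the function name and structure are invented for illustration, but each loop iteration feeds two independent multiply chains:

```python
import math

def fact_two_chains(n):
    # Two independent products: hi covers n, n-2, ... and lo covers n-1, n-3, ...
    # On a CPU with a pipelined multiplier, the two chains' multiplies can
    # overlap because neither depends on the other's previous result.
    hiprod, loprod = 1, 1
    hi, lo = n, n - 1
    while hi > 1:
        hiprod *= hi   # chain 1
        loprod *= lo   # chain 2 (lo is always >= 1 inside the loop, so *= 1 at worst)
        hi -= 2
        lo -= 2
    return hiprod * loprod  # combine the two chains at the end

assert all(fact_two_chains(n) == math.factorial(n) for n in range(1, 16))
```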


Of course, a lookup table is another workaround. If you only care about inputs that don't overflow a 32-bit result, that's a pretty small table. But that has a significant size cost.
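For example (a sketch of the idea, not code from the answer): only 13 factorials fit in 32 bits, since 13! = 6,227,020,800 already overflows, so the whole table is 13 words:

```python
# Build the table of every factorial representable in an unsigned 32-bit word.
FACT32 = [1]                       # 0! = 1
while FACT32[-1] * len(FACT32) < 2**32:
    FACT32.append(FACT32[-1] * len(FACT32))

def fact_lut(n):
    return FACT32[n]               # valid only for 0 <= n <= 12

assert len(FACT32) == 13           # 0! through 12!  (12! = 479001600)
```

On a Cortex-M part that table would cost 52 bytes of flash plus an indexed load, trading size for a constant-time lookup.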


Even on an in-order CPU (where instruction execution has to start in program order), long-running instructions like cache-miss loads, or multiplies, may be allowed to complete out of order, so e.g. some add instructions could run after starting a mul but before the mul result was written back. Or even starting another independent mul instruction in the shadow of an earlier mul's latency.

I googled some ARM performance numbers to maybe get a feel for what's typical.

For example, Cortex-A9 is an older fairly common high-end ARMv7 CPU that is superscalar (multiple instructions per cycle) with out-of-order execution.

mul "takes" 2 cycles, and has 4 cycle result latency. They don't explain what they mean by the non-latency cost. Perhaps that's the reciprocal throughput of the execution unit, like how often you can start a new independent operation. It's an out-of-order CPU so it doesn't make sense for it to stall other instructions for 2 cycles. In the NEON SIMD instruction section, they explain what looks like the same "cycles" number:

This is the number of issue cycles the particular instruction consumes, and is the absolute minimum number of cycles per instruction if no operand interlocks are present.

(operand interlocks = waiting for an input operand to be ready, if an earlier instruction hasn't produced a result yet).

(Cortex-A9 does support packed integer multiplication, so for large factorials you could look at doing 4 multiplies in parallel starting one vector per 4 cycles, using vmul.32 q1, q1, q2. Or 2 per 2 cycles with 64-bit d registers, but then you'd need more vadd instructions and unlike multiply, vadd.32 is just as fast with 128-bit q regs as with 64-bit vectors. So SIMD can give you twice the multiply throughput of scalar on Cortex-A9, if you use enough registers to hide the large latency. But SIMD would probably only be useful with n so large that n! overflows a 32-bit integer, so you get a result modulo 2^32.)
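A scalar Python stand-in for that idea (not NEON code; the lane count and names are illustrative): four independent lanes, each wrapping modulo 2^32 like a 32-bit vector element, combined at the end:

```python
def fact_lanes(n, lanes=4):
    # Each pass multiplies `lanes` consecutive values into independent
    # accumulators, the way one vmul.32 on a q register does 4 lanes at once.
    MASK = 0xFFFFFFFF
    acc = [1] * lanes
    k = n
    while k >= lanes:
        for j in range(lanes):
            acc[j] = (acc[j] * (k - j)) & MASK  # per-lane 32-bit wraparound
        k -= lanes
    result = 1
    for v in acc:                  # horizontal reduction of the lanes
        result = (result * v) & MASK
    while k > 1:                   # scalar cleanup for the leftover factors
        result = (result * k) & MASK
        k -= 1
    return result
```

Because multiplication is compatible with reduction mod 2^32, the wrapped result still matches n! modulo 2^32 once n! overflows, e.g. fact_lanes(13) equals math.factorial(13) % 2**32.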


Lower latency ARM multiply instructions:

mul is a 32x32 => 32-bit multiply. On Cortex-A9, it has 2c throughput and 4c latency.

(muls is a 16-bit instruction in thumb mode, and should be preferred unless you need to not clobber the flags. mul in Thumb mode is only available in ARMv6T2 and later.)

smulbb is a 16x16 => 32-bit signed multiply that only reads the low half of its inputs, but has 1c throughput and 3c latency on A9. (BB = bottom, bottom. The other combinations are also available, along with multiply-accumulate and various funky things.)

There is no 2-byte Thumb version of smulxy, so this is worse for code-size than muls.

Unfortunately smulxy isn't available in an unsigned version, so that limits the range of inputs we can use it with to positive int16_t, not uint16_t.

But if we only care about the case where the final 32-bit result doesn't overflow, we can arrange our order of operations so the last multiply has 2 inputs of similar magnitude (both large-ish 16-bit numbers). i.e. as close to sqrt(n!) as possible. So e.g. the product of odds and evens would be reasonable, but (n-1)! * n would be the worst case because that would require (n-1)! to fit in 16 bits. Actually the worst case would be counting down from n so the last one is a multiply by 3 then 2. We could special case the multiply by 2 to a left shift...


Putting these pieces together, notice that multiplying by 1 is a no-op (except with smulbb where it truncates the input to 16 bit). So we can unroll in a way that stops after a multiply by 1 or 2 depending on the input being odd or even.

So instead of knowing which is odd and which is even, we just have lo (starting with n-1) and hi (starting with n).

;; UNTESTED, but it does assemble with the GNU assembler, after sed -i 's/;/@/' arm-fact.S
;; and replacing THUMB with
; .thumb
; .syntax unified
THUMB

;; Input: n in r0.   (n is signed positive, otherwise we return n.)
;; Output: n! in r0.
;; clobbers: r1, r2, r3
;; pre-conditions: n! < 2^31.  Or maybe slightly lower.
fact:
    subs   r3, r0, #3   ; r3 = lo = n-3  (first multiplier for loprod)
    bls   .Ltiny_input
    subs   r2, r0, #2   ; r2 = hi = n-2  (first multiplier for hiprod)
    subs   r1, r0, #1   ; r1 = loprod = n-1
                        ; r0 = hiprod = n

.Lloop:                 ; do {
    smulbb  r0,r0, r2      ; hiprod *= hi
    subs    r2, #2         ; hi -= 2 for next iter
    smulbb  r1,r1, r3
    subs    r3, #2         ; lo -= 2 for next iter
    bgt     .Lloop       ; while((lo-=2) > 0);  signed condition
    ; r3 = 0 or -1, r2 = 1 or 0.  The last multiplies were:
    ;       hiprod *= 2 and loprod *= 1  for even n
    ;   or  hiprod *= 3 and loprod *= 2  for odd n

    ; muls  r0, r1
    smulbb  r0,r0, r1      ; return  hiprod *= loprod

    bx lr    ; or inline this

.Ltiny_input:   ; alternate return path for tiny inputs
    ; r0 = n.   flags still set from  n - 3
    IT eq                  ; GAS insists on explicit IT for thumb mode
    moveq   r0, #6         ; 3! = 6, else n! = n for smaller n=1 or 2.
                           ; 0! = 1 case is not handled, nor are negative inputs
    bx lr

(.L in a label name makes it a local label that doesn't show up in the object file, at least in GAS syntax. Maybe not in ARMASM, if you're using that assembler.)

ARM assembly lets you leave out the destination when it's the same as the first source, for some instructions like subs but not smulbb. You could write it out like subs r2, r2, #2 every time if you want.

You might use muls r0, r1 for the final product, because the final hiprod is a bit higher than loprod. The product might not overflow even if hiprod > max int16_t. That would save 2 bytes of code-size, too, but add 1 cycle of latency on Cortex-A9. (BTW, ARMv6 fixed the "unpredictable result" with mul d,d, src weirdness, and your code used 32-bit Thumb2 instructions, thus it only works on ARMv6T2 and above anyway.)
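As a sanity check on the scheme and its precondition (the asm above is untested), here's a hedged Python simulation of smulbb's sign-extended low-half semantics; the helper names are invented:

```python
def smulbb(a, b):
    # smulbb multiplies the sign-extended low 16 bits of each operand
    def lo16(x):
        x &= 0xFFFF
        return x - 0x10000 if x & 0x8000 else x
    return lo16(a) * lo16(b)

def fact_hi_lo(n):
    # direct transcription of the hi/lo loop above, for n >= 4
    hiprod, loprod = n, n - 1
    hi, lo = n - 2, n - 3
    while lo > 0:
        hiprod = smulbb(hiprod, hi); hi -= 2
        loprod = smulbb(loprod, lo); lo -= 2
    return smulbb(hiprod, loprod)   # the muls variant would avoid truncating here
```

Checking n = 4..11 against the true factorial passes, but n = 12 fails: the final hiprod is 46080, which no longer fits in 16 bits, so the last smulbb truncates it. Switching that final multiply to muls, as suggested above, extends correct results to n = 12.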


With 2 accumulators for the products, this can possibly run at 2 multiplies per 3 cycles on Cortex-A9, depending greatly on the CPU micro-architecture and whether its front-end can keep up. On an in-order ARM, I'd be worried about it being able to start other instructions before a multiply finished.

It might be better to spend 2 extra bytes on sub instead of subs so we can compute the flags a couple instructions ahead of the branch, maybe reducing branch mispredict penalty and avoiding stalls on in-order CPUs. smulbb doesn't touch flags, so we can do loprod first and have the hi stuff not touch flags.

.loop:                  ; do {
    smulbb  r1, r3       ; loprod *= lo
    subs    r3, #2       ; lo -= 2 for next iter, and set flags
    smulbb  r0, r2       ; hiprod *= hi
    sub     r2, #2       ; hi -= 2 for next iter (no flags)
    bgt     .loop       ; while((lo-=2) > 0);

Note that we're modifying r3 and r2 right after smulbb reads them, avoiding creating a stall for the data dependency on in-order chips.


You're using Thumb mode and optimizing for code-size, so it's important to know which forms of which instructions can use a 2-byte / 16-bit encoding and which are only available as 32-bit Thumb2 encodings.

subs Rd, Rn, #imm can be encoded as a 16-bit Thumb instruction for imm=0..7 (3-bit immediate). Or with the same register as src and destination, for imm=0..255. So my copy-and-sub instructions are compact.

Non-flag-setting sub can't be a 16-bit instruction except inside an IT block, or with SP as the operand.

Predicated instructions in Thumb mode, like moveq r0, #6, require the assembler to use an IT instruction to introduce predication for the next up-to-4 instructions. In ARM mode, the top 4 bits of every instruction signals predication. (If you don't use a suffix, the assembler encodes it as ALways, i.e. not predicated.)

We could handle the n==0 case with another 4 or 6 bytes, with cmp r0,#0 / moveq r0, #1. Maybe getting it down to 4 bytes if we put the tst / mov inside the same IT block. IT doesn't snapshot the actual flag condition, it snapshots which predicate, so flag-setting instructions inside an IT block can have an effect on later instructions in the same block. (I think this is right, but I'm not 100% sure).

tiny_input:    ; r0 = n,  flags set according to n-3
    ITET EQ
    moveq  r0, #6
    cmpne  r0, #0
    moveq  r0, #1

Or there's 16-bit cbnz to conditionally jump over a mov r0, #1. But the branch target must be from 4 to 130 bytes after the cbnz, so we can't jump over just a single 16-bit instruction, apparently!


Code-size for my version:

$ arm-none-eabi-gcc -g -c -mcpu=cortex-a9 arm-fact.S
$ arm-none-eabi-objdump -drwC arm-fact.o 

arm-fact.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <fact>:
   0:   1ec3            subs    r3, r0, #3
   2:   d90b            bls.n   1c <.tiny_input>
   4:   1e82            subs    r2, r0, #2
   6:   1e41            subs    r1, r0, #1

00000008 <.loop>:
   8:   fb10 f002       smulbb  r0, r0, r2
   c:   3a02            subs    r2, #2
   e:   fb11 f103       smulbb  r1, r1, r3
  12:   3b02            subs    r3, #2
  14:   dcf8            bgt.n   8 <.loop>
  16:   fb10 f001       smulbb  r0, r0, r1
  1a:   4770            bx      lr

0000001c <.tiny_input>:
  1c:   bf08            it      eq
  1e:   2006            moveq   r0, #6
  20:   4770            bx      lr

So it's 0x22 bytes for this function. (Or 0x26 if we want to handle 0! = 1.)

It's larger than your version (your byte count includes some constants in memory, and the mov instructions to produce input), but in theory it's maybe better than twice as fast for large input, on CPUs with pipelined multipliers. And maybe much faster for inputs from 1 to 3, where it just branches once and produces the result.


You probably don't have anything like a Cortex-A9, because your 1.1 microseconds = 22 clock cycles means a 20MHz clock speed, while Cortex-A9 was available in 0.8 to 2GHz.

So maybe you have a much simpler in-order core like Cortex-M3? M3 does support the mul instruction, and Thumb2 mode. And Wikipedia says its multiply is 1 cycle! So that's weird; I'm surprised it has that efficient a multiplier. Or maybe it just clocks so slowly that there's time for a lot of gate delays in 1 stage, and it's only a 3-stage pipeline.


Cortex-M3 version:

subs and muls are single-cycle on Cortex-M3. I haven't found perf numbers on branches, but they're common so I'm assuming it's probably 1 cycle and doesn't cause a big fetch bubble (if correctly predicted...). The Cortex-M3 HTML manual has a section on Branch target forwarding which appears to be about reducing the fetch bubble.

Its instruction timing table shows b<cond> costs 1 cycle for not-taken, or 2 cycles for taken. (1 for the branch, 1 for the pipeline reload after an immediate displacement.). So taken branches are slow compared to sub/mul and unrolling would be valuable, so my code above should still work well. (But multiple product accumulators are not necessary, so it can be simplified).

Optimizing for code-size:

;; UNTESTED
THUMB

;; Input: n in r0.   (n is signed positive, otherwise we return n.)
;; Output: n! in r0.
;; clobbers: r1
fact:
    subs   r1, r0, #1     ; i = n-1
    bls   .Ltiny_input    ; jump if n<=1

.Lloop:                 ; do {
    muls    r0, r1         ; prod *= i
    subs    r1, #1         ; --i
    bgt     .Lloop      ; while(--i > 0);  signed condition
    ; r1 = 0, r0 = n! 
    ; last multiply was a redundant prod *= 1 but avoiding that would take a cmp
.Ltiny_input:   ; alternate return path for tiny inputs
    ; 0! = 1 case is not handled, nor are negative inputs


    bx lr    ; or inline this

I think that's the smallest we can manage. The loop has 3 instructions, and probably costs 4 cycles per iteration (1 + 1 + 2, the taken branch costing 2 cycles).

00000000 <fact>:
   0:   1e41            subs    r1, r0, #1
   2:   d902            bls.n   a <fact+0xa>
   4:   4348            muls    r0, r1
   6:   3901            subs    r1, #1
   8:   dcfc            bgt.n   4 <fact+0x4>
   a:   4770            bx      lr           # don't count this if inlining

So this is 0xa = 10 bytes, not counting the bx lr return instruction.
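As a cross-check of the loop logic (the asm above is untested), here's a direct Python transcription, plus a rough cycle model using the 1-cycle sub/mul and 2-cycle taken-branch numbers assumed above:

```python
def fact_m3(n):
    # mirrors: subs r1, r0, #1 / bls / { muls r0, r1 ; subs r1, #1 ; bgt }
    i = n - 1
    if i <= 0:
        return n               # n <= 1 returns n itself (0! unhandled: 0 -> 0)
    prod = n
    while True:
        prod *= i              # muls r0, r1  (the last pass is a redundant *= 1)
        i -= 1                 # subs r1, #1
        if i <= 0:             # bgt falls through once i reaches 0
            return prod

def fact_m3_cycles(n):
    # rough estimate only: subs+bls prologue, then 4 cycles per taken-branch
    # iteration and 3 for the final fall-through one; ignores fetch effects
    iters = max(n - 1, 0)
    if iters == 0:
        return 3               # subs + taken bls
    return 2 + 4 * (iters - 1) + 3
```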

We could handle the 0! = 1 case with an IT block after the first subs, before the branch, so we can still jump to right after the loop (instead of to a separate block like my Cortex-A9 version). The Cortex-A9 version could use this trick, too.

    subs   r1, r0, #1     ; i = n-1
    it lt
    movlt  r0, #1         ; n = 1 for  n<1
    bls   .Ltiny_input    ; return n if n was <=1

If we needed more range for the branch, we could use itt ls / movls r0, #1, so the branch was inside the IT block (where branch instructions can use an encoding that spends more bits on displacement and none on the predicate). But it's a short range in this case, so I chose to leave r0 unmodified in the r0 == 1 case. I don't know if there are any CPUs where it's more efficient or lower latency for a predicated instruction to be a NOP instead of running, but there might be.


Without unrolling, putting a cmp in the loop to avoid the last *=1 iteration would cost us an extra cycle per iteration (4 cycles instead of 3), so it would only pay for itself at n=2 or maybe n=3.

Unrolling could help speed significantly for larger inputs, going from 1 mul per 3 cycles to asymptotically approaching 1 mul per 2 cycles (sub + mul + amortized loop overhead). I can't see any way to avoid an instruction like sub or mov to generate a separate input for each mul, except by hard-coding special case sequences for each n (like *2 * 4 = *8 = left shift by 3) when you could instead just hard-code the answer.
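A hedged sketch of that 2-way unroll (Python, illustrative only): two multiplies per loop trip, with the second input formed from the first by a subtract, and the redundant *= 1 standing in for explicit cleanup:

```python
import math

def fact_m3_unroll2(n):
    # Two muls per loop trip; `i - 1` models the extra sub that feeds the
    # second multiply. One i -= 2 per two multiplies halves the loop overhead.
    prod = n
    i = n - 1
    while i > 1:
        prod *= i        # muls
        prod *= i - 1    # sub then muls
        i -= 2
    return prod          # valid for n >= 1 (a *= 1 absorbs the odd/even cleanup)

assert all(fact_m3_unroll2(n) == math.factorial(n) for n in range(1, 10))
```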
