INC instruction vs ADD 1: Does it matter?


Question


From Ira Baxter's answer on Why do the INC and DEC instructions not affect the Carry Flag (CF)?

Mostly, I stay away from INC and DEC now, because they do partial condition code updates, and this can cause funny stalls in the pipeline, and ADD/SUB don't. So where it doesn't matter (most places), I use ADD/SUB to avoid the stalls. I use INC/DEC only when keeping the code small matters, e.g., fitting in a cache line where the size of one or two instructions makes enough difference to matter. This is probably pointless nano[literally!]-optimization, but I'm pretty old-school in my coding habits.

And I would like to ask why it can cause stalls in the pipeline while add doesn't. After all, both ADD and INC update the flags register. The only difference is that INC doesn't update CF. But why does that matter?

Solution

(Update TODO: check on how the Efficiency cores on Alder Lake (Gracemont) run inc reg. If it hurts throughput there, might be worth spending the extra code byte. They're derived from Silvermont, but much wider, although the wide back-end might mean the extra uop doesn't hurt even if there is one. With its L1i marking instruction boundaries, saving code-size might not be as big a deal for it.)


TL:DR/advice for modern CPUs: Use inc except with a memory destination. In code you're tuning to run on mainstream Intel or any AMD, inc register is fine. (e.g. like gcc -mtune=core2, -mtune=haswell, or -mtune=znver1). inc mem costs an extra uop vs. add on Intel P6 / SnB-family; the load can't micro-fuse.

If you care about Silvermont-family (including KNL in Xeon Phi, and some netbooks, chromebooks, and NAS servers), probably avoid inc. add 1 only costs 1 extra byte in 64-bit code, or 2 in 32-bit code. But it's not a performance disaster (just locally 1 extra ALU port used, not creating false dependencies or big stalls), so if you don't care much about SMont then don't worry about it.

Writing CF instead of leaving it unmodified can potentially be useful with other surrounding code that might benefit from CF dep-breaking, e.g. shifts. See below.

If you want to inc/dec without touching any flags, lea eax, [rax+1] runs efficiently and has the same code-size as add eax, 1. (Usually on fewer possible execution ports than add/inc, though, so add/inc are better when destroying FLAGS is not a problem. https://agner.org/optimize/)
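As a quick illustration (a hand-written sketch, not from the original answer; the label is hypothetical), lea can do the increment in the shadow of a compare without disturbing the flags the branch will read:

    cmp   edx, esi          ; sets flags for the branch below
    lea   eax, [rax+1]      ; eax += 1 without writing any flags
    jne   retry             ; still branches on the cmp result

An add eax, 1 in place of the lea would overwrite the cmp result before jne could read it.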


On modern CPUs, add is never slower than inc (except for indirect code-size / decode effects), but usually it's not faster either, so you should prefer inc for code-size reasons. Especially if this choice is repeated many times in the same binary (e.g. if you are a compiler-writer).

inc saves 1 byte (64-bit mode), or 2 bytes (opcodes 0x40..F inc r32/dec r32 short form in 32-bit mode, re-purposed as the REX prefix for x86-64). This makes a small percentage difference in total code size. This helps instruction-cache hit rates, iTLB hit rate, and number of pages that have to be loaded from disk.
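Concretely, the encodings look like this (a sketch; bytes per the standard x86 opcode map):

    ; 32-bit mode
    40          inc eax         ; 1 byte: short-form opcode 0x40
    83 C0 01    add eax, 1      ; 3 bytes
    ; 64-bit mode (0x40..0x4F are REX prefixes, so the short form is gone)
    FF C0       inc eax         ; 2 bytes
    83 C0 01    add eax, 1      ; 3 bytes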

Advantages of inc:

  • code-size directly
  • Not using an immediate can have uop-cache effects on Sandybridge-family, which could offset the better micro-fusion of add. (See Agner Fog's table 9.1 in the Sandybridge section of his microarch guide.) Perf counters can easily measure issue-stage uops, but it's harder to measure how things pack into the uop cache and uop-cache read bandwidth effects.
  • Leaving CF unmodified is an advantage in some cases, on CPUs where you can read CF after inc without a stall. (Not on Nehalem and earlier.)

There is one exception among modern CPUs: Silvermont/Goldmont/Knight's Landing decodes inc/dec efficiently as 1 uop, but expands to 2 in the allocate/rename (aka issue) stage. The extra uop merges partial flags. inc throughput is only 1 per clock, vs. 0.5c (or 0.33c Goldmont) for independent add r32, imm8 because of the dep chain created by the flag-merging uops.

Unlike P4, the register result doesn't have a false-dep on flags (see below), so out-of-order execution takes the flag-merging off the latency critical path when nothing uses the flag result. (But the OOO window is much smaller than mainstream CPUs like Haswell or Ryzen.) Running inc as 2 separate uops is probably a win for Silvermont in most cases; most x86 instructions write all the flags without reading them, breaking these flag dependency chains.

SMont/KNL has a queue between decode and allocate/rename (See Intel's optimization manual, figure 16-2) so expanding to 2 uops during issue can fill bubbles from decode stalls (on instructions like one-operand mul, or pshufb, which produce more than 1 uop from the decoder and cause a 3-7 cycle stall for microcode). Or on Silvermont, just an instruction with more than 3 prefixes (including escape bytes and mandatory prefixes), e.g. REX + any SSSE3 or SSE4 instruction. But note that there is a ~28 uop loop buffer, so small loops don't suffer from these decode stalls.

inc/dec aren't the only instructions that decode as 1 but issue as 2: push/pop, call/ret, and lea with 3 components do this too. So do KNL's AVX512 gather instructions. Source: Intel's optimization manual, 17.1.2 Out-of-Order Engine (KNL). It's only a small throughput penalty (and sometimes not even that if anything else is a bigger bottleneck), so it's generally fine to still use inc for "generic" tuning.


Intel's optimization manual still recommends add 1 over inc in general, to avoid risks of partial-flag stalls. But since Intel's compiler doesn't do that by default, it's not too likely that future CPUs will make inc slow in all cases, like P4 did.

Clang 5.0 and Intel's ICC 17 (on Godbolt) do use inc when optimizing for speed (-O3), not just for size. -mtune=pentium4 makes them avoid inc/dec, but the default -mtune=generic doesn't put much weight on P4.

ICC17 -xMIC-AVX512 (equivalent to gcc's -march=knl) does avoid inc, which is probably a good bet in general for Silvermont / KNL. But it's not usually a performance disaster to use inc, so it's probably still appropriate for "generic" tuning to use inc/dec in most code, especially when the flag result isn't part of the critical path.


Other than Silvermont, this is mostly-stale optimization advice left over from Pentium4. On modern CPUs, there's only a problem if you actually read a flag that wasn't written by the last insn that wrote any flags. e.g. in BigInteger adc loops. (And in that case, you need to preserve CF so using add would break your code.)

add writes all the condition-flag bits in the EFLAGS register. Register-renaming makes write-only easy for out-of-order execution: see write-after-write and write-after-read hazards. add eax, 1 and add ecx, 1 can execute in parallel because they are fully independent of each other. (Even Pentium4 renames the condition flag bits separate from the rest of EFLAGS, since even add leaves the interrupts-enabled and many other bits unmodified.)

On P4, inc and dec depend on the previous value of all the flags, so they can't execute in parallel with each other or with preceding flag-setting instructions. (e.g. add eax, [mem] / inc ecx makes the inc wait until after the add, even if the add's load misses in cache.) This is called a false dependency. Partial-flag writes work by reading the old value of the flags, updating the bits other than CF, then writing the full flags.

All other out-of-order x86 CPUs (including AMD's) rename different parts of FLAGS separately, so internally they do a write-only update to all the flags except CF. (Source: Agner Fog's microarchitecture guide.) Only a few instructions, like adc or cmc, truly read and then write flags. But also shl r, cl (see below).


Cases where add dest, 1 is preferable to inc dest, at least for Intel P6/SnB uarch families:

  • Memory-destination: add [rdi], 1 can micro-fuse the store and the load+add on Intel Core2 and SnB-family, so it's 2 fused-domain uops / 4 unfused-domain uops.
    inc [rdi] can only micro-fuse the store, so it's 3F / 4U.
    According to Agner Fog's tables, AMD and Silvermont run memory-dest inc and add the same, as a single macro-op / uop.

But beware of uop-cache effects with add [label], 1 which needs a 32-bit address and an 8-bit immediate for the same uop.

On Intel SnB-family, variable-count shifts are 3 uops (up from 1 on Core2/Nehalem). AFAICT, two of the uops read/write flags, and an independent uop reads reg and cl, and writes reg. It's a weird case of having better latency (1c + inevitable resource conflicts) than throughput (1.5c), and only being able to achieve max throughput if mixed with instructions that break dependencies on flags. (I posted more about this on Agner Fog's forum). Use BMI2 shlx when possible; it's 1 uop and the count can be in any register.

Anyway, inc (writing flags but leaving CF unmodified) before variable-count shl leaves it with a false dependency on whatever wrote CF last, and on SnB/IvB can require an extra uop to merge flags.

Core2/Nehalem manage to avoid even the false dep on flags: Merom runs a loop of 6 independent shl reg,cl instructions at nearly two shifts per clock, same performance with cl=0 or cl=13. Anything better than 1 per clock proves there's no input-dependency on flags.

I tried loops with shl edx, 2 and shl edx, 0 (immediate-count shifts), but didn't see a speed difference between dec and sub on Core2, HSW, or SKL. I don't know about AMD.

Update: The nice shift performance on Intel P6-family comes at the cost of a large performance pothole which you need to avoid: when an instruction depends on the flag-result of a shift instruction, the front end stalls until the shift is retired. (Source: Intel's optimization manual, Section 3.5.2.6: Partial Flag Register Stalls.) So shr eax, 2 / jnz is pretty catastrophic for performance on Intel pre-Sandybridge, I guess! Use shr eax, 2 / test eax,eax / jnz if you care about Nehalem and earlier. Intel's example makes it clear this applies to immediate-count shifts, not just count=cl.
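Spelled out as code (a sketch; the label is hypothetical):

    ; risky on Nehalem and earlier: jnz reads shr's partial-flag result
    shr   eax, 2
    jnz   .loop

    ; safer there: test rewrites all the flags, so jnz reads a full-flag result
    shr   eax, 2
    test  eax, eax
    jnz   .loop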

In processors based on Intel Core microarchitecture [this means Core 2 and later], shift immediate by 1 is handled by special hardware such that it does not experience partial flag stall.

Intel actually means the special opcode with no immediate, which shifts by an implicit 1. I think there is a performance difference between the two ways of encoding shr eax,1: the short encoding (using the original 8086 opcode D1 /5) produces a write-only (partial) flag result, while the longer encoding (C1 /5 with an imm8 of 1) doesn't have its immediate checked for 0 until execution time, and the out-of-order machinery doesn't track its flag output.
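The two encodings of the same instruction (bytes per the x86 opcode map):

    D1 E8       shr eax, 1      ; short form, implicit count of 1
    C1 E8 01    shr eax, 1      ; imm8 form carrying an explicit 1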

Since looping over bits is common, but looping over every 2nd bit (or any other stride) is very uncommon, this seems like a reasonable design choice. This explains why compilers like to test the result of a shift instead of directly using flag results from shr.

Update: for variable count shifts on SnB-family, Intel's optimization manual says:

3.5.1.6 Variable Bit Count Rotation and Shift

In Intel microarchitecture code name Sandy Bridge, The "ROL/ROR/SHL/SHR reg, cl" instruction has three micro-ops. When the flag result is not needed, one of these micro-ops may be discarded, providing better performance in many common usages. When these instructions update partial flag results that are subsequently used, the full three micro-ops flow must go through the execution and retirement pipeline, experiencing slower performance. In Intel microarchitecture code name Ivy Bridge, executing the full three micro-ops flow to use the updated partial flag result has additional delay.

Consider the looped sequence below:

loop:
   shl eax, cl
   add ebx, eax
   dec edx ; DEC does not update carry, causing SHL to execute slower three micro-ops flow
   jnz loop

The DEC instruction does not modify the carry flag. Consequently, the SHL EAX, CL instruction needs to execute the three micro-ops flow in subsequent iterations. The SUB instruction will update all flags. So replacing DEC with SUB will allow SHL EAX, CL to execute the two micro-ops flow.
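Applying the manual's fix to the loop above (same code, with dec replaced by sub):

loop:
   shl eax, cl
   add ebx, eax
   sub edx, 1   ; SUB writes CF too, so SHL can use the faster two micro-ops flow
   jnz loop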


Terminology

Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead.

Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.

(Note that partial-register penalties are not always the same as partial-flags, see below).

Partial flag stall on Intel P6-family CPUs:
bigint_loop:
    adc   eax, [array_end + rcx*4]   # partial-flag stall when adc reads CF 
    inc   rcx                        # rcx counts up from negative values towards zero
    # test rcx,rcx  # eliminate partial-flag stalls by writing all flags, or better use add rcx,1
    jnz   bigint_loop
# this loop doesn't do anything useful; it's not normally useful to loop the carry-out back to the carry-in for the same accumulator.
# Note that `test` will change the input to the next adc, and so would replacing inc with add 1


In other cases, e.g. a partial flag write followed by a full flag write, or a read of only flags written by inc, is fine. On SnB-family CPUs, inc/dec can even macro-fuse with a jcc, the same as add/sub.
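For example (a sketch; the label is hypothetical), a counted loop can keep the short inc/dec encoding and still get the fused compare-and-branch:

.top:
    ; ... loop body ...
    dec   ecx          ; writes ZF (but not CF)
    jnz   .top         ; dec/jnz macro-fuse into one uop on SnB-family, same as sub ecx,1 / jnz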

After P4, Intel mostly gave up on trying to get people to re-compile with -mtune=pentium4 or modify hand-written asm as much to avoid serious bottlenecks. (Tuning for a specific microarchitecture will always be a thing, but P4 was unusual in deprecating so many things that used to be fast on previous CPUs, and thus were common in existing binaries.) P4 wanted people to use a RISC-like subset of the x86, and also had branch-prediction hints as prefixes for JCC instructions. (It also had other serious problems, like the trace cache that just wasn't good enough, and weak decoders that meant bad performance on trace-cache misses. Not to mention the whole philosophy of clocking very high ran into the power-density wall.)

When Intel abandoned P4 (NetBurst uarch), they returned to P6-family designs (Pentium-M / Core2 / Nehalem) which inherited their partial-flag / partial-reg handling from earlier P6-family CPUs (PPro to PIII) which pre-dated the netburst mis-step. (Not everything about P4 was inherently bad, and some of the ideas re-appeared in Sandybridge, but overall NetBurst is widely considered a mistake.) Some very-CISC instructions are still slower than the multi-instruction alternatives, e.g. enter, loop, or bt [mem], reg (because the value of reg affects which memory address is used), but these were all slow in older CPUs so compilers already avoided them.

Pentium-M even improved hardware support for partial-regs (lower merging penalties). In Sandybridge, Intel kept partial-flag and partial-reg renaming and made it much more efficient when merging is needed (merging uop inserted with no or minimal stall). SnB made major internal changes and is considered a new uarch family, even though it inherits a lot from Nehalem, and some ideas from P4. (But note that SnB's decoded-uop cache is not a trace cache, though, so it's a very different solution to the decoder throughput/power problem that NetBurst's trace cache tried to solve.)


For example, inc al and inc ah can run in parallel on P6/SnB-family CPUs, but reading eax afterwards requires merging.

PPro/PIII stall for 5-6 cycles when reading the full reg. Core2/Nehalem stall for only 2 or 3 cycles while inserting a merging uop for partial regs, but partial flags are still a longer stall.
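A minimal illustration of that merging case (sketch):

    inc   al           ; writes only AL
    inc   ah           ; independent write to AH; can run in parallel with the inc al
    mov   edx, eax     ; reading the full EAX forces a merge: a multi-cycle stall on PPro..Nehalem, a merging uop on SnB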

SnB inserts a merging uop without stalling, like for flags. Intel's optimization guide says that for merging AH/BH/CH/DH into the wider reg, inserting the merging uop takes an entire issue/rename cycle during which no other uops can be allocated. But for low8/low16, the merging uop is "part of the flow", so it apparently doesn't cause additional front-end throughput penalties beyond taking up one of the 4 slots in an issue/rename cycle.

In IvyBridge (or at least Haswell), Intel dropped partial-register renaming for low8 and low16 registers, keeping it only for high8 registers (AH/BH/CH/DH). Reading high8 registers has extra latency. Also, setcc al has a false dependency on the old value of rax, unlike in Nehalem and earlier (and probably Sandybridge). See this HSW/SKL partial-register performance Q&A for the details.

(I've previously claimed that Haswell could merge AH with no uop, but that's not true and not what Agner Fog's guide says. I skimmed too quickly and unfortunately repeated my wrong understanding in lots of comments and other posts.)

AMD CPUs, and Intel Silvermont, don't rename partial regs (other than flags), so mov al, [mem] has a false dependency on the old value of eax. (The upside is no partial-reg merging slowdowns when reading the full reg later.)


Normally, the only time add instead of inc will make your code faster on AMD or mainstream Intel is when your code actually depends on the doesn't-touch-CF behaviour of inc. i.e. usually add only helps when it would break your code, but note the shl case mentioned above, where the instruction reads flags but usually your code doesn't care about that, so it's a false dependency.

If you do actually want to leave CF unmodified, pre-SnB-family CPUs have serious problems with partial-flag stalls, but on SnB-family the overhead of having the CPU merge the partial flags is very low, so it can be best to keep using inc or dec as part of a loop condition when targeting those CPUs, with some unrolling. (For details, see the BigInteger adc Q&A I linked earlier). It can be useful to use lea to do arithmetic without affecting flags at all, if you don't need to branch on the result.


Skylake doesn't have partial-flag merging costs

Update: Skylake doesn't have partial-flag merging uops at all: CF is just a separate register from the rest of FLAGS. Instructions that need both parts (like cmovbe) read both inputs separately. That makes cmovbe a 2-uop instruction, but most other cmovcc instructions 1-uop on Skylake. See What is a Partial Flag Stall?.

adc only reads CF so it can be single-uop on Skylake with no interaction at all with an inc or dec in the same loop.

(TODO: rewrite earlier parts of this answer.)
