在 x86 程序集中将寄存器设置为零的最佳方法是什么:xor、mov 或 and? [英] What is the best way to set a register to zero in x86 assembly: xor, mov or and?

查看:24
本文介绍了在 x86 程序集中将寄存器设置为零的最佳方法是什么:xor、mov 或 and?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

以下所有指令都做同样的事情:将 %eax 设置为零.哪种方式是最佳的(需要最少的机器周期)?

xorl %eax, %eax移动 $0, %eax还有 $0, %eax

解决方案

TL;DR 总结:xor same, same 是所有 CPU 的最佳选择.没有其他方法比它有任何优势,它至少比任何其他方法都有一些优势.它是 Intel 和 AMD 官方推荐的,以及编译器的作用.在64位模式下,仍然使用xor r32, r32,因为写一个 32 位 reg 将高位 32 置零.xor r64, r64 浪费一个字节,因为它需要一个REX前缀.

比这更糟糕的是,Silvermont 仅将 xor r32,r32 识别为 dep-breaking,而不是 64 位操作数大小.因此即使由于您将 r8..r15 归零,仍然需要 REX 前缀,请使用 xor r10d,r10d,而不是 xor r10,r10.

GP-整数示例:

异或 eax, eax ;RAX = 0.包括 AL=0 等.异或 r10d, r10d ;R10 = 0.仍然喜欢 32 位操作数大小.异或 edx, edx ;RDX = 0;小代码替代: cdq ;如果 EAX 已经为零,则 RDX 为零;次优异或 rax,rax ;浪费了 REX 前缀,并且在 Silvermont 上特别慢异或 r10,r10 ;Silvermont 上很糟糕(不是 dep 中断),与其他 CPU 上的 r10d 相同,因为 r10d 或 r10 仍然需要 REX 前缀.移动 eax, 0 ;不接触 FLAGS,但速度不快,占用更多字节和 eax, 0 ;虚假依赖.(微基准实验可能需要这个)子 eax, eax ;在大多数但不是所有 CPU 上与 xor 相同;例如在 Silvermont 不好.异或 cl, cl ;false 依赖于某些 CPU,而不是归零习语.使用异或 ecx,ecxmov cl, 0 ;只有 2 个字节,并且可能比 xor cl,cl *if* 你需要保留 ECX/RCX 的其余部分不变


将向量寄存器清零通常最好使用 pxor xmm, xmm.这通常是 gcc 所做的(甚至在使用 FP 指令之前).

xorps xmm, xmm 是有道理的.它比 pxor 短一个字节,但是 xorps 在 Intel Nehalem 上需要执行端口 5,而 pxor 可以在任何端口(0/1/5).(Nehalem 在整数和 FP 之间的 2c 旁路延迟延迟通常不相关,因为乱序执行通常可以将其隐藏在新依赖链的开始处).

在 SnB 系列微架构上,异或归零的风格甚至都不需要执行端口.在 AMD 和 Nehalem P6/Core2 之前的 Intel 上,xorpspxor 的处理方式相同(作为向量整数指令).

使用 128b 向量指令的 AVX 版本也将 reg 的上部归零,因此 vpxor xmm, xmm, xmm 是对 YMM(AVX1/AVX2) 或 ZMM 归零的不错选择(AVX512) 或任何未来的矢量扩展.vpxor ymm, ymm, ymm 不需要任何额外的字节来编码,但在 Intel 上运行相同,但在 Zen2 (2 uops) 之前的 AMD 上速度较慢.AVX512 ZMM 归零需要额外的字节(对于 EVEX 前缀),因此应首选 XMM 或 YMM 归零.

XMM/YMM/ZMM 示例

 # 好:xorps xmm0, xmm0 ;最小代码大小(对于非 AVX)像素或 xmm0, xmm0 ;花费一个额外的字节,在 Nehalem 上的任何端口上运行.xorps xmm15, xmm15 ;需要一个 REX 前缀,但如果您需要在没有 AVX 的情况下使用高寄存器,这是不可避免的.代码大小是唯一的惩罚.# 适合 AVX:vpx 或 xmm0, xmm0, xmm0 ;零点 X/Y/ZMM0vpx 或 xmm15, xmm0, xmm0 ;零 X/Y/ZMM15,仍然只有 2 字节的 VEX 前缀#次优AVXvpx 或 xmm15, xmm15, xmm15 ;3 字节 VEX 前缀,因为高源 regvpx 或 ymm0, ymm0, ymm0 ;Zen2 之前在 AMD 上解码为 2 uops# 适合 AVX512vpx 或 xmm15, xmm0, xmm0 ;使用 AVX1 编码指令(2 字节 VEX 前缀)将 ZMM15 归零.vpxord xmm30, xmm30, xmm30 ;将 zmm16..31 归零时,EVEX 是不可避免的,但仍然更喜欢 XMM 或 YMM,以便在可能的未来 AMD 上减少 uops.可能值得只使用高 regs 以避免在短函数中需要 vzeroupper.# 适合 AVX512 * 不带 * AVX512VL(例如 KNL/Xeon Phi)vpxord zmm30, zmm30, zmm30 ;如果没有 AVX512VL,您必须使用 512 位指令.# 使用 AVX512 次优(即使没有 AVX512VL)vpxord zmm0, zmm0, zmm0 ;EVEX 前缀(4 字节)和 512 位 uop.甚至在 KNL 上使用 AVX1 vpx 或 xmm0、xmm0、xmm0 以节省代码大小.

参见 使用 xmm 寄存器的 AMD Jaguar/Bulldozer/Zen 上的 vxorps 归零是否比 ymm 更快?
在 Knights Landing 上清除单个或几个 ZMM 寄存器的最有效方法是什么?

半相关:设置 __m256 的最快方法所有 ONE 位的值
有效地将 CPU 寄存器中的所有位设置为 1 还包括 AVX512 k0..7 掩码寄存器.SSE/AVX vpcmpeqd 在很多方面都被破坏了(尽管仍然需要一个 uop 来编写 1s),但是用于 ZMM regs 的 AVX512 vpternlogd 甚至都不是 dep-breaking.在循环内部考虑从另一个寄存器复制而不是使用 ALU uop 重新创建寄存器,尤其是使用 AVX512.

但归零很便宜:在循环内对 xmm reg 进行异或归零通常与复制一样好,除了在某些 AMD CPU(推土机和 Zen)上,它们对向量 reg 进行了移动消除,但仍需要 ALU uop 来写入异或归零的零.


在各种 uarches 上将 xor 等习语归零有什么特别之处

某些 CPU 将 sub same,same 识别为类似于 xor 的归零习语,但是所有识别任何归零习语的 CPU 都识别 xor.只需使用 xor 这样您就不必担心哪个 CPU 识别哪个归零习惯用法.

xor(作为公认的归零习惯用法,与 mov reg, 0 不同)有一些明显和一些微妙的优势(总结列表,然后我将扩展这些):

  • mov reg,0 更小的代码大小.(所有 CPU)
  • 避免了对后续代码的部分注册惩罚.(英特尔 P6 系列和 SnB 系列).
  • 不使用执行单元,从而节省电力并释放执行资源.(英特尔 SnB 系列)
  • 较小的 uop(无即时数据)在 uop 缓存行中留出空间,以便在需要时借用附近的指令.(英特尔 SnB 系列).
  • 不会用完物理寄存器文件中的条目.(至少英特尔 SnB 系列(和 P4),也可能是 AMD,因为它们使用类似的 PRF 设计,而不是像英特尔 P6 系列微架构那样在 ROB 中保持寄存器状态.)

更小的机器代码大小(2 个字节而不是 5 个字节)始终是一个优势:更高的代码密度导致更少的指令缓存未命中,以及更好的指令获取和潜在的解码带宽.


在英特尔 SnB 系列微架构上不使用执行单元进行异或的好处很小,但可以节省电量.在只有 3 个 ALU 执行端口的 SnB 或 IvB 上更有可能重要.Haswell 和更高版本有 4 个可以处理整数 ALU 指令的执行端口,包括 mov r32, imm32,因此通过调度程序的完美决策(这在实践中并不总是发生),HSW 仍然可以即使它们都需要 ALU 执行端口,也能维持每个时钟 4 uop.

有关更多详细信息,请参阅我对另一个有关清零寄存器的问题的回答.

Bruce Dawson 的博文 Michael Petch 链接(在对该问题的评论中)指出 xor 在寄存器重命名阶段处理,不需要执行单元(未融合域中的零 uops),但错过了事实上,它仍然是融合域中的一个 uop.现代英特尔 CPU 可以发出 &每个时钟停用 4 个融合域 uops.这就是每个时钟限制 4 个零的来源.寄存器重命名硬件的复杂性增加只是将设计宽度限制为 4 的原因之一.(Bruce 写了一些非常出色的博客文章,比如他关于 FP 数学和 x87/SSE/舍入问题,我强烈推荐).>


在 AMD Bulldozer 系列 CPU 上,mov 立即 在与 xor 相同的 EX0/EX1 整数执行端口上运行.mov reg,reg 也可以在 AGU0/1 上运行,但这仅用于寄存器复制,不适用于立即数设置.所以 AFAIK,在 AMD 上 xormov 的唯一优势是编码更短.它也可能节省物理寄存器资源,但我没有看到任何测试.


已识别的归零习惯用法避免部分寄存器惩罚,英特尔 CPU 将部分寄存器与完整寄存器分开重命名(P6 和 SnB 系列).

xor 会将寄存器标记为高位为零,所以 xor eax, eax/inc al/inc eax 避免了前 IvB CPU 所具有的通常的部分寄存器惩罚.即使没有xor,IvB也只需要在修改高8位(AH)然后读取整个寄存器时进行合并uop,Haswell甚至将其删除.

来自 Agner Fog 的微架构指南,第 98 页(Pentium M 部分,包括 SnB 在内的后续部分引用):

<块引用>

处理器将寄存器与自身的异或识别为设置它为零.寄存器中的特殊标签会记住高部分寄存器的值为零,因此 EAX = AL.这个标签甚至被记住循环:

 ;例 7.9.循环中避免了部分寄存器问题异或 eax, eaxmov ecx, 100二:mov al, [esi]mov [edi], eax ;没有额外的uop公司添加 edi, 4十二月林俊杰

<块引用>

(来自 pg82):处理器记住 EAX 的高 24 位为零,只要您不会收到中断、错误预测或其他序列化事件.

该指南的

pg82 还确认 mov reg, 0 不被识别为归零习惯用法,至少在 PIII 或 PM 等早期 P6 设计中是这样.如果他们在后来的 CPU 上使用晶体管来检测它,我会感到非常惊讶.


xor 设置标志,这意味着您在测试条件时必须小心.由于 setcc 遗憾的是仅适用于 8 位目标,因此您通常需要注意避免部分寄存器惩罚.

如果 x86-64 将已删除的操作码之一(如 AAM)重新用于 16/32/64 位 setcc r/m,并将谓词编码在源代码中,那就太好了-register r/m 字段的 3 位字段(其他一些单操作数指令将它们用作操作码位的方式).但他们没有这样做,无论如何这对 x86-32 无济于事.

理想情况下,您应该使用 xor/set flags/setcc/读取完整寄存器:

<代码>...调用 some_func异或 ecx,ecx ;零*之前*测试测试 eax,eaxsetnz cl ;cl = (some_func() != 0)添加 ebx, ecx ;这里没有部分注册惩罚

这在所有 CPU 上都具有最佳性能(没有停顿、合并 uop 或错误依赖).

当您不想在标志设置指令之前进行异或时,事情会变得更加复杂.例如你想在一个条件下分支,然后在另一个条件下从相同的标志 setcc .例如cmp/jlesete,并且您要么没有备用寄存器,要么希望将 xor 排除在非-完全采用代码路径.

没有不影响标志的公认归零习惯用法,因此最佳选择取决于目标微体系结构.在 Core2 上,插入合并 uop 可能会导致 2 或 3 个周期的停顿.它在 SnB 上似乎更便宜,但我没有花太多时间尝试测量.使用 mov reg, 0/setcc 会对较旧的 Intel CPU 产生显着的影响,但在较新的 Intel CPU 上仍然会更糟.

使用 setcc/movzx r32, r8 可能是 Intel P6 & 的最佳选择SnB 系列,如果您不能在标志设置指令之前进行异或为零.这应该比在异或归零后重复测试要好.(甚至不要考虑 sahf/lahfpushf/popf).IvB 可以消除 movzx r32, r8(即通过寄存器重命名处理它,没有执行单元或延迟,如异或归零).Haswell 和后来只消除了常规的 mov 指令,所以 movzx 需要一个执行单元并且具有非零延迟,使得 test/setcc/movzxxor/test/setcc 差,但至少和 test/mov r,0/setcc(在旧 CPU 上效果更好).

在 AMD/P4/Silvermont 上使用 setcc/movzx 而不先归零是不好的,因为它们不会单独跟踪子寄存器的 deps.对寄存器的旧值会有一个错误的依赖.当 xor/test/setcc 使用 mov reg, 0/setcc 进行归零/依赖破坏可能是最好的选择代码> 不是一个选项.

当然,如果您不需要 setcc 的输出宽度超过 8 位,则不需要将任何内容归零.但是,如果您选择的寄存器最近是长依赖链的一部分,请注意对 P6/SnB 以外的 CPU 的错误依赖.(如果您调用的函数可能会保存/恢复您正在使用的寄存器的一部分,请注意导致部分 reg 停顿或额外的 uop.)


and 立即为零 不是特殊情况,独立于我所知道的任何 CPU 上的旧值,因此它不会破坏依赖性链.它与 xor 相比没有任何优势,但有许多缺点.

仅当您希望将依赖项作为延迟测试的一部分,但希望通过归零和添加来创建已知值时,它才对编写微基准很有用.


有关微架构的详细信息,请参阅http://agner.org/optimize/,包括哪些归零习语被识别为依赖关系破坏(例如,sub same,same 在某些但不是所有 CPU 上,而 xor same,same 在所有 CPU 上都被识别.) mov 确实打破了对寄存器旧值的依赖链(无论源值如何,零与否,因为 mov 就是这样工作的).xor 只在 src 和 dest 是同一个寄存器的特殊情况下打破依赖链,这就是为什么 mov 被排除在 specially 公认的依赖破坏者.(另外,因为它不被认为是归零习语,还有其他好处.)

有趣的是,最古老的 P6 设计(PPro 到 Pentium III)没有xor-zeroing 识别为依赖关系破坏者,仅作为归零习语为了避免部分寄存器停顿,因此在某些情况下,值得使用 both movxor-zeroing为了打破dep然后再次归零+设置高位为零的内部标记位,因此EAX = AX = AL.

参见 Agner Fog 的示例 6.17.在他的微拱pdf中.他说这也适用于 P2、P3,甚至(早期?)PM.对链接的博客文章 说只有 PPro 有这种疏忽,但我已经在 Katmai PIII 上进行了测试,@Fanael 在 Pentium M 上进行了测试,我们都发现它没有破坏延迟的依赖性-bound imul 链.不幸的是,这证实了 Agner Fog 的结果.


TL:DR:

如果它确实使您的代码更好或节省了指令,那么可以肯定,将 mov 归零以避免触及标志,只要您不引入代码大小以外的性能问题.避免破坏标志是不使用 xor 的唯一合理原因,但有时如果您有备用寄存器,您可以在设置标志的事情之前进行异或零.

mov-zero 在 setcc 之前比 movzx reg32, reg8 之后的延迟更好(英特尔除外,当您可以选择不同的寄存器时),但更糟糕的代码大小.

All the following instructions do the same thing: set %eax to zero. Which way is optimal (requiring fewest machine cycles)?

xorl   %eax, %eax
mov    $0, %eax
andl   $0, %eax

解决方案

TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD, and what compilers do. In 64-bit mode, still use xor r32, r32, because writing a 32-bit reg zeros the upper 32. xor r64, r64 is a waste of a byte, because it needs a REX prefix.

Even worse than that, Silvermont only recognizes xor r32,r32 as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d, not xor r10,r10.

GP-integer examples:

xor   eax, eax       ; RAX = 0.  Including AL=0 etc.
xor   r10d, r10d     ; R10 = 0.  Still prefer 32-bit operand-size.

xor   edx, edx       ; RDX = 0
 ; small code-size alternative:    cdq    ; zero RDX if EAX is already zero

; SUB-OPTIMAL
xor   rax,rax       ; waste of a REX prefix, and extra slow on Silvermont
xor   r10,r10       ; bad on Silvermont (not dep breaking), same as r10d on other CPUs because a REX prefix is still needed for r10d or r10.
mov   eax, 0        ; doesn't touch FLAGS, but not faster and takes more bytes
 and   eax, 0        ; false dependency.  (Microbenchmark experiments might want this)
 sub   eax, eax      ; same as xor on most but not all CPUs; bad on Silvermont for example.

xor   cl, cl        ; false dep on some CPUs, not a zeroing idiom.  Use xor ecx,ecx
mov   cl, 0         ; only 2 bytes, and probably better than xor cl,cl *if* you need to leave the rest of ECX/RCX unmodified


Zeroing a vector register is usually best done with pxor xmm, xmm. That's typically what gcc does (even before use with FP instructions).

xorps xmm, xmm can make sense. It's one byte shorter than pxor, but xorps needs execution port 5 on Intel Nehalem, while pxor can run on any port (0/1/5). (Nehalem's 2c bypass delay latency between integer and FP is usually not relevant, because out-of-order execution can typically hide it at the start of a new dependency chain).

On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and pre-Nehalem P6/Core2 Intel, xorps and pxor are handled the same way (as vector-integer instructions).

Using the AVX version of a 128b vector instruction zeros the upper part of the reg as well, so vpxor xmm, xmm, xmm is a good choice for zeroing YMM(AVX1/AVX2) or ZMM(AVX512), or any future vector extension. vpxor ymm, ymm, ymm doesn't take any extra bytes to encode, though, and runs the same on Intel, but slower on AMD before Zen2 (2 uops). The AVX512 ZMM zeroing would require extra bytes (for the EVEX prefix), so XMM or YMM zeroing should be preferred.

XMM/YMM/ZMM examples

    # Good:
 xorps   xmm0, xmm0         ; smallest code size (for non-AVX)
 pxor    xmm0, xmm0         ; costs an extra byte, runs on any port on Nehalem.
 xorps   xmm15, xmm15       ; Needs a REX prefix but that's unavoidable if you need to use high registers without AVX.  Code-size is the only penalty.

   # Good with AVX:
 vpxor xmm0, xmm0, xmm0    ; zeros X/Y/ZMM0
 vpxor xmm15, xmm0, xmm0   ; zeros X/Y/ZMM15, still only 2-byte VEX prefix

#sub-optimal AVX
 vpxor xmm15, xmm15, xmm15  ; 3-byte VEX prefix because of high source reg
 vpxor ymm0, ymm0, ymm0     ; decodes to 2 uops on AMD before Zen2


    # Good with AVX512
 vpxor  xmm15,  xmm0, xmm0     ; zero ZMM15 using an AVX1-encoded instruction (2-byte VEX prefix).
 vpxord xmm30, xmm30, xmm30    ; EVEX is unavoidable when zeroing zmm16..31, but still prefer XMM or YMM for fewer uops on probable future AMD.  May be worth using only high regs to avoid needing vzeroupper in short functions.
    # Good with AVX512 *without* AVX512VL (e.g. KNL / Xeon Phi)
 vpxord zmm30, zmm30, zmm30    ; Without AVX512VL you have to use a 512-bit instruction.

# sub-optimal with AVX512 (even without AVX512VL)
 vpxord  zmm0, zmm0, zmm0      ; EVEX prefix (4 bytes), and a 512-bit uop.  Use AVX1 vpxor xmm0, xmm0, xmm0 even on KNL to save code size.

See Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm? and
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?

Semi-related: Fastest way to set __m256 value to all ONE bits and
Set all bits in CPU register to 1 efficiently also covers AVX512 k0..7 mask registers. SSE/AVX vpcmpeqd is dep-breaking on many (although still needs a uop to write the 1s), but AVX512 vpternlogd for ZMM regs isn't even dep-breaking. Inside a loop consider copying from another register instead of re-creating ones with an ALU uop, especially with AVX512.

But zeroing is cheap: xor-zeroing an xmm reg inside a loop is usually as good as copying, except on some AMD CPUs (Bulldozer and Zen) which have mov-elimination for vector regs but still need an ALU uop to write zeros for xor-zeroing.


What's special about zeroing idioms like xor on various uarches

Some CPUs recognize sub same,same as a zeroing idiom like xor, but all CPUs that recognize any zeroing idioms recognize xor. Just use xor so you don't have to worry about which CPU recognizes which zeroing idiom.

xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary list, then I'll expand on those):

  • smaller code-size than mov reg,0. (All CPUs)
  • avoids partial-register penalties for later code. (Intel P6-family and SnB-family).
  • doesn't use an execution unit, saving power and freeing up execution resources. (Intel SnB-family)
  • smaller uop (no immediate data) leaves room in the uop cache-line for nearby instructions to borrow if needed. (Intel SnB-family).
  • doesn't use up entries in the physical register file. (Intel SnB-family (and P4) at least, possibly AMD as well since they use a similar PRF design instead of keeping register state in the ROB like Intel P6-family microarchitectures.)

Smaller machine-code size (2 bytes instead of 5) is always an advantage: Higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.


The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't always happen in practice), HSW could still sustain 4 uops per clock even when they all need ALU execution ports.

See my answer on another question about zeroing registers for some more details.

Bruce Dawson's blog post that Michael Petch linked (in a comment on the question) points out that xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), but missed the fact that it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4. (Bruce has written some very excellent blog posts, like his series on FP math and x87 / SSE / rounding issues, which I do highly recommend).


On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg,reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage to xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.


Recognized zeroing idioms avoid partial-register penalties on Intel CPUs which rename partial registers separately from full registers (P6 & SnB families).

xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8bits (AH) are modified and then the whole register is read, and Haswell even removes that.

From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):

The processor recognizes the XOR of a register with itself as setting it to zero. A special tag in the register remembers that the high part of the register is zero so that EAX = AL. This tag is remembered even in a loop:

    ; Example    7.9. Partial register problem avoided in loop
    xor    eax, eax
    mov    ecx, 100
LL:
    mov    al, [esi]
    mov    [edi], eax    ; No extra uop
    inc    esi
    add    edi, 4
    dec    ecx
    jnz    LL

(from pg82): The processor remembers that the upper 24 bits of EAX are zero as long as you don't get an interrupt, misprediction, or other serializing event.

pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.


xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8bit destination, you usually need to take care to avoid partial-register penalties.

It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m, with the predicate encoded in the source-register 3-bit field of the r/m field (the way some other single-operand instructions use them as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway.

Ideally, you should use xor / set flags / setcc / read full register:

...
call  some_func
xor     ecx,ecx    ; zero *before* the test
test    eax,eax
setnz   cl         ; cl = (some_func() != 0)
add     ebx, ecx   ; no partial-register penalty here

This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).

Things are more complicated when you don't want to xor before a flag-setting instruction. e.g. you want to branch on one condition and then setcc on another condition from the same flags. e.g. cmp/jle, sete, and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.

There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.

Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc, but still at least as good as test/mov r,0/setcc (and much better on older CPUs).

Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option.

Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)


and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor and many disadvantages.

It's useful only for writing microbenchmarks when you want a dependency as part of a latency test, but want to create a known value by zeroing and adding.


See http://agner.org/optimize/ for microarch details, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all.) mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special-case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)

Interestingly, the oldest P6 design (PPro through Pentium III) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both mov and then xor-zeroing in that order to break the dep and then zero again + set the internal tag bit that the high bits are zero so EAX=AX=AL.

See Agner Fog's Example 6.17. in his microarch pdf. He says this also applies to P2, P3, and even (early?) PM. A comment on the linked blog post says it was only PPro that had this oversight, but I've tested on Katmai PIII, and @Fanael tested on a Pentium M, and we both found that it didn't break a dependency for a latency-bound imul chain. This confirms Agner Fog's results, unfortunately.


TL:DR:

If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, but sometimes you can xor-zero ahead of the thing that sets flags if you have a spare register.

mov-zero ahead of setcc is better for latency than movzx reg32, reg8 after (except on Intel when you can pick different registers), but worse code size.

这篇关于在 x86 程序集中将寄存器设置为零的最佳方法是什么:xor、mov 或 and?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆