Why do gcc and clang generate mov reg,-1
Question
I am using Compiler Explorer to see what assembly gcc and clang emit for some code. Recently I looked at the output for this function:
#include <stdint.h>

int compare_int64(int64_t left, int64_t right)
{
    return (left < right) ? -1 : (left > right) ? 1 : 0;
}
The point of this exercise is not C++, where this code might be inlined anyway, but the case where such a function is actually called.
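One common situation where such a function really is called rather than inlined is a comparison callback. The following is only a minimal sketch of that situation using the standard qsort; the adapter compare_int64_for_qsort and the helper sort_int64 are hypothetical names introduced for illustration and are not part of the original question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* The three-way compare from the question: returns exactly -1, 0 or 1. */
int compare_int64(int64_t left, int64_t right)
{
    return (left < right) ? -1 : (left > right) ? 1 : 0;
}

/* Hypothetical adapter: qsort hands the callback pointers to two elements. */
int compare_int64_for_qsort(const void *a, const void *b)
{
    const int64_t left = *(const int64_t *)a;
    const int64_t right = *(const int64_t *)b;
    return compare_int64(left, right);
}

/* Hypothetical helper: sorts an array of int64_t in ascending order. */
void sort_int64(int64_t *values, size_t count)
{
    assert(values != NULL || count == 0);
    if (count > 1) {
        /* The comparison is reached through a function pointer here,
           so its code is actually called once per comparison. */
        qsort(values, count, sizeof values[0], compare_int64_for_qsort);
    }
}
```

Whether a given compiler and C library inline the callback is outside the scope of this sketch; it only shows the calling pattern the question has in mind.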
With -O3 this is the output:
clang:
xor ecx, ecx
cmp rdi, rsi
setg cl
mov eax, -1
cmovge eax, ecx
ret
gcc:
xor eax, eax
cmp rdi, rsi
mov edx, -1
setg al
cmovl eax, edx
ret
I noticed this code is 17 bytes in size, which is just 1 byte over a nice 16 bytes (the default code alignment for x64 in another non-C++ compiler I am using is 16). For the gcc code shown I was thinking of using either lea edx,[eax-1] or or edx,-1 (the latter before the cmp, of course) to reduce the code size. Interestingly, when using -Os gcc inserts a jl instruction, which is kind of disastrous for the performance of that function.
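The byte counts above can be checked by tallying the machine-code encodings. The sketch below assumes the encodings a typical assembler would pick for these instructions in 64-bit mode (an assembler is free to choose an equivalent encoding of the same length); the array names and the helper size_with_replacement are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* gcc -O3 output from the question, one instruction per row
   (assumed encodings, 64-bit mode). */
const unsigned char gcc_o3_body[] = {
    0x31, 0xC0,                   /* xor   eax, eax  (2 bytes) */
    0x48, 0x39, 0xF7,             /* cmp   rdi, rsi  (3 bytes) */
    0xBA, 0xFF, 0xFF, 0xFF, 0xFF, /* mov   edx, -1   (5 bytes) */
    0x0F, 0x9F, 0xC0,             /* setg  al        (3 bytes) */
    0x0F, 0x4C, 0xC2,             /* cmovl eax, edx  (3 bytes) */
    0xC3                          /* ret             (1 byte)  */
};

/* The instruction in question and the shorter candidates. */
const unsigned char mov_edx_m1[]     = {0xBA, 0xFF, 0xFF, 0xFF, 0xFF}; /* mov edx, -1      */
const unsigned char or_edx_m1[]      = {0x83, 0xCA, 0xFF};             /* or  edx, -1      */
const unsigned char lea_edx_rax_m1[] = {0x8D, 0x50, 0xFF};             /* lea edx, [rax-1] */
const unsigned char lea_edx_eax_m1[] = {0x67, 0x8D, 0x50, 0xFF};       /* lea edx, [eax-1] */

/* The whole function is 17 bytes, one more than a 16-byte block. */
static_assert(sizeof gcc_o3_body == 17, "expected 17 bytes of code");

/* Hypothetical helper: function size if the 5-byte mov is swapped
   for an instruction of replacement_size bytes. */
size_t size_with_replacement(size_t replacement_size)
{
    return sizeof gcc_o3_body - sizeof mov_edx_m1 + replacement_size;
}
```

Under these assumptions the or form brings the function to 15 bytes and lea edx,[eax-1] to 16, since the 32-bit address form needs an extra address-size prefix in 64-bit mode. The tally says nothing about speed.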
I am no expert, but I looked into Agner Fog's instruction tables, and if I am not mistaken the timings/latencies of mov, lea and or are equal.
So the actual question(s):
Why do both compilers use a 5-byte instruction instead of a shorter 3- or 4-byte instruction?
Would it be harmless to replace the mov reg,-1 with lea reg,[eax-1] or or reg,-1?
Recommended answer
When optimizing for speed, mov reg, -1 is used instead of or reg, -1 because the former uses the register as a "write-only" operand, which the CPU knows about and uses to schedule it efficiently (out of order). In contrast, or reg, -1, even though it always produces -1, is not recognized by the CPU as a dependency-breaking (write-only) instruction.
To illustrate how this affects performance:
mov eax, [rdi] # Imagine a cache-miss here.
mov [rsi], eax
mov eax, -1 # `mov eax, -1` is able to dispatch and execute without waiting
# for the cache-miss to be served.
add eax, edx # `add eax, edx` only needs to wait 1 cycle for `mov` to
# complete (assuming `edx` is ready) and then it can
# dispatch while cache-miss load from a few lines above
# is still in progress.
Now consider this code:
mov eax, [rdi] # Imagine a cache-miss here.
mov [rsi], eax
or eax, -1 # Now this instruction has to wait for the cache-miss
# load to complete.
add eax, edx # And this one will be waiting too.
(The example applies to any current x86-64 CPU, such as Skylake/Ice Lake/Zen.)
If you're writing the code in assembly and are certain that the register is not part of a dependency chain that's currently in progress, you can use or reg, -1 and it will have no negative effect (if your assumptions are right, of course).
Because of the danger of accidentally attaching to a dependency chain, compilers generally do not use or reg, -1 to produce -1 when optimizing for speed.
When we need a zero instead of -1, we're in luck, because there are idioms that CPUs recognize, for example xor reg, reg and sub reg, reg. They're smaller in code size, and CPUs recognize that the result of the computation does not depend on the register's previous value (it is always zero).
On top of being smaller in code size, these zero idioms are usually handled by the front-end of the CPU, so instructions that depend on the result can dispatch immediately.
Zero idioms also work for vector registers: vpxor xmm0, xmm0, xmm0 produces zero with no dependency on the previous value of xmm0 and with zero latency. Interestingly, vector registers also have a -1 idiom, vpcmpeqd xmm0, xmm0, xmm0. It is recognized as write-only (comparing a value with itself is always true), but it still has to execute (so it has latency = 1), at least on SKL/Zen CPUs.
More about producing zeros: What is the best way to set a register to zero in x86 assembly: xor, mov or and?
Exactly which idioms are recognized is documented in Agner Fog's manuals and in the CPU optimization guides. TL;DR: general-purpose registers only have zero idioms; vector registers have both zero idioms and all-ones idioms.
See also: Set all bits in CPU register to 1 efficiently (mentions lea edx,[rax-1]).
A note on the actual function: as you can see from the assembly, most of the work actually goes into producing the specific constants that you requested.
If all you intend to do with -1, 0, 1 is branch on whether the result is negative/zero/positive, then it would be better to produce left - right and branch/cmov on that directly. This only works as long as you can guarantee there is no overflow, because overflow makes the subtraction result alone insufficient for comparison; in that case just use -1, 0, 1.
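As a rough sketch of the subtraction approach, one way to guarantee the absence of overflow is to restrict the inputs to int32_t and subtract in 64 bits. That restriction is an assumption made for this example, not something the question states, and the names order_int32, max_int32 and check_same_ordering are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The original three-way compare: always returns exactly -1, 0 or 1. */
int compare_int64(int64_t left, int64_t right)
{
    return (left < right) ? -1 : (left > right) ? 1 : 0;
}

/* Hypothetical helper. Both inputs fit in 32 bits, so their difference
   always fits in 64 bits and cannot overflow. Only the sign of the
   result is meaningful, not its magnitude. */
int64_t order_int32(int32_t left, int32_t right)
{
    return (int64_t)left - (int64_t)right;
}

/* Hypothetical caller that only branches on the sign of the result. */
int32_t max_int32(int32_t a, int32_t b)
{
    return (order_int32(a, b) < 0) ? b : a;
}

/* Checks that the subtraction orders a pair the same way as the
   -1/0/1 compare does. */
void check_same_ordering(int32_t a, int32_t b)
{
    const int64_t diff = order_int32(a, b);
    const int expected = compare_int64(a, b);
    assert((diff < 0) == (expected < 0));
    assert((diff == 0) == (expected == 0));
    assert((diff > 0) == (expected > 0));
}
```

With full-range int64_t inputs the same subtraction can overflow (for example INT64_MIN minus 1), which is the case where the -1, 0, 1 form remains the safe choice.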