Why do gcc and clang generate mov reg,-1

Question

I am using Compiler Explorer to look at the output of gcc and clang, to get an idea of what assembly these compilers emit for some code. Recently I looked at the output for this function:

#include <stdint.h>

int compare_int64(int64_t left, int64_t right)
{
    return (left < right) ? -1 : (left > right) ? 1 : 0;
}

The point of this exercise is not C++, where this code might be inlined anyway, but the case where such a function is actually called.

With -O3 this is the output:

clang:

xor     ecx, ecx
cmp     rdi, rsi
setg    cl
mov     eax, -1
cmovge  eax, ecx
ret

gcc:

xor     eax, eax
cmp     rdi, rsi
mov     edx, -1
setg    al
cmovl   eax, edx
ret

I noticed this code is 17 bytes in size, which is just 1 byte over a nice 16 bytes (the default code alignment for x64 in another, non-C++ compiler I am using is 16). For the gcc code shown, I was thinking of using either lea edx,[eax-1] or or edx,-1 (the latter before the cmp, of course, since or modifies the flags) to reduce the code size. Interestingly, when using -Os gcc inserts a jl instruction, which is kind of disastrous for the performance of that function.
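The byte counts can be checked by writing out the encodings. The following is only a sketch: the byte sequences are the usual 64-bit-mode encodings as I understand them (an assembler may pick an equivalent encoding of the same length), and the array and function names are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed machine code for the gcc -O3 sequence shown above. */
static const uint8_t gcc_o3[] = {
    0x31, 0xC0,                   /* xor   eax, eax */
    0x48, 0x39, 0xF7,             /* cmp   rdi, rsi */
    0xBA, 0xFF, 0xFF, 0xFF, 0xFF, /* mov   edx, -1  */
    0x0F, 0x9F, 0xC0,             /* setg  al       */
    0x0F, 0x4C, 0xC2,             /* cmovl eax, edx */
    0xC3                          /* ret            */
};

/* Candidate ways of putting -1 into edx (assumed encodings). */
static const uint8_t mov_edx_m1[]     = { 0xBA, 0xFF, 0xFF, 0xFF, 0xFF }; /* mov edx, -1      */
static const uint8_t or_edx_m1[]      = { 0x83, 0xCA, 0xFF };             /* or  edx, -1      */
static const uint8_t lea_edx_rax_m1[] = { 0x8D, 0x50, 0xFF };             /* lea edx, [rax-1] */
static const uint8_t lea_edx_eax_m1[] = { 0x67, 0x8D, 0x50, 0xFF };       /* lea edx, [eax-1] */

/* Size of the whole function if the 5-byte mov is swapped for an
   instruction that is replacement_len bytes long. */
size_t size_with_replacement(size_t replacement_len)
{
    return sizeof gcc_o3 - sizeof mov_edx_m1 + replacement_len;
}
```

Under these assumptions the function is 17 bytes as compiled, 15 bytes with either 3-byte replacement, and 16 bytes with the 4-byte lea edx,[eax-1] form (which needs an address-size prefix in 64-bit mode).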

I am no expert, but I looked into Agner Fog's instruction tables manual and, if I am not mistaken, the timings/latencies of mov, lea and or are equal.

So the actual question(s): Why do both compilers use a 5-byte instruction instead of a shorter 3- or 4-byte instruction? Would it be harmless to replace the mov reg,-1 with lea reg,[eax-1] or or reg,-1?

Answer

When optimizing for speed, mov reg, -1 is used instead of or reg, -1 because the former uses the register as a "write-only" operand, which the CPU knows about and uses to schedule the instruction efficiently (out of order). In contrast, or reg, -1, even though it always produces -1, is not recognized by the CPU as a dependency-breaking (write-only) instruction, so it has to wait for the previous value of reg.

To illustrate how this affects performance:

mov eax, [rdi]  # Imagine a cache-miss here.
mov [rsi], eax
mov eax, -1     # `mov eax, -1` is able to dispatch and execute without waiting
                # for the cache-miss to be served.
add eax, edx    # `add eax, edx` only needs to wait 1 cycle for `mov` to
                # complete (assuming `edx` is ready) and then it can
                # dispatch while cache-miss load from a few lines above
                # is still in progress.

Now compare this code:

mov eax, [rdi]   # Imagine a cache-miss here.
mov [rsi], eax
or eax, -1       # Now this instruction has to wait for the cache-miss
                 # load to complete.
add eax, edx     # And this one will be waiting too.

(The example applies to any current x86-64 CPU, such as Skylake/Ice Lake/Zen.)

If you are writing the code in assembly and are certain that a register is not part of a dependency chain that is currently in progress, you can use or reg, -1 and it will have no negative effect (if your assumptions are right, of course).

Because of that danger of accidentally attaching to a dependency chain, compilers generally do not use or reg, -1 to produce -1 when optimizing for speed.

When we need a zero instead of -1, we are in luck, because there are idioms that CPUs recognize, for example xor reg, reg and sub reg, reg. They are smaller in code size, and CPUs recognize that the result of the computation does not depend on the register's previous value (it is always zero).

These zero-idioms, on top of being smaller in code size, are also usually handled by the front-end of the CPU, so instructions that depend on the result can dispatch immediately.
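The contrast between the zero idiom and the -1 case is easy to reproduce on Compiler Explorer with two trivial functions. This is a sketch only: the function names are made up, and the assembly in the comments is what gcc and clang typically emit at -O2/-O3; the exact output may vary with compiler version and flags.

```c
/* Returning 0: compilers typically use the zero idiom, e.g.
       xor eax, eax
       ret                                                    */
int return_zero(void)
{
    return 0;
}

/* Returning -1: when optimizing for speed, compilers typically emit
       mov eax, -1
       ret
   because there is no recognized all-ones idiom for
   general-purpose registers. */
int return_minus_one(void)
{
    return -1;
}
```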

Zero-idioms also work for vector registers: vpxor xmm0, xmm0, xmm0 produces zero with no dependency on the previous value of xmm0 and with zero latency. Interestingly, vector registers also have a -1 idiom, vpcmpeqd xmm0, xmm0, xmm0. It is recognized as write-only (comparing a value with itself is always true), but it still has to execute (so it has a latency of 1 cycle), at least on SKL/Zen CPUs.
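From C, the vector idioms are usually reached through intrinsics rather than written by hand. Below is a minimal sketch using SSE2 intrinsics from emmintrin.h (x86 only); the helper names are made up, and which instruction the compiler actually picks for each constant is its own choice, so the comments describe typical output rather than a guarantee.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* All-zero vector; compilers typically emit pxor/vpxor xmm, xmm, xmm. */
static __m128i all_zeros(void)
{
    return _mm_setzero_si128();
}

/* All-ones vector; compilers commonly emit pcmpeqd/vpcmpeqd xmm, xmm, xmm
   instead of loading the constant from memory. */
static __m128i all_ones(void)
{
    return _mm_set1_epi32(-1);
}

/* Extract the lowest 32-bit lane so the result can be inspected. */
static int32_t lane0(__m128i v)
{
    return _mm_cvtsi128_si32(v);
}
```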

More about producing zeros: What is the best way to set a register to zero in x86 assembly: xor, mov or and?

Exactly which idioms are recognized can be found in Agner Fog's manuals or in the CPU vendors' optimization guides. TL;DR: general-purpose registers only have zero-idioms; vector registers have both zero-idioms and all-ones idioms.

See also: Set all bits in CPU register to 1 efficiently (mentions lea edx, [rax-1]).

A note on the actual function: as you can see from the assembly, most of the work actually goes into producing the specific constants that you requested.

If all you intend to do with -1, 0, 1 is branch on whether the result is negative/zero/positive, then it would be better to produce left - right and then just branch/cmov on that. This only works as long as you ensure there is no overflow, because overflow would make the subtraction result alone insufficient for the comparison; in that case, just use -1, 0, 1.
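A minimal sketch of the subtraction approach, under the assumption that the inputs are known to fit in 32 bits so that the widened 64-bit difference cannot overflow. The function names are hypothetical and not part of the original question.

```c
#include <stdint.h>

/* Three-way comparison by subtraction. The operands are widened to
   64 bits before subtracting, so the difference cannot overflow and
   its sign alone encodes less / equal / greater. */
int64_t compare_by_sub(int32_t left, int32_t right)
{
    return (int64_t)left - (int64_t)right;
}

/* The caller branches on the sign rather than on exact -1/0/1 values. */
const char *describe(int32_t left, int32_t right)
{
    int64_t d = compare_by_sub(left, right);
    if (d < 0)
        return "less";
    if (d > 0)
        return "greater";
    return "equal";
}
```

With full-range int64_t operands this shortcut is not safe: a difference such as INT64_MIN - 1 overflows, which is exactly the case where falling back to -1, 0, 1 is the better choice.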
