Passing rvalue to non-ref parameter, why can't the compiler elide the copy?
Question
struct Big {
int a[8];
};
void foo(Big a);
Big getStuff();
void test1() {
foo(getStuff());
}
compiles (using clang 6.0.0 for x86_64 on Linux, so System V ABI; flags: -O3 -march=broadwell) to
test1(): # @test1()
sub rsp, 72
lea rdi, [rsp + 40]
call getStuff()
vmovups ymm0, ymmword ptr [rsp + 40]
vmovups ymmword ptr [rsp], ymm0
vzeroupper
call foo(Big)
add rsp, 72
ret
If I am reading this correctly, this is what is happening:
- getStuff is passed a pointer into test1's stack frame (rsp + 40) to use for its return value, so after getStuff returns, rsp + 40 through rsp + 71 contains the result of getStuff.
- This result is then immediately copied to a lower stack address, rsp through rsp + 31.
- foo is then called, and reads its argument from rsp.
Why is the following code not totally equivalent (and why doesn't the compiler generate it instead)?
test1(): # @test1()
sub rsp, 32
mov rdi, rsp
call getStuff()
call foo(Big)
add rsp, 32
ret
The idea is: have getStuff write directly to the place in the stack that foo will read from.
Also: here is the result for the same code (with 12 ints instead of 8) compiled by VC++ on Windows for x64, which seems even worse because the Windows x64 ABI passes and returns by reference, so the copy is completely unused!
_TEXT SEGMENT
$T3 = 32
$T1 = 32
?bar@@YAHXZ PROC ; bar, COMDAT
$LN4:
sub rsp, 88 ; 00000058H
lea rcx, QWORD PTR $T1[rsp]
call ?getStuff@@YA?AUBig@@XZ ; getStuff
lea rcx, QWORD PTR $T3[rsp]
movups xmm0, XMMWORD PTR [rax]
movaps XMMWORD PTR $T3[rsp], xmm0
movups xmm1, XMMWORD PTR [rax+16]
movaps XMMWORD PTR $T3[rsp+16], xmm1
movups xmm0, XMMWORD PTR [rax+32]
movaps XMMWORD PTR $T3[rsp+32], xmm0
call ?foo@@YAHUBig@@@Z ; foo
add rsp, 88 ; 00000058H
ret 0
Recommended answer
You're right; this looks like a missed optimization by the compiler. You can report this bug (https://bugs.llvm.org/) if there isn't already a duplicate.
Contrary to popular belief, compilers often don't make optimal code. It's often good enough, though, and modern CPUs are quite good at plowing through excess instructions when they don't lengthen dependency chains too much, especially the critical-path dependency chain if there is one.
x86-64 SysV passes large structs by value on the stack if they don't fit packed into two 64-bit integer registers, and returns them via a hidden pointer. The compiler can and should (but doesn't) plan ahead and reuse the return-value temporary as the stack args for the call to foo(Big).
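To make the missed optimization concrete, here is a minimal source-level sketch of the two lowerings. The names getStuff_lowered, foo_lowered, test1_as_compiled and test1_wanted are hypothetical, and so are the values getStuff_lowered writes. The explicit pointers only model the hidden return pointer and the outgoing stack-arg area (a real SysV callee reads its by-value arg at a fixed stack offset, not through a pointer), and foo_lowered returns an int purely so the result can be observed.

```cpp
#include <cassert>

struct Big {
    int a[8];
};

// More than 16 bytes, so x86-64 SysV passes it in memory and returns it
// through a hidden pointer.
static_assert(sizeof(Big) > 16, "Big must not fit in two 64-bit registers");

// Hypothetical stand-in for getStuff(): the hidden return pointer made explicit.
void getStuff_lowered(Big* ret_slot) {
    for (int i = 0; i < 8; ++i) ret_slot->a[i] = i;
}

// Hypothetical stand-in for foo(Big): arg_slot models the caller's stack-arg area.
int foo_lowered(const Big* arg_slot) {
    int sum = 0;
    for (int i = 0; i < 8; ++i) sum += arg_slot->a[i];
    return sum;
}

// Roughly what the clang output in the question does.
int test1_as_compiled() {
    Big ret_tmp;                 // the return-value temporary at [rsp + 40]
    getStuff_lowered(&ret_tmp);
    Big arg = ret_tmp;           // the 32-byte copy down to [rsp]
    return foo_lowered(&arg);
}

// What the question asks for: one slot is both return temporary and argument.
int test1_wanted() {
    Big arg;
    getStuff_lowered(&arg);
    return foo_lowered(&arg);
}

// The two lowerings are observably identical.
void check_equivalence() {
    assert(test1_as_compiled() == test1_wanted());
}
```

Nothing else can observe ret_tmp between the two calls, which is why dropping the copy would be legal here.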
gcc7.3, ICC18, and MSVC CL19 also miss this optimization. :/ I put your code up on the Godbolt compiler explorer with gcc/clang/ICC/MSVC. gcc uses 4x push qword [rsp+24] to copy, while ICC uses extra instructions to align the stack by 32.
Using 1x 32-byte load/store instead of 2x 16-byte probably doesn't justify the cost of the vzeroupper for MSVC / ICC / clang, for a function this small. vzeroupper is cheap on mainstream Intel CPUs (only 4 uops), and I did use -march=haswell to tune for that, not for AMD or KNL where it's more expensive.
Related: x86-64 Windows passes large structs by hidden pointer, as well as returning them that way. The callee owns the pointed-to memory. (See "What happens at assembly level when you have functions with large inputs".)
This optimization would still be available by simply reserving space for the temporary + shadow space before the first call to getStuff(), and allowing the callee to destroy the temporary because we don't need it later.
Unfortunately, that's not actually what MSVC does here or in related cases.
See also @BeeOnRope's answer, and my comments on it, on "Why isn't pass struct by reference a common optimization?". Making sure the copy-constructor can always run at a sane place for non-trivially-copyable objects is problematic if you're trying to design a calling convention that avoids copying by passing by hidden const-reference (caller owns the memory, callee can copy if needed).
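As a rough source-level analogy for that "hidden const-reference" design (a sketch only, not any real ABI), an explicit const Big& parameter behaves the same way: the caller keeps ownership, a read-only callee never copies, and a callee that needs a modifiable object makes the copy itself. The function names sum_ends, bump_first and check_caller_object_untouched are hypothetical.

```cpp
#include <cassert>

struct Big {
    int a[8];
};

// Read-only callee: works directly on the caller's object, no copy made.
int sum_ends(const Big& arg) {
    return arg.a[0] + arg.a[7];
}

// Callee that needs a private, modifiable object: it pays for the copy itself.
int bump_first(const Big& arg) {
    Big local = arg;     // copy only on the path that needs one
    local.a[0] += 1;
    return local.a[0];
}

// The caller still owns its object and sees it unchanged after both calls.
void check_caller_object_untouched() {
    Big b{};
    b.a[0] = 1;
    b.a[7] = 2;
    assert(sum_ends(b) == 3);
    assert(bump_first(b) == 2);
    assert(b.a[0] == 1);
}
```

For a trivially copyable struct like Big the copy can happen anywhere; the difficulty mentioned above is with types whose copy-constructor has observable effects.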
But this is an example of a case where non-const reference (callee owns the memory) is best, because the caller wants to hand off the object to the callee.
There's a potential gotcha, though: if there are any pointers to this object, letting the callee use it directly could introduce bugs. Consider some other function that does global_pointer->a[4]=0;. If our callee calls that function, it will unexpectedly modify our callee's by-value arg.
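A small sketch of that hazard, with hypothetical names (clobber, foo_by_value, foo_shared): foo_shared takes a reference to model a calling convention that hands the caller's object straight to the callee, while foo_by_value has normal C++ by-value semantics.

```cpp
#include <cassert>

struct Big {
    int a[8];
};

Big* global_pointer = nullptr;

// The "other function" from the text: writes through an escaped pointer.
void clobber() {
    global_pointer->a[4] = 0;
}

// True by-value semantics: arg is a private copy, so clobber() cannot touch it.
int foo_by_value(Big arg) {
    clobber();
    return arg.a[4];
}

// Models a callee that is handed the caller's object directly, without a copy.
int foo_shared(Big& arg) {
    clobber();
    return arg.a[4];
}

int run_by_value() {
    Big obj{};
    obj.a[4] = 42;
    global_pointer = &obj;       // a pointer to the object escapes
    int seen = foo_by_value(obj);
    global_pointer = nullptr;    // don't leave a dangling pointer behind
    return seen;                 // the callee's copy kept its 42
}

int run_shared() {
    Big obj{};
    obj.a[4] = 42;
    global_pointer = &obj;
    int seen = foo_shared(obj);
    global_pointer = nullptr;
    return seen;                 // the callee saw the write through global_pointer
}

// Skipping the copy changes what the callee observes.
void check_behaviour_differs() {
    assert(run_by_value() != run_shared());
}
```

The two callers are identical apart from the callee they pick, yet the callee sees different values, which is exactly what the optimization must not allow unless no pointer to the object has escaped.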
So letting the callee destroy our copy of the object in the Windows x64 calling convention only works if escape analysis can prove that nothing else has a pointer to this object.