Passing an rvalue to a non-ref parameter: why can't the compiler elide the copy?


Question

struct Big {
    int a[8];
};
void foo(Big a);
Big getStuff();
void test1() {
    foo(getStuff());
}

compiles (using clang 6.0.0 for x86_64 on Linux, so the System V ABI; flags: -O3 -march=broadwell) to

test1():                              # @test1()
        sub     rsp, 72
        lea     rdi, [rsp + 40]
        call    getStuff()
        vmovups ymm0, ymmword ptr [rsp + 40]
        vmovups ymmword ptr [rsp], ymm0
        vzeroupper
        call    foo(Big)
        add     rsp, 72
        ret

If I am reading this correctly, this is what is happening:


  1. getStuff is passed a pointer into test1's stack frame (rsp + 40) to use for its return value, so after getStuff returns, rsp + 40 through rsp + 71 contains the result of getStuff.
  2. This result is then immediately copied to a lower stack address, rsp through rsp + 31.
  3. foo is then called, and it reads its argument from rsp.

Why is the following code not totally equivalent (and why doesn't the compiler generate it instead)?

test1():                              # @test1()
        sub     rsp, 32
        mov     rdi, rsp
        call    getStuff()
        call    foo(Big)
        add     rsp, 32
        ret

The idea is: have getStuff write directly to the place on the stack that foo will read from.

Also: here is the result for the same code (with 12 ints instead of 8) compiled by VC++ on Windows for x64. It seems even worse, because the Windows x64 ABI passes and returns large structs by reference, so the copy is completely unused!

_TEXT   SEGMENT
$T3 = 32
$T1 = 32
?bar@@YAHXZ PROC                    ; bar, COMDAT

$LN4:
    sub rsp, 88                 ; 00000058H

    lea rcx, QWORD PTR $T1[rsp]
    call    ?getStuff@@YA?AUBig@@XZ         ; getStuff
    lea rcx, QWORD PTR $T3[rsp]
    movups  xmm0, XMMWORD PTR [rax]
    movaps  XMMWORD PTR $T3[rsp], xmm0
    movups  xmm1, XMMWORD PTR [rax+16]
    movaps  XMMWORD PTR $T3[rsp+16], xmm1
    movups  xmm0, XMMWORD PTR [rax+32]
    movaps  XMMWORD PTR $T3[rsp+32], xmm0
    call    ?foo@@YAHUBig@@@Z           ; foo

    add rsp, 88                 ; 00000058H
    ret 0


Recommended answer

You're right; this looks like a missed optimization by the compiler. You can report it as a bug (https://bugs.llvm.org/) if there isn't already a duplicate.

Contrary to popular belief, compilers often don't produce optimal code. The result is usually good enough, and modern CPUs are quite good at plowing through excess instructions as long as those instructions don't lengthen dependency chains too much, especially the critical-path dependency chain if there is one.

x86-64 SysV passes large structs by value on the stack if they don't fit packed into two 64-bit integer registers, and returns them via a hidden pointer. The compiler can and should (but doesn't) plan ahead and reuse the return-value temporary as the stack args for the call to foo(Big).
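As a rough illustration of the size rule above, here is a minimal sketch that checks which of two structs would exceed the two-register limit. The helper `passed_in_memory_by_size` and the struct `Small` are hypothetical names introduced only for this example, and the check looks at size alone; the real x86-64 SysV classification also depends on member types and on whether the type is trivially copyable.

```cpp
#include <cstddef>
#include <cstdint>

struct Small { std::int64_t x, y; };  // 16 bytes: fits in two 64-bit registers
struct Big { int a[8]; };             // 32 bytes with 4-byte int, as in the question

// Hypothetical helper: size-only approximation of "too big for two
// 64-bit integer registers". Not the full ABI classification algorithm.
constexpr bool passed_in_memory_by_size(std::size_t size) {
    return size > 2 * sizeof(std::uint64_t);
}

static_assert(!passed_in_memory_by_size(sizeof(Small)),
              "Small fits in two integer registers");
static_assert(passed_in_memory_by_size(sizeof(Big)),
              "Big is passed on the stack and returned via hidden pointer");
```

Under this approximation, Big lands in memory both as an argument and as a return value, which is why reusing one stack slot for both looks attractive.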

gcc7.3, ICC18, and MSVC CL19 also miss this optimization. :/ I put your code up on the Godbolt compiler explorer with gcc/clang/ICC/MSVC. gcc uses 4x push qword [rsp+24] to copy, while ICC uses extra instructions to align the stack by 32.

For a function this small, using 1x 32-byte load/store instead of 2x 16-byte probably doesn't justify the cost of the vzeroupper for MSVC / ICC / clang. That said, vzeroupper is cheap on mainstream Intel CPUs (only 4 uops), and I did use -march=haswell to tune for those, not for AMD or KNL where it's more expensive.

Related: x86-64 Windows passes large structs by hidden pointer, as well as returning them that way. The callee owns the pointed-to memory. (See "What happens at assembly level when you have functions with large inputs".)

This optimization would still be available by simply reserving space for the temporary plus shadow space before the first call to getStuff(), and allowing the callee to destroy the temporary because we don't need it later.

Unfortunately, that's not actually what MSVC does here or in related cases.

See also @BeeOnRope's answer, and my comments on it, on "Why isn't pass struct by reference a common optimization?". Making sure the copy constructor can always run at a sane place for non-trivially-copyable objects is problematic if you're trying to design a calling convention that avoids copying by passing by hidden const reference (caller owns the memory, callee can copy if needed).

But this is an example of a case where a non-const reference (callee owns the memory) is best, because the caller wants to hand off the object to the callee.

There's a potential gotcha, though: if there are any pointers to this object, letting the callee use it directly could introduce bugs. Consider some other function that does global_pointer->a[4]=0;. If our callee calls that function, it will unexpectedly modify our callee's by-value arg.

So in the Windows x64 calling convention, letting the callee destroy our copy of the object only works if escape analysis can prove that nothing else has a pointer to this object.
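The aliasing gotcha can be sketched at the C++ level. This is only an illustration under stated assumptions: `clobber` and `value_seen_by_callee` are hypothetical names, and the example passes an lvalue so that a pointer to the caller's object can legitimately exist. It shows the by-value guarantee that a compiler has to preserve, and that would break if the caller's object were handed to the callee without a copy while `global_pointer` still pointed at it.

```cpp
#include <cassert>

struct Big { int a[8]; };

// Hypothetical global that still points at the caller's object.
Big* global_pointer = nullptr;

// Some unrelated function that writes through the escaped pointer.
void clobber() { global_pointer->a[4] = 0; }

// By-value parameter: the language guarantees arg is a distinct object,
// so the write inside clobber() must not be visible through arg.
int foo(Big arg) {
    clobber();
    return arg.a[4];
}

// Returns the value foo observed in its own parameter after clobber() ran.
int value_seen_by_callee() {
    Big original{};
    original.a[4] = 42;
    global_pointer = &original;   // the pointer "escapes" here
    int seen = foo(original);     // foo works on a copy
    assert(original.a[4] == 0);   // the caller's object was modified
    global_pointer = nullptr;
    return seen;
}
```

Calling `value_seen_by_callee()` yields 42: the callee's copy is unaffected even though the caller's object was overwritten. If the copy were skipped while the pointer was still live, foo would observe 0 instead, which is why the compiler needs escape analysis before handing its own object to the callee.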
