Out-of-order execution in C#

Problem description

I have the following snippet:

static long F(long a, long b, long c, long d) 
{
    return a + b + c + d;
}

which produces:

<Program>$.<<Main>$>g__F|0_0(Int64, Int64, Int64, Int64)
    L0000: add rdx, rcx
    L0003: lea rax, [rdx+r8]
    L0007: add rax, r9
    L000a: ret

If I understand this manual (§ Out of order execution) correctly, the code above translates to ((a + b) + c) + d. To compute this, the CPU has to wait for the first parenthesis, then for the second, and so on. Here the LEA sits in the middle, which means the additions can't be executed in parallel (if I understood that correctly). So what the writer suggests is:

Write the parentheses in independent "pairs":

static long G(long a, long b, long c, long d) 
{
    return (a + b) + (c + d);
}

But this generates the same assembly:

<Program>$.<<Main>$>g__G|0_1(Int64, Int64, Int64, Int64)
    L0000: add rdx, rcx
    L0003: lea rax, [rdx+r8]
    L0007: add rax, r9
    L000a: ret

In contrast, this is what GCC (-O2) generates for C code:

int64_t
f(int64_t a, int64_t b, int64_t c, int64_t d) {
        return a + b + c + d;
}

int64_t
g(int64_t a, int64_t b, int64_t c, int64_t d) {
        return (a + b) + (c + d);
}

Here is the output:

f: 
        add     rcx, rdx        ; I guess -O2 did the job for me.
        add     rcx, r8         ; I guess -O2 did the job for me.
        lea     rax, [rcx+r9]
        ret
g:
        add     rcx, rdx
        add     r8, r9
        lea     rax, [rcx+r8]
        ret

Question

  • Did I understand the manual correctly? Should the 2 ADDs come together (no LEA in the middle)? If yes, how can I hint the C# compiler not to ignore my parentheses?
  • Here is the SharpLab link.
  • Here is the Godbolt link.

Recommended answer

Integer addition is associative. Compilers can take advantage of this (the "as-if rule"), regardless of the source-level order of operations.

(Unfortunately, most compilers seem to do a bad job at this, and make things worse even if you write your source cleverly.)

There's no side effect for integer overflow in asm; even on targets like MIPS, where add traps on signed overflow, compilers use addu, which doesn't trap, so they can optimize. (In C, compilers can assume the source-level order of operations never overflows, because that would be undefined behaviour. So they could use a trapping add on ISAs that have one, for calculations that happen with the same inputs in the C abstract machine. But even though gcc -fwrapv, which gives signed-integer overflow well-defined 2's complement wraparound behaviour, is not the default, compilers do use instructions that may wrap silently rather than trap, mostly so they don't have to care whether any given operation is on values that appear in the C abstract machine. UB doesn't mean required-to-fault; -fsanitize=undefined takes extra code to make that happen.)

e.g. INT_MAX + INT_MIN + 1 could be evaluated as INT_MAX + 1 (overflowing to INT_MIN), then adding INT_MIN to that result, which overflows back to 0 on a 2's complement machine. Or it could be evaluated in source order with no overflows. The final result is the same, and that's all that's logically visible from the operation.
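The INT_MAX example can be checked without invoking signed-overflow UB by doing the same arithmetic on unsigned 32-bit values, which wrap modulo 2^32 and have the same bit patterns as 2's complement integers. This is only a sketch: the function names are made up for illustration, and it assumes a platform where unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Source order: (a + b) + c. Unsigned arithmetic wraps, so this is not UB. */
static uint32_t sum_source_order(uint32_t a, uint32_t b, uint32_t c)
{
    return (a + b) + c;
}

/* Re-associated order: (a + c) + b. The temporary differs, the result does not. */
static uint32_t sum_reassociated(uint32_t a, uint32_t b, uint32_t c)
{
    return (a + c) + b;
}

/* Hypothetical self-check for the INT_MAX + INT_MIN + 1 case. */
static void check_int_max_example(void)
{
    uint32_t int_max = (uint32_t)INT32_MAX; /* bit pattern 0x7FFFFFFF */
    uint32_t int_min = (uint32_t)INT32_MIN; /* bit pattern 0x80000000 */

    /* INT_MAX + 1 wraps to the bit pattern of INT_MIN. */
    assert(int_max + 1u == int_min);

    /* Both evaluation orders end at 0. */
    assert(sum_source_order(int_max, int_min, 1u) == 0u);
    assert(sum_reassociated(int_max, int_min, 1u) == 0u);
}
```

Calling check_int_max_example() should pass silently; only the intermediate temporaries differ between the two orders, which is why a compiler is free to pick either.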

CPUs with out-of-order exec don't try to re-associate instructions, though; they follow the dependency graph of the asm / machine code.

(For one thing, that's too much for hardware to consider on the fly. For another, the FLAGS output of each operation does depend on which temporaries you create, and an interrupt could arrive at any point, so the proper architectural state needs to be recoverable at instruction boundaries when all older instructions have finished. That means it's the compiler's job to expose instruction-level parallelism in the asm, not the hardware's job to use math to create it. See also Modern Microprocessors: A 90-Minute Guide! and this answer.)
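As a rough illustration of what "exposing ILP" means at the source level, here is a sketch that sums an array with one dependency chain versus two independent chains. The function names are hypothetical, and a real compiler may well unroll, vectorize, or re-associate either loop on its own (unsigned addition is associative), so the source shape is a hint, not a guarantee about the generated asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the result of the previous add. */
static uint64_t sum_single_chain(const uint64_t *v, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Two accumulators: two independent chains that are combined once at the end. */
static uint64_t sum_two_chains(const uint64_t *v, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += v[i];     /* chain 0 */
        acc1 += v[i + 1]; /* chain 1, independent of chain 0 */
    }
    if (i < n)            /* odd element count: fold in the last element */
        acc0 += v[i];
    return acc0 + acc1;
}

/* Hypothetical self-check: both shapes compute the same sum. */
static void check_sum_shapes(void)
{
    const uint64_t v[] = {1, 2, 3, 4, 5};
    assert(sum_single_chain(v, 5) == 15);
    assert(sum_two_chains(v, 5) == 15);
}
```

The two functions return the same value for any input; the difference is only in how long the chain of dependent additions is if the code is compiled as written.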

In practice, compilers mostly handle this badly, shooting themselves in the foot and pessimizing your attempt at this source-level optimization, at least in this case.

  • C#: removes ILP even if it exists in the source; serializes (a+b) + (c+d) into one linear chain of operations, with 3 cycles of latency.

  • clang 12.0: same, serializes both versions.

  • MSVC: same, serializes both versions.

  • GCC 11.1 for signed int64_t: preserves the source order of operations. It's a longstanding GCC missed-optimization bug that its optimizer avoids introducing signed overflow even in temporaries, for some reason. That is backwards: something being UB in the abstract machine creates optimization opportunities for a concrete implementation that only has to run as if on the abstract machine; it does not add restrictions. GCC does know it can auto-vectorize int addition; it's only reordering within a scalar expression where some overly conservative check lumps signed integer in with floating-point as non-associative.

  • GCC 11.1 for uint64_t or with -fwrapv: treats addition as associative and compiles f and g the same way. It serializes with most tuning options (including for other ISAs like MIPS or PowerPC), but -march=znver1 happens to create ILP. (This does not mean that only AMD Zen is superscalar; it means GCC has missed-optimization bugs!)

  • ICC 2021.1.2: creates ILP even in the linear source version (f), but uses add/mov instead of LEA as the final step. :/

Godbolt for clang/MSVC/ICC.

Godbolt for GCC signed / unsigned or with -fwrapv.

The ideal is to start with two independent additions, then combine the pair of results. One of those three additions should be done with an lea to get the result into RAX, but it can be any of the three. In a stand-alone function, you're allowed to destroy any of the incoming arg-passing registers, and there's no real reason to avoid overwriting two of them instead of just one.

You only want one LEA, because a 2-register addressing mode makes it a longer instruction than an ADD.
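The two dataflow shapes discussed above can be written out with explicit temporaries. This is a sketch with hypothetical function names, using uint64_t so that wraparound is well-defined; as the compiler survey shows, the compiler remains free to emit the same code for both, so check the generated asm rather than assuming the source shape survives.

```c
#include <assert.h>
#include <stdint.h>

/* Linear chain: three adds, each needing the previous result. */
static uint64_t sum_chain(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t t = a + b;
    t += c; /* waits for a + b */
    t += d; /* waits for (a + b) + c */
    return t;
}

/* Tree: lo and hi do not depend on each other; one final add combines them.
   This is the shape of the g output quoted in the question:
       add rcx, rdx
       add r8, r9
       lea rax, [rcx+r8]                                                  */
static uint64_t sum_tree(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t lo = a + b;
    uint64_t hi = c + d; /* independent of lo */
    return lo + hi;
}

/* Hypothetical self-check: both shapes agree, including on wraparound. */
static void check_chain_vs_tree(void)
{
    assert(sum_chain(1, 2, 3, 4) == 10);
    assert(sum_tree(1, 2, 3, 4) == 10);
    assert(sum_chain(UINT64_MAX, 1, 2, 3) == sum_tree(UINT64_MAX, 1, 2, 3));
}
```

If each addition takes one cycle, the chain has three dependent steps and the tree has two, which is where the latency difference comes from.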
