Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

Question

I am disassembling this code on llvm clang Apple LLVM version 8.0.0 (clang-800.0.42.1):

#include <stdio.h>

int main() {
    float a=0.151234;
    float b=0.2;
    float c=a+b;
    printf("%f", c);
}

I compiled with no -O option, but I also tried -O0 (which gives the same output) and -O2 (which computes the value at compile time and stores the precomputed result).

The resulting disassembly is the following (I removed the parts that are not relevant):

->  0x100000f30 <+0>:  pushq  %rbp
    0x100000f31 <+1>:  movq   %rsp, %rbp
    0x100000f34 <+4>:  subq   $0x10, %rsp
    0x100000f38 <+8>:  leaq   0x6d(%rip), %rdi       
    0x100000f3f <+15>: movss  0x5d(%rip), %xmm0           
    0x100000f47 <+23>: movss  0x59(%rip), %xmm1        
    0x100000f4f <+31>: movss  %xmm1, -0x4(%rbp)  
    0x100000f54 <+36>: movss  %xmm0, -0x8(%rbp)
    0x100000f59 <+41>: movss  -0x4(%rbp), %xmm0         
    0x100000f5e <+46>: addss  -0x8(%rbp), %xmm0
    0x100000f63 <+51>: movss  %xmm0, -0xc(%rbp)
    ...

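Apparently it's doing the following:

  1. Loading the two floats into registers xmm0 and xmm1.
  2. Storing them to the stack.
  3. Loading one value (not the one xmm0 held earlier) from the stack into xmm0.
  4. Performing the addition.
  5. Storing the result back to the stack.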

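I find it inefficient because:

  1. Everything can be done in registers. I am not using a and b later, so it could skip any operation involving the stack.
  2. Even if it wanted to use the stack, it could avoid reloading xmm0 from the stack by doing the operations in a different order.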

Given that the compiler is always right, why did it choose this strategy?

Recommended answer

-O0 (unoptimized) is the default. It tells the compiler you want it to compile fast (short compile times), not to take extra time compiling to make efficient code.

(-O0 isn't literally no optimization; e.g. gcc will still eliminate code inside if(1 == 2){ } blocks. GCC, more than most other compilers, still does things like using multiplicative inverses for division at -O0, because it still transforms your C source through multiple internal representations of the logic before eventually emitting asm.)
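
As a rough illustration of the two behaviours mentioned above, here is a minimal sketch. The function names never_taken and div_by_10 are made up for this example, and what a particular compiler version emits should be checked in its asm output (e.g. with gcc -O0 -S) rather than assumed.

```c
/* Hypothetical names; the interesting part is the asm, not the result. */

/* The condition is a constant expression, so gcc typically emits
   no code for the body even at -O0. */
int never_taken(int x) {
    if (1 == 2) {
        x += 1000;   /* dead code */
    }
    return x;
}

/* Division by a compile-time constant: gcc typically emits a multiply
   by a fixed-point reciprocal plus shifts instead of a div instruction,
   even at -O0. */
unsigned div_by_10(unsigned x) {
    return x / 10;
}
```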

Plus, "the compiler is always right" is an exaggeration even at -O3. Compilers are very good at a large scale, but minor missed optimizations are still common within single loops. These often have very low impact, but wasted instructions (or uops) in a loop can eat up space in the out-of-order execution reordering window, and make the code less hyper-threading friendly when sharing a core with another thread. See "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?" for more about beating the compiler in a simple specific case.

More importantly, -O0 also implies treating all variables similar to volatile for consistent debugging, i.e. so you can set a breakpoint or single-step and modify the value of a C variable, and then continue execution and have the program work the way you'd expect from your C source running on the C abstract machine. So the compiler can't do any constant-propagation or value-range simplification. (e.g. an integer that's known to be non-negative can simplify things using it, or make some if conditions always true or always false.)

(It's not quite as bad as volatile: multiple references to the same variable within one statement don't always result in multiple loads; at -O0 compilers will still optimize somewhat within a single expression.)
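
To make the comparison with volatile concrete, here is a minimal sketch that forces -O0-style stores and reloads even in an optimized build. The names add_like_O0 and add_plain are hypothetical, and the asm a compiler produces for them may differ in detail from real -O0 output.

```c
/* Roughly the memory traffic -O0 produces, forced here with volatile
   so it stays visible at -O2 or -O3. Names are hypothetical. */
float add_like_O0(float a_in, float b_in) {
    volatile float a = a_in;   /* store a to its stack slot */
    volatile float b = b_in;   /* store b to its stack slot */
    volatile float c = a + b;  /* reload a and b, add, store c */
    return c;                  /* reload c for the return value */
}

/* Without volatile, an optimizing compiler is free to keep
   everything in registers and emit a single addss. */
float add_plain(float a, float b) {
    float c = a + b;
    return c;
}
```

Compiling both with optimization enabled and comparing the asm should show the stores and reloads only in the volatile version.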

Compilers have to specifically anti-optimize for -O0 by storing/reloading all variables to their memory address between statements. (In C and C++, every variable has an address unless it was declared with the (now obsolete) register keyword and has never had its address taken. Optimizing away the address is possible according to the as-if rule for other variables, but isn't done at -O0.)

Unfortunately, debug-info formats can't track the location of a variable through registers, so fully consistent debugging isn't possible without this slow-and-stupid code-gen.

If you don't need this, you can compile with -Og for light optimization, and without the anti-optimizations required for consistent debugging. The GCC manual recommends it for the usual edit/compile/run cycle, but you will get "optimized out" for many local variables with automatic storage when debugging. Globals and function args still usually have their actual values, at least at function boundaries.

Even worse, -O0 makes code that still works even if you use GDB's jump command to continue execution at a different source line. So each C statement has to be compiled into a fully independent block of instructions. (See "Is it possible to "jump"/"skip" in GDB debugger?")

for() loops can't be transformed into idiomatic (for asm) do{}while() loops, among other restrictions.
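
The loop transformation in question can be sketched at the source level. This is only an illustration with made-up function names; an optimizing compiler does this rewrite internally, not on your C source.

```c
/* Source-level shape: the condition is checked at the top of
   every iteration. */
int sum_for(const int *arr, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

/* The shape optimizing compilers usually aim for in asm: one
   up-front check, then a loop body with a single conditional
   branch at the bottom. */
int sum_do_while(const int *arr, int n) {
    int total = 0;
    int i = 0;
    if (i < n) {
        do {
            total += arr[i];
            i++;
        } while (i < n);
    }
    return total;
}
```

Both functions return the same result for any n, including n <= 0; only the branch structure differs.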

For all the above reasons, (micro-)benchmarking un-optimized code is a huge waste of time; the results depend on silly details of how you wrote the source that don't matter when you compile with normal optimization. -O0 vs. -O3 performance is not linearly related; some code will speed up much more than others.
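
A common reason for benchmarking at -O0 is that the optimizer deletes the work being timed. One alternative, sketched here under the assumption that a volatile store is an acceptable cost in the timed region, is to give the result an observable use. The names sink, sum_array and benchmark_body are hypothetical.

```c
/* A volatile sink gives the result an observable use, so an
   optimizing compiler cannot discard the computation entirely. */
static volatile float sink;

float sum_array(const float *data, int n) {
    float total = 0.0f;
    for (int i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

void benchmark_body(const float *data, int n) {
    sink = sum_array(data, n);   /* the store to a volatile must happen */
}
```

The inputs should still be runtime values: if the compiler can see constant data, it may compute the sum at compile time and store only the final number.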

The bottlenecks in -O0 code will often be different from -O3, often on a loop counter that's kept in memory, creating a ~6-cycle loop-carried dependency chain. This can create interesting effects in the compiler-generated asm like "Adding a redundant assignment speeds up code when compiled without optimization" (which are interesting from an asm perspective, but not for C.)
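
The loop-counter bottleneck can be imitated in an optimized build with a volatile counter. This is a minimal sketch with hypothetical names, and the actual latency depends on the CPU.

```c
/* The volatile counter forces a store and a reload of i on every
   iteration, like -O0 does for an ordinary int, so each increment
   waits for the previous store to be forwarded to the next load. */
unsigned long count_in_memory(unsigned long n) {
    volatile unsigned long i;
    unsigned long iterations = 0;
    for (i = 0; i < n; i++) {
        iterations++;
    }
    return iterations;
}

/* Same loop with an ordinary counter; in an optimized build it lives
   in a register, or the whole loop is replaced by its closed-form
   result. */
unsigned long count_in_register(unsigned long n) {
    unsigned long iterations = 0;
    for (unsigned long i = 0; i < n; i++) {
        iterations++;
    }
    return iterations;
}
```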

"My benchmark optimized away otherwise" is not a valid justification for looking at the performance of -O0 code. See "C loop optimization help for final assignment" for an example and more details about the rabbit hole that tuning for -O0 is.

If you want to see how the compiler adds 2 variables, write a function that takes args and returns a value. Remember you only want to look at the asm, not run it, so you don't need a main or any numeric literal values for anything that should be a runtime variable.

See also How to remove "noise" from GCC/clang assembly output? for more about this.

float foo(float a, float b) {
    float c=a+b;
    return c;
}

compiles with clang -O3 (on the Godbolt compiler explorer) to the expected

    addss   xmm0, xmm1
    ret

But with -O0 it spills the args to stack memory. (Godbolt uses debug info emitted by the compiler to colour-code asm instructions according to which C statement they came from. I've added line breaks to show blocks for each statement, but you can see this with colour highlighting on the Godbolt link above. Often very handy for finding the interesting part of an inner loop in optimized compiler output.)

gcc -fverbose-asm will put comments on every line showing the operand names as C vars. In optimized code that's often an internal tmp name, but in un-optimized code it's usually an actual variable from the C source. I've manually commented the clang output because it doesn't do that.

# clang7.0 -O0  also on Godbolt
foo:
    push    rbp
    mov     rbp, rsp                  # make a traditional stack frame
    movss   DWORD PTR [rbp-20], xmm0  # spill the register args
    movss   DWORD PTR [rbp-24], xmm1  # into the red zone (below RSP)

    movss   xmm0, DWORD PTR [rbp-20]  # a
    addss   xmm0, DWORD PTR [rbp-24]  # +b
    movss   DWORD PTR [rbp-4], xmm0   # store c

    movss   xmm0, DWORD PTR [rbp-4]   # return c
    pop     rbp                       # epilogue
    ret

Fun fact: using register float c = a+b;, the return value can stay in XMM0 between statements, instead of being spilled/reloaded. The variable has no address. (I included that version of the function in the Godbolt link.)

The register keyword has no effect in optimized code (except making it an error to take a variable's address, like how const on a local stops you from accidentally modifying something). I don't recommend using it, but it's interesting to see that it does actually affect un-optimized code.
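
The register version of the function is mentioned above but not shown. A guess at what it looks like, with the hypothetical name foo_register:

```c
/* Hypothetical variant of foo() using the register storage class. */
float foo_register(float a, float b) {
    register float c = a + b;   /* c has no address, so -O0 need not spill it */
    /* float *p = &c;              would be a compile-time error in C */
    return c;
}
```

Note that this is C: the register keyword was removed from C++ in C++17.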

Related:

  • "Complex compiler output for simple constructor": every copy of a variable when passing args typically results in extra copies in the asm.
  • "Why is this C++ wrapper class not being inlined away?": __attribute__((always_inline)) can force inlining, but doesn't optimize away the copying to create the function args, let alone optimize the function into the caller.
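
The always_inline point can be tried with a small C sketch. The attribute is a GNU extension supported by gcc and clang; the function names add_one and caller are hypothetical, and the exact -O0 output should be checked on your own compiler.

```c
/* GNU C extension (gcc and clang); names are hypothetical. */
static inline __attribute__((always_inline)) int add_one(int x) {
    return x + 1;
}

int caller(int v) {
    /* Even at -O0 the call itself is inlined, but v is typically
       still copied through stack slots to set up the inlined
       function's argument. */
    return add_one(v);
}
```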
