Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

Question

I am disassembling this code on llvm clang Apple LLVM version 8.0.0 (clang-800.0.42.1):

#include <stdio.h>

int main() {
    float a=0.151234;
    float b=0.2;
    float c=a+b;
    printf("%f", c);
}

I compiled with no -O option, but I also tried -O0 (which gives the same result) and -O2 (which computes the value at compile time and stores it precomputed).

The resulting disassembly is the following (I removed the parts that are not relevant):

->  0x100000f30 <+0>:  pushq  %rbp
    0x100000f31 <+1>:  movq   %rsp, %rbp
    0x100000f34 <+4>:  subq   $0x10, %rsp
    0x100000f38 <+8>:  leaq   0x6d(%rip), %rdi       
    0x100000f3f <+15>: movss  0x5d(%rip), %xmm0           
    0x100000f47 <+23>: movss  0x59(%rip), %xmm1        
    0x100000f4f <+31>: movss  %xmm1, -0x4(%rbp)  
    0x100000f54 <+36>: movss  %xmm0, -0x8(%rbp)
    0x100000f59 <+41>: movss  -0x4(%rbp), %xmm0         
    0x100000f5e <+46>: addss  -0x8(%rbp), %xmm0
    0x100000f63 <+51>: movss  %xmm0, -0xc(%rbp)
    ...

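Apparently it's doing the following:

  1. Loading the two floats into registers xmm0 and xmm1.
  2. Storing them to the stack.
  3. Loading one value (not the one xmm0 held earlier) from the stack into xmm0.
  4. Performing the addition.
  5. Storing the result back to the stack.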

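I find it inefficient because:

  1. Everything could be done in registers. I am not using a and b later, so it could skip any operation involving the stack.
  2. Even if it wanted to use the stack, it could avoid reloading xmm0 from the stack by doing the operations in a different order.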

Given that the compiler is always right, why did it choose this strategy?

Recommended answer

-O0 (unoptimized) is the default. It tells the compiler you want it to compile fast (short compile times), not to take extra time compiling to make efficient code.

(-O0 isn't literally no optimization; e.g. gcc will still eliminate code inside if(1 == 2){ } blocks. GCC in particular, more than most other compilers, still does things like using multiplicative inverses for division at -O0, because it still transforms your C source through multiple internal representations of the logic before eventually emitting asm.)
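
A minimal sketch of source that should show both effects when compiled with gcc -O0 -S. The function names div10 and dead_branch are made up for illustration, and the exact asm depends on the compiler and version, so treat the comments as what one would typically expect rather than a guarantee.

```c
/* Hypothetical test functions; the names are illustrative only.
   The point is to read the generated asm, not to run them. */

/* Division by a compile-time constant: GCC typically emits a multiply
   by a fixed-point reciprocal plus shifts instead of a div instruction,
   even at -O0. */
unsigned div10(unsigned x) {
    return x / 10;
}

/* The condition is a constant expression that is always false, so GCC
   typically emits no code for the body of the if, even at -O0. */
int dead_branch(int x) {
    if (1 == 2) {
        x += 42;
    }
    return x;
}
```

Neither function needs a main: the args are runtime values, so the compiler can't fold the whole computation away.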

Plus, "the compiler is always right" is an exaggeration even at -O3. Compilers are very good at a large scale, but minor missed optimizations are still common within single loops. The impact is often very low, but wasted instructions (or uops) in a loop can eat up space in the out-of-order execution reordering window, and make the code less hyper-threading friendly when sharing a core with another thread. See "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?" for more about beating the compiler in a simple specific case.

More importantly, -O0 also implies treating all variables similarly to volatile, for consistent debugging. That is, you can set a breakpoint or single-step and modify the value of a C variable, then continue execution and have the program work the way you'd expect from your C source running on the C abstract machine. So the compiler can't do any constant propagation or value-range simplification. (For example, an integer that's known to be non-negative can simplify things using it, or make some if conditions always true or always false.)

(It's not quite as bad as volatile: multiple references to the same variable within one statement don't always result in multiple loads; at -O0 compilers will still optimize somewhat within a single expression.)

Compilers have to specifically anti-optimize for -O0 by storing/reloading all variables to their memory address between statements. (In C and C++, every variable has an address unless it was declared with the (now obsolete) register keyword and has never had its address taken. Optimizing away the address is possible according to the as-if rule for other variables, but isn't done at -O0.)
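
As a rough illustration of the store/reload behaviour described above, here is a sketch that uses volatile locals to mimic -O0 code-gen in source form. The function names add_plain and add_like_O0 are hypothetical, and volatile is only an approximation: as noted earlier, -O0 is not quite as strict as volatile within a single expression.

```c
/* What you write. An optimizing build keeps everything in registers. */
float add_plain(float a, float b) {
    float c = a + b;
    return c;
}

/* Approximately what -O0 behaves like: every variable lives in memory
   and is stored, then reloaded, between statements. The volatile
   qualifier makes even an optimizing build keep those loads and stores. */
float add_like_O0(float a_arg, float b_arg) {
    volatile float a = a_arg;   /* spill the register args to the stack */
    volatile float b = b_arg;
    volatile float c = a + b;   /* reload a and b, add, store c */
    return c;                   /* reload c for the return value */
}
```

Compiling both with optimization enabled and comparing the asm should show the second function doing the same kind of memory traffic that the -O0 output contains.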

Unfortunately, debug-info formats can't track the location of a variable through registers, so fully consistent debugging isn't possible without this slow-and-stupid code-gen.

If you don't need this, you can compile with -Og for light optimization, and without the anti-optimizations required for consistent debugging. The GCC manual recommends it for the usual edit/compile/run cycle, but you will get "optimized out" for many local variables with automatic storage when debugging. Globals and function args still usually have their actual values, at least at function boundaries.

Even worse, -O0 makes code that still works even if you use GDB's jump command to continue execution at a different source line. So each C statement has to be compiled into a fully independent block of instructions. (See also: Is it possible to "jump"/"skip" in GDB debugger?)

for() loops can't be transformed into idiomatic (for asm) do{}while() loops, and other restrictions.
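
To make the for() versus do{}while() point concrete, here is a sketch of the transformation written out by hand in C. The function names sum_for and sum_do_while are made up for illustration; an optimizing compiler typically does this rewrite itself, so there is no reason to write the second form in real code.

```c
#include <stddef.h>

/* Source-level form: the condition is tested at the top of the loop. */
int sum_for(const int *arr, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

/* The shape that is idiomatic in asm: one check up front to skip an
   empty loop, then a single conditional branch at the bottom. */
int sum_do_while(const int *arr, size_t n) {
    int total = 0;
    size_t i = 0;
    if (n == 0) {
        return total;
    }
    do {
        total += arr[i];
        i++;
    } while (i < n);
    return total;
}
```

Both functions return the same result for every input; the difference is only in how many branches run per iteration once compiled.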

For all the above reasons, (micro-)benchmarking un-optimized code is a huge waste of time; the results depend on silly details of how you wrote the source that don't matter when you compile with normal optimization. -O0 vs. -O3 performance is not linearly related; some code will speed up much more than others.

The bottlenecks in -O0 code will often be different from those at -O3, often on a loop counter that's kept in memory, creating a ~6-cycle loop-carried dependency chain. This can create interesting effects in the compiler-generated asm, like those in "Adding a redundant assignment speeds up code when compiled without optimization" (which are interesting from an asm perspective, but not for C).
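
A sketch of what "a loop counter that's kept in memory" looks like, using volatile to force the same store/reload pattern in an optimized build. The function name sum_counter_in_memory is hypothetical, and volatile only approximates -O0 code-gen.

```c
/* The counter i is volatile, so every i++ is a load, an add, and a
   store, and the next iteration's load has to wait for that store.
   That store-to-load round trip is the loop-carried dependency chain,
   not the additions into total. */
unsigned long sum_counter_in_memory(unsigned long n) {
    volatile unsigned long i;
    unsigned long total = 0;
    for (i = 0; i < n; i++) {
        total += i;
    }
    return total;
}
```

With a plain (non-volatile) counter, an optimizing compiler would keep i in a register, and the bottleneck would move elsewhere.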

"My benchmark optimized away otherwise" is not a valid justification for looking at the performance of -O0 code. See "C loop optimization help for final assignment" for an example and more details about the rabbit hole that tuning for -O0 is.

If you want to see how the compiler adds 2 variables, write a function that takes args and returns a value. Remember you only want to look at the asm, not run it, so you don't need a main or any numeric literal values for anything that should be a runtime variable.

For more about this, see also: How to remove "noise" from GCC/clang assembly output?

float foo(float a, float b) {
    float c=a+b;
    return c;
}

With clang -O3 this compiles to just addss xmm0, xmm1 followed by ret. But with -O0 it spills the args to stack memory. (Godbolt uses debug info emitted by the compiler to colour-code asm instructions according to which C statement they came from. I've added line breaks to show blocks for each statement, but you can also see this with colour highlighting on Godbolt. It's often very handy for finding the interesting part of an inner loop in optimized compiler output.)

gcc -fverbose-asm will put comments on every line showing the operand names as C vars. In optimized code that's often an internal tmp name, but in un-optimized code it's usually an actual variable from the C source. I've manually commented the clang output because it doesn't do that.

# clang7.0 -O0  also on Godbolt
foo:
    push    rbp
    mov     rbp, rsp                  # make a traditional stack frame
    movss   DWORD PTR [rbp-20], xmm0  # spill the register args
    movss   DWORD PTR [rbp-24], xmm1  # into the red zone (below RSP)

    movss   xmm0, DWORD PTR [rbp-20]  # a
    addss   xmm0, DWORD PTR [rbp-24]  # +b
    movss   DWORD PTR [rbp-4], xmm0   # store c

    movss   xmm0, DWORD PTR [rbp-4]   # return c
    pop     rbp                       # epilogue
    ret

Fun fact: using register float c = a+b;, the return value can stay in XMM0 between statements, instead of being spilled/reloaded. The variable has no address. (I included that version of the function in the Godbolt link.)

The register keyword has no effect in optimized code (except making it an error to take a variable's address, like how const on a local stops you from accidentally modifying something). I don't recommend using it, but it's interesting to see that it does actually affect un-optimized code.
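
For reference, a sketch of the register variant discussed above. The function name foo_reg is made up here; the original Godbolt version is not reproduced in this text. This is C only, since the keyword was removed from C++ in C++17.

```c
/* Same function as foo, but c is declared with register, so it has no
   address. At -O0 the compiler may then keep c in XMM0 between the two
   statements instead of storing it to the stack and reloading it. */
float foo_reg(float a, float b) {
    register float c = a + b;
    /* float *p = &c;   would be a compile-time error in C */
    return c;
}
```

The args a and b are still spilled at -O0, because they are not declared register.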

  • Complex compiler output for simple constructor - every copy of a variable when passing args typically results in extra copies in the asm.
  • Why is this C++ wrapper class not being inlined away? __attribute__((always_inline)) can force inlining, but doesn't optimize away the copying to create the function args, let alone optimize the function into the caller.
