Adding a redundant assignment speeds up code when compiled without optimization


Problem description

I've found an interesting phenomenon:

#include <stdio.h>
#include <time.h>

int main() {
    int p, q;
    clock_t s, e;
    s = clock();
    for (int i = 1; i < 1000; i++) {
        for (int j = 1; j < 1000; j++) {
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q = p;  // Removing this line can increase running time.
            }
        }
    }
    e = clock();
    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf\n", t);
    return 0;
}

I use GCC 7.3.0 on an i5-5257U running macOS to compile the code without any optimization. I averaged the run time over 10 runs; other people have also tested the case on other Intel platforms and got the same result.

I posted the assembly generated by GCC here. The only difference between the two assembly listings is that, before addl $1, -12(%rbp), the faster one has two extra instructions:

movl    -44(%rbp), %eax    # load p from its stack slot
movl    %eax, -48(%rbp)    # store it to q's stack slot (the q = p statement)
So why does the program run faster with such an assignment?

Peter's answer is very helpful. Tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show the opposite result, which supports the idea that the store-forwarding speedup is specific to some Intel CPUs.

BeeOnRope's comments and advice drove me to rewrite the question. :)

The core of this question is the interesting phenomenon, which is related to processor architecture and assembly, so I think it is worth discussing.

Recommended answer

TL;DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variable. Other examples: hyperthreading, calling an empty function, accessing vars through pointers.

None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.

You're benchmarking a debug build, which is basically useless.

But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)

If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.

You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
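As a rough C-level sketch of that behavior (an illustration only, not GCC's literal output: volatile is standing in for the way -O0 keeps every local in memory), each statement becomes load -> compute -> store:

/* Illustrative sketch: -O0 treats locals much like volatile, forcing
 * every access through the stack instead of keeping values in registers. */
static void inner_like_O0(int i, int j) {
    volatile int p, q, k;                /* every access is a memory access */
    for (k = 1; k < 1000; k = k + 1) {   /* load k, compare; load, add, store */
        p = i + j * k;                   /* load k; multiply/add; store p */
        q = p;                           /* load p; store q */
    }
}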

The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs, so your inner loop is limited to running once per ~6 cycles: the latency of a memory-destination add.

If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.

So with more work in the loop, that addl $1, -12(%rbp), which sustains only one iteration per ~6 cycles when run back-to-back, might instead bottleneck the loop at one iteration per 4 or 5 cycles.

This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.

Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.

The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming.)

Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals.)

Does this mean adding useless statements can speed up real programs? In general, no. Compilers keep loop variables in registers for the innermost loops, and useless statements will actually optimize away with optimization enabled.

Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.

Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.

Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, at the risk of making things worse on some other CPUs. See the discussion in comments.

