Adding a redundant assignment speeds up code when compiled without optimization


Question

I find an interesting phenomenon:

#include<stdio.h>
#include<time.h>

int main() {
    int p, q;
    clock_t s,e;
    s=clock();
    for(int i = 1; i < 1000; i++){
        for(int j = 1; j < 1000; j++){
            for(int k = 1; k < 1000; k++){
                p = i + j * k;
                q = p;  //Removing this line can increase running time.
            }
        }
    }
    e = clock();
    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf
", t);
    return 0;
}

I compiled the code with GCC 7.3.0 on an i5-5257U (Mac OS) without any optimization, and averaged the run time over 10 runs. Other people have also tested the case on other Intel platforms and gotten the same result.
I posted the assembly generated by GCC here. The only difference between the two assembly listings is that, before addl $1, -12(%rbp), the faster one has two more instructions:

movl    -44(%rbp), %eax
movl    %eax, -48(%rbp)

So why does the program run faster with such an assignment?

Peter's answer is very helpful. Tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show the opposite result, which supports the idea that the store-forwarding speedup is specific to some Intel CPUs.
BeeOnRope's comments and advice drove me to rewrite the question. :)
The core of this question is an interesting phenomenon related to processor architecture and assembly, so I think it is worth discussing.

Answer

TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.

None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.

You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.

But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
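
As an aside on methodology, here is a minimal sketch of a less noise-sensitive harness, assuming the same kernel as the question (the names bench and RUNS are invented for illustration). Repeating the measurement in-process and keeping the minimum largely filters out turbo and power-management effects:

#include <stdio.h>
#include <time.h>

#define RUNS 10                       /* invented repeat count */

/* The kernel under test, wrapped so every run times identical code. */
static double bench(void) {
    int p, q;
    clock_t s = clock();
    for (int i = 1; i < 1000; i++)
        for (int j = 1; j < 1000; j++)
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q = p;
            }
    clock_t e = clock();
    (void)p; (void)q;                 /* silence unused-variable warnings */
    return (double)(e - s) / CLOCKS_PER_SEC;
}

int main(void) {
    double best = 1e30;
    for (int r = 0; r < RUNS; r++) {
        double t = bench();
        if (t < best) best = t;       /* the minimum is least affected by
                                         turbo / power-saving noise */
    }
    printf("best of %d runs: %f s\n", RUNS, best);
    return 0;
}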

If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.

You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
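
As a concrete sketch of that last point: if you want an optimized build to actually run the loop, the result has to be observable. The volatile variable below (sink, an invented name) is one common way to force that:

#include <stdio.h>

volatile int sink;                    /* invented name: an observable result */

int main(void) {
    for (int i = 1; i < 1000; i++)
        for (int j = 1; j < 1000; j++)
            for (int k = 1; k < 1000; k++)
                sink = i + j * k;     /* each store is a side effect the
                                         optimizer must preserve, so the
                                         loop nest survives -O3 */
    printf("%d\n", sink);
    return 0;
}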

The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
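
Here is a minimal sketch of such a loop-carried store/reload chain, using volatile to force the counter into memory even in an optimized build (variable names are illustrative):

#include <stdio.h>
#include <time.h>

int main(void) {
    volatile int k = 0;               /* forced to live in memory: every k++
                                         is a load, an add, and a store */
    clock_t s = clock();
    for (long i = 0; i < 100000000L; i++)
        k++;                          /* each load waits for the previous
                                         store to forward, so the loop runs
                                         at roughly one iteration per ~5-6
                                         cycles */
    double t = (double)(clock() - s) / CLOCKS_PER_SEC;
    printf("k = %d, %f s\n", k, t);
    return 0;
}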

If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
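
Building on the sketch above, the hypothetical filler store below plays the role of the q = p statement: an independent memory access between the dependent store and the next reload. On Sandybridge-family CPUs this can, counterintuitively, lower the forwarded latency of the k chain; on other microarchitectures expect it to be neutral or slower:

#include <stdio.h>
#include <time.h>

int main(void) {
    volatile int k = 0;
    volatile int filler = 0;          /* invented name: independent traffic */
    clock_t s = clock();
    for (long i = 0; i < 100000000L; i++) {
        k++;                          /* the dependent store/reload chain */
        filler = 1;                   /* independent store sitting between
                                         k's store and the next iteration's
                                         reload; on some Intel CPUs this can
                                         lower the forwarded latency of the
                                         k chain */
    }
    double t = (double)(clock() - s) / CLOCKS_PER_SEC;
    printf("k = %d, %f s\n", k, t);
    return 0;
}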

So with more work in the loop, that addl $1, -12(%rbp), which can sustain a throughput of one per ~6 cycles when run back-to-back, might instead only create a bottleneck of one iteration per 4 or 5 cycles.

This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.

Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.

The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)

Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals.)

So should you add useless statements like q = p to speed up real code? In general, no: compilers keep loop variables in registers for the innermost loops, and useless statements simply optimize away once optimization is enabled.
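
A hedged illustration (f is an invented function): compiling something like this with gcc -S at -O3 shows that the dead copy into q leaves no trace in the asm, whereas at -O0 every statement gets its own spill/reload block:

int f(int i, int j, int k) {
    int p = i + j * k;
    int q = p;                        /* dead copy: no asm is generated for
                                         this statement at -O3 */
    (void)q;
    return p;
}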

Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.

Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.

Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, particularly functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, possibly making things worse on some other CPUs. See the discussion in the comments.
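
To make that concrete, here is a hedged sketch (accumulate and sum are invented examples) of an out-parameter forcing a store/reload chain when the callee isn't inlined:

/* Invented example: an out-parameter that can't stay in a register
   across non-inlined calls. */
void accumulate(long *total, long x) {
    *total += x;                      /* load + add + store: back-to-back
                                         calls chain through
                                         store-forwarding on *total */
}

long sum(const long *a, int n) {
    long total = 0;
    for (int i = 0; i < n; i++)
        accumulate(&total, a[i]);     /* if accumulate doesn't inline (e.g.
                                         separate translation unit, no LTO),
                                         total bounces through memory every
                                         iteration */
    return total;
}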
