gcc optimization flag -O3 makes code slower than -O2


Question

I found this topic: Why is it faster to process a sorted array than an unsorted array?, and tried to run its code. I noticed strange behavior: if I compile the code with the -O3 optimization flag it takes 2.98605 sec to run, but with -O2 it takes 1.98093 sec. I ran the code several times (5 or 6) on the same machine in the same environment, with all other software closed (Chrome, Skype, etc.).

gcc --version
gcc (Ubuntu 4.9.2-0ubuntu1~14.04) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So please, can you explain why this happens? I read the gcc manual and saw that -O3 includes -O2. Thank you for your help.

PS: Here is the code.

#include <algorithm>
#include <cstdlib>   // for std::rand
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}

Answer

gcc -O3 uses a cmov for the conditional, so it lengthens the loop-carried dependency chain to include a cmov (which is 2 uops and 2 cycles of latency on your Intel Sandybridge CPU, according to Agner Fog's instruction tables. See also the x86 tag wiki). This is one of the cases where cmov sucks.

If the data was even moderately unpredictable, cmov would probably be a win, so this is a fairly sensible choice for a compiler to make. (However, compilers may sometimes use branchless code too much.)

I put your code on the Godbolt compiler explorer to see the asm (with nice highlighting and filtering out irrelevant lines. You still have to scroll down past all the sort code to get to main(), though).

.L82:  # the inner loop from gcc -O3
    movsx   rcx, DWORD PTR [rdx]  # sign-extending load of data[c]
    mov     rsi, rcx
    add     rcx, rbx        # rcx = sum+data[c]
    cmp     esi, 127
    cmovg   rbx, rcx        # sum = data[c]>127 ? rcx : sum
    add     rdx, 4          # pointer-increment
    cmp     r12, rdx
    jne     .L82

gcc could have saved the MOV by using LEA instead of ADD.

The loop bottlenecks on the latency of ADD->CMOV (3 cycles), since one iteration of the loop writes rbx with CMOV, and the next iteration reads rbx with ADD.

The loop only contains 8 fused-domain uops, so it can issue at one per 2 cycles. Execution-port pressure is also not as bad a bottleneck as the latency of the sum dep chain, but it's close (Sandybridge only has 3 ALU ports, unlike Haswell's 4).

BTW, writing it as sum += (data[c] >= 128 ? data[c] : 0); to take the cmov out of the loop-carried dep chain is potentially useful. There are still lots of instructions, but the cmov in each iteration is independent. This compiles as expected with gcc6.3 -O2 and earlier, but gcc7 de-optimizes it back into a cmov on the critical path (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666). (It also auto-vectorizes with earlier gcc versions than the if() way of writing it does.)

Clang takes the cmov off the critical path even with the original source.

gcc -O2 uses a branch (for gcc5.x and older), which predicts well because your data is sorted. Since modern CPUs use branch-prediction to handle control dependencies, the loop-carried dependency chain is shorter: just an add (1 cycle latency).

The compare-and-branch in every iteration is independent, thanks to branch-prediction + speculative execution, which lets execution continue before the branch direction is known for sure.

.L83:   # The inner loop from gcc -O2
    movsx   rcx, DWORD PTR [rdx]  # load with sign-extension from int32 to int64
    cmp     ecx, 127
    jle     .L82        # conditional-jump over the next instruction 
    add     rbp, rcx    # sum+=data[c]
.L82:
    add     rdx, 4
    cmp     rbx, rdx
    jne     .L83

There are two loop-carried dependency chains: sum and the loop-counter. sum is 0 or 1 cycle long, and the loop-counter is always 1 cycle long. However, the loop is 5 fused-domain uops on Sandybridge, so it can't execute at 1c per iteration anyway, so latency isn't a bottleneck.

It probably runs at about one iteration per 2 cycles (bottlenecked on branch instruction throughput), vs. one per 3 cycles for the -O3 loop. The next bottleneck would be ALU uop throughput: 4 ALU uops (in the not-taken case) but only 3 ALU ports. (ADD can run on any port).

This pipeline analysis matches the observed timings of ~3 sec for -O3 and ~2 sec for -O2 almost perfectly.

Haswell/Skylake could run the not-taken case at one per 1.25 cycles, since it can execute a not-taken branch in the same cycle as a taken branch and has 4 ALU ports. (Or slightly less since a 5 uop loop doesn't quite issue at 4 uops every cycle).

(Just tested: Skylake @ 3.9GHz runs the branchy version of the whole program in 1.45s, or the branchless version in 1.68s. So the difference is much smaller there.)

g++6.3.1 uses cmov even at -O2, but g++5.4 still behaves like 4.9.2.

With both g++6.3.1 and g++5.4, using -fprofile-generate / -fprofile-use produces the branchy version even at -O3 (with -fno-tree-vectorize).

The CMOV version of the loop from newer gcc uses add ecx,-128 / cmovge rbx,rdx instead of CMP/CMOV. That's kinda weird, but probably doesn't slow it down. ADD writes an output register as well as flags, so creates more pressure on the number of physical registers. But as long as that's not a bottleneck, it should be about equal.

Newer gcc auto-vectorizes the loop with -O3, which is a significant speedup even with just SSE2. (e.g. my i7-6700k Skylake runs the vectorized version in 0.74s, so about twice as fast as scalar. Or -O3 -march=native in 0.35s, using AVX2 256b vectors).

The vectorized version looks like a lot of instructions, but it's not too bad, and most of them aren't part of a loop-carried dep chain. It only has to unpack to 64-bit elements near the end. It does pcmpgtd twice, though, because it doesn't realize it could just zero-extend instead of sign-extend when the condition has already zeroed all negative integers.
