gcc optimization flag -O3 makes code slower than -O2


Problem Description


I came across the topic Why is it faster to process a sorted array than an unsorted array? and tried to run the code from it, and I noticed some strange behavior. If I compile the code with the -O3 optimization flag, it takes 2.98605 sec to run; if I compile with -O2, it takes 1.98093 sec. I ran it several times (5 or 6) on the same machine in the same environment, with all other software (Chrome, Skype, etc.) closed.

gcc --version
gcc (Ubuntu 4.9.2-0ubuntu1~14.04) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So can you please explain why this happens? I read the gcc manual, and it says that -O3 includes -O2. Thank you for your help.

P.S. Here is the code:

#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}

Solution

gcc -O3 uses a cmov for the conditional, so it lengthens the loop-carried dependency chain to include a cmov (which is 2 uops and 2 cycles of latency on your Intel Sandybridge CPU, according to Agner Fog's instruction tables. See also the x86 tag wiki). This is one of the cases where cmov sucks.

If the data was even moderately unpredictable, cmov would probably be a win, so this is a fairly sensible choice for a compiler to make. (However, compilers may sometimes use branchless code too much.)

I put your code on the Godbolt compiler explorer to see the asm (with nice highlighting and filtering out irrelevant lines. You still have to scroll down past all the sort code to get to main(), though).

.L82:  # the inner loop from gcc -O3
    movsx   rcx, DWORD PTR [rdx]  # sign-extending load of data[c]
    mov     rsi, rcx
    add     rcx, rbx        # rcx = sum+data[c]
    cmp     esi, 127
    cmovg   rbx, rcx        # sum = data[c]>127 ? rcx : sum
    add     rdx, 4          # pointer-increment
    cmp     r12, rdx
    jne     .L82

gcc could have saved the MOV by using LEA instead of ADD.

The loop bottlenecks on the latency of ADD->CMOV (3 cycles), since one iteration of the loop writes rbx with CMOV, and the next iteration reads rbx with ADD.

The loop only contains 8 fused-domain uops, so with Sandybridge's 4-wide issue it can issue at one iteration per 2 cycles. Execution-port pressure is also not as bad a bottleneck as the latency of the sum dep chain, but it's close (Sandybridge only has 3 ALU ports, unlike Haswell's 4).

BTW, writing it as sum += (data[c] >= 128 ? data[c] : 0); to take the cmov out of the loop-carried dep chain is potentially useful. Still lots of instructions, but the cmov in each iteration is independent. This compiles as expected in gcc6.3 -O2 and earlier, but gcc7 de-optimizes into a cmov on the critical path (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666). (It also auto-vectorizes with earlier gcc versions than the if() way of writing it.)
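For reference, a minimal sketch of that branchless rewrite (only the inner loop body changes from the original):

// The ternary lets the compiler compute the addend with an independent
// cmov, leaving only the 1-cycle add on the loop-carried critical path.
for (unsigned c = 0; c < arraySize; ++c)
    sum += (data[c] >= 128 ? data[c] : 0);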

Clang takes the cmov off the critical path even with the original source.


gcc -O2 uses a branch (for gcc5.x and older), which predicts well because your data is sorted. Since modern CPUs use branch-prediction to handle control dependencies, the loop-carried dependency chain is shorter: just an add (1 cycle latency).

The compare-and-branch in every iteration is independent, thanks to branch-prediction + speculative execution, which lets execution continue before the branch direction is known for sure.

.L83:   # The inner loop from gcc -O2
    movsx   rcx, DWORD PTR [rdx]  # load with sign-extension from int32 to int64
    cmp     ecx, 127
    jle     .L82        # conditional-jump over the next instruction 
    add     rbp, rcx    # sum+=data[c]
.L82:
    add     rdx, 4
    cmp     rbx, rdx
    jne     .L83

There are two loop-carried dependency chains: sum and the loop-counter. sum is 0 or 1 cycle long, and the loop-counter is always 1 cycle long. However, the loop is 5 fused-domain uops on Sandybridge, so it can't execute at 1c per iteration anyway, so latency isn't a bottleneck.

It probably runs at about one iteration per 2 cycles (bottlenecked on branch instruction throughput), vs. one per 3 cycles for the -O3 loop. The next bottleneck would be ALU uop throughput: 4 ALU uops (in the not-taken case) but only 3 ALU ports. (ADD can run on any port).

This pipeline-analysis prediction matches pretty much exactly with your timings of ~3 sec for -O3 vs. ~2 sec for -O2.
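As a rough sanity check (the question doesn't state the clock speed, so assume a Sandybridge-era core around 3.3GHz): 32768 elements x 100000 repeats is about 3.28 billion inner iterations, and 3.28e9 iterations x 2 cycles / 3.3e9 Hz ≈ 2.0 sec, vs. 3.28e9 x 3 cycles / 3.3e9 Hz ≈ 3.0 sec.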


Haswell/Skylake could run the not-taken case at one per 1.25 cycles, since it can execute a not-taken branch in the same cycle as a taken branch and has 4 ALU ports. (Or slightly less since a 5 uop loop doesn't quite issue at 4 uops every cycle).

(Just tested: Skylake @ 3.9GHz runs the branchy version of the whole program in 1.45s, or the branchless version in 1.68s. So the difference is much smaller there.)


g++6.3.1 uses cmov even at -O2, but g++5.4 still behaves like 4.9.2.

With both g++6.3.1 and g++5.4, using -fprofile-generate / -fprofile-use produces the branchy version even at -O3 (with -fno-tree-vectorize).
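The PGO workflow looks something like this (file names are hypothetical; the flags are standard gcc):

g++ -O3 -fno-tree-vectorize -fprofile-generate branchy.cpp -o branchy
./branchy       # training run; writes .gcda profile data
g++ -O3 -fno-tree-vectorize -fprofile-use      branchy.cpp -o branchy

Presumably the profile shows the branch is heavily biased one way with the sorted data, so gcc keeps the branch instead of if-converting it to cmov.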

The CMOV version of the loop from newer gcc uses add ecx,-128 / cmovge rbx,rdx instead of CMP/CMOV (the ADD sets flags from data[c] - 128, so cmovge still tests data[c] >= 128). That's kinda weird, but probably doesn't slow it down. ADD writes an output register as well as flags, so it creates more pressure on the number of physical registers. But as long as that's not a bottleneck, it should be about equal.


Newer gcc auto-vectorizes the loop with -O3, which is a significant speedup even with just SSE2. (e.g. my i7-6700k Skylake runs the vectorized version in 0.74s, so about twice as fast as scalar. Or -O3 -march=native in 0.35s, using AVX2 256b vectors).

The vectorized version looks like a lot of instructions, but it's not too bad, and most of them aren't part of a loop-carried dep chain. It only has to unpack to 64-bit elements near the end. It does pcmpgtd twice, though, because it doesn't realize it could just zero-extend instead of sign-extend when the condition has already zeroed all negative integers.
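For illustration, here's roughly what that vectorized loop computes, written by hand with SSE2 intrinsics (a sketch, not gcc's actual output; the function name and loop structure are mine). Note how, once the compare mask has zeroed the rejected elements, everything left is non-negative, so zero-extending to 64-bit is sufficient; no second pcmpgtd is needed:

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>

// Sum all elements >= 128; n must be a multiple of 4 for this sketch.
long long sum_ge128_sse2(const int* data, std::size_t n)
{
    const __m128i thresh = _mm_set1_epi32(127);
    __m128i vsum = _mm_setzero_si128();              // two 64-bit accumulators
    for (std::size_t i = 0; i < n; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i*)(data + i));
        __m128i mask = _mm_cmpgt_epi32(v, thresh);   // data[c] > 127, i.e. >= 128
        __m128i kept = _mm_and_si128(v, mask);       // rejected elements become 0
        // Every kept element is >= 0, so interleaving with zero
        // (zero-extension) correctly widens 32-bit lanes to 64-bit.
        vsum = _mm_add_epi64(vsum, _mm_unpacklo_epi32(kept, _mm_setzero_si128()));
        vsum = _mm_add_epi64(vsum, _mm_unpackhi_epi32(kept, _mm_setzero_si128()));
    }
    long long lanes[2];
    _mm_storeu_si128((__m128i*)lanes, vsum);         // horizontal sum of 2 lanes
    return lanes[0] + lanes[1];
}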
