Why does gcc generate 15-20% faster code if I optimize for size instead of speed?


Question


I first noticed in 2009 that gcc (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).

(Update: I have moved all assembly code to GitHub: the listings made the post bloated and apparently added very little value to the question, as the -fno-align-* flags have the same effect.)

The generated assembly with -Os and -O2. Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines, result here. This code still runs in 0.38s and the only difference is the .p2align stuff.

If I guess correctly, these are paddings for stack alignment. According to Why does GCC pad functions with NOPs? it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.

Is it the padding that is the culprit in this case? Why and how?

The noise it introduces pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?


UPDATE:

Following Pascal Cuoq's answer I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align are gone from the assembly and the generated executable runs in 0.38s. According to the gcc documentation:

-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:

  -falign-functions  -falign-jumps  -falign-loops
  -falign-labels  -freorder-blocks  -freorder-blocks-and-partition
  -fprefetch-loop-arrays

So, it pretty much seems like a (mis)alignment issue.

I am still skeptical about -march=native as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)


UPDATE 2:

We can take -Os out of the picture. The following times are obtained by compiling with

  • -O2 -fno-omit-frame-pointer 0.37s

  • -O2 -fno-align-functions -fno-align-loops 0.37s

  • -S -O2 then manually moving the assembly of add() after work() 0.37s

  • -O2 0.44s

It looks to me like the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could only get one consistent result out of it:

-O2:

 602,312,864 stalled-cycles-frontend   #    0.00% frontend cycles idle
       3,318 cache-misses
 0.432703993 seconds time elapsed
 [...]
 81.23%  a.out  a.out              [.] work(int, int)
 18.50%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
100.00 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
       ¦   ? retq
[...]
       ¦            int z = add(x, y);
  1.93 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 79.79 ¦      add    %eax,%ebx

For -fno-align-*:

 604,072,552 stalled-cycles-frontend   #    0.00% frontend cycles idle
       9,508 cache-misses
 0.375681928 seconds time elapsed
 [...]
 82.58%  a.out  a.out              [.] work(int, int)
 16.83%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
 51.59 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
[...]
       ¦    __attribute__((noinline))
       ¦    static int work(int xval, int yval) {
       ¦        int sum(0);
       ¦        for (int i=0; i<LOOP_BOUND; ++i) {
       ¦            int x(xval+sum);
  8.20 ¦      lea    0x0(%r13,%rbx,1),%edi
       ¦            int y(yval+sum);
       ¦            int z = add(x, y);
 35.34 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 39.48 ¦      add    %eax,%ebx
       ¦    }

For -fno-omit-frame-pointer:

 404,625,639 stalled-cycles-frontend   #    0.00% frontend cycles idle
      10,514 cache-misses
 0.375445137 seconds time elapsed
 [...]
 75.35%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 24.46%  a.out  a.out              [.] work(int, int)
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
 18.67 ¦     push   %rbp
       ¦       return x + y;
 18.49 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   const int LOOP_BOUND = 200000000;
       ¦
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦     mov    %rsp,%rbp
       ¦       return x + y;
       ¦   }
 12.71 ¦     pop    %rbp
       ¦   ? retq
 [...]
       ¦            int z = add(x, y);
       ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 29.83 ¦      add    %eax,%ebx

It looks like we are stalling on the call to add() in the slow case.

I have examined everything that perf -e can spit out on my machine; not just the stats that are given above.

For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)

I included the cache misses as it came up as the first comment. I examined all the cache misses that can be measured on my machine by perf, not just the ones given above. The cache misses are very very noisy and show little to no correlation with the execution times.

Solution

My colleague helped me find a plausible answer to my question. He noticed the importance of the 256 byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).


Short answer:

Is it the padding that is the culprit in this case? Why and how?

It all boils down to alignment. Alignment can have a significant impact on performance; that is why we have the -falign-* flags in the first place.

I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 byte by default but try to align it to 16 byte if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment and the generated code does not show this weird behavior.

Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary / bad alignment just eats up bytes for no reason and potentially increases cache misses, etc.

The noise it introduces pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source codes?

Simply by telling gcc to do the right alignment:

g++ -O2 -falign-functions=16 -falign-loops=16


Long answer:

The code will run slower if:

  • an XX byte boundary cuts add() in the middle (XX being machine dependent).

  • if the call to add() has to jump over an XX byte boundary and the target is not aligned.

  • if add() is not aligned.

  • if the loop is not aligned.

The first two are beautifully visible in the code and results that Marat Dukhan kindly posted. In this case, gcc-4.8.1 -Os (executes in 0.994 secs):

00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3                      ret

a 256 byte boundary cuts add() right in the middle and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!

In case gcc-4.7.3 -Os (executes in 0.822 secs), the 256 byte boundary only cuts into a cold section (but neither the loop, nor add() is cut):

00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

[...]

  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>

Nothing is aligned, and the call to add() has to jump over the 256 byte boundary. This code is the second slowest.

In case gcc-4.6.4 -Os (executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256 byte boundary and the target is exactly 32 bytes away:

  4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
  4004f7:       01 c3                   add    ebx,eax
  4004f9:       ff cd                   dec    ebp
  4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>

This is the fastest of all three. Why the 256 byte boundary is special on his machine, I will leave it up to him to figure out. I don't have such a processor.

Now, on my machine I don't get this 256 byte boundary effect. Only the function and loop alignment kick in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16 then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiple of 16; the code is not sensitive to that either.

I first noticed in 2009 that gcc (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

A likely explanation is that I had hotspots which were sensitive to the alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident and the code became faster. It had nothing to do with optimizing for size: it was by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.

Oh, and one more thing. How can hotspots like the one shown in the example arise? How can the inlining of a function as tiny as add() fail?

Consider this:

// add.cpp
int add(const int& x, const int& y) {
    return x + y;
}

and in a separate file:

// main.cpp
int add(const int& x, const int& y);

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

and compiled as: g++ -O2 add.cpp main.cpp.

gcc won't inline add()!

That's all, it's that easy to unintentionally create hotspots like the one in the OP. Of course it is partly my fault: gcc is an excellent compiler. If I compile the above as g++ -O2 -flto add.cpp main.cpp, that is, if I perform link-time optimization, the code runs in 0.19s!

(Inlining is artificially disabled in the OP; hence, the code in the OP was 2x slower.)
