std::vector performance regression when enabling C++11

Published: 2016/10/14 22:36:19. Tags: c++, performance, gcc, c++11, vector


Question

I have found an interesting performance regression in a small C++ snippet, when I enable C++11:
#include <vector>

struct Item
{
  int a;
  int b;
};

int main()
{
  const std::size_t num_items = 10000000;
  std::vector<Item> container;
  container.reserve(num_items);
  for (std::size_t i = 0; i < num_items; ++i) {
    container.push_back(Item());
  }
  return 0;
}
With g++ (GCC) 4.8.2 20131219 (prerelease) and C++03 I get:
milian:/tmp$ g++ -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        35.206824 task-clock                #    0.988 CPUs utilized            ( +-  1.23% )
                4 context-switches          #    0.116 K/sec                    ( +-  4.38% )
                0 cpu-migrations            #    0.006 K/sec                    ( +- 66.67% )
              849 page-faults               #    0.024 M/sec                    ( +-  6.02% )
       95,693,808 cycles                    #    2.718 GHz                      ( +-  1.14% ) [49.72%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
       95,282,359 instructions              #    1.00  insns per cycle          ( +-  0.65% ) [75.27%]
       30,104,021 branches                  #  855.062 M/sec                    ( +-  0.87% ) [77.46%]
            6,038 branch-misses             #    0.02% of all branches          ( +- 25.73% ) [75.53%]

      0.035648729 seconds time elapsed                                          ( +-  1.22% )
With C++11 enabled on the other hand, the performance degrades significantly:
milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        86.485313 task-clock                #    0.994 CPUs utilized            ( +-  0.50% )
                9 context-switches          #    0.104 K/sec                    ( +-  1.66% )
                2 cpu-migrations            #    0.017 K/sec                    ( +- 26.76% )
              798 page-faults               #    0.009 M/sec                    ( +-  8.54% )
      237,982,690 cycles                    #    2.752 GHz                      ( +-  0.41% ) [51.32%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
      135,730,319 instructions              #    0.57  insns per cycle          ( +-  0.32% ) [75.77%]
       30,880,156 branches                  #  357.057 M/sec                    ( +-  0.25% ) [75.76%]
            4,188 branch-misses             #    0.01% of all branches          ( +-  7.59% ) [74.08%]

      0.087016724 seconds time elapsed                                          ( +-  0.50% )
Can someone explain this? So far, my experience has been that the STL gets faster when C++11 is enabled, especially thanks to move semantics.

EDIT: As suggested, with container.emplace_back(); instead, the performance is on par with the C++03 version. How can the C++03 version achieve the same with push_back?
milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        36.229348 task-clock                #    0.988 CPUs utilized            ( +-  0.81% )
                4 context-switches          #    0.116 K/sec                    ( +-  3.17% )
                1 cpu-migrations            #    0.017 K/sec                    ( +- 36.85% )
              798 page-faults               #    0.022 M/sec                    ( +-  8.54% )
       94,488,818 cycles                    #    2.608 GHz                      ( +-  1.11% ) [50.44%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
       94,851,411 instructions              #    1.00  insns per cycle          ( +-  0.98% ) [75.22%]
       30,468,562 branches                  #  840.991 M/sec                    ( +-  1.07% ) [76.71%]
            2,723 branch-misses             #    0.01% of all branches          ( +-  9.84% ) [74.81%]

      0.036678068 seconds time elapsed                                          ( +-  0.80% )

Solution
I can reproduce your results on my machine with the options you give in your post.

However, if I also enable link time optimization (I also pass the -flto flag to gcc 4.7.2), the results are identical:

(I am compiling your original code, with container.push_back(Item());)
$ g++ -std=c++11 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 

 Performance counter stats for './a.out' (10 runs):

         35.426793 task-clock                #    0.986 CPUs utilized            ( +-  1.75% )
                 4 context-switches          #    0.116 K/sec                    ( +-  5.69% )
                 0 CPU-migrations            #    0.006 K/sec                    ( +- 66.67% )
            19,801 page-faults               #    0.559 M/sec                  
        99,028,466 cycles                    #    2.795 GHz                      ( +-  1.89% ) [77.53%]
        50,721,061 stalled-cycles-frontend   #   51.22% frontend cycles idle     ( +-  3.74% ) [79.47%]
        25,585,331 stalled-cycles-backend    #   25.84% backend  cycles idle     ( +-  4.90% ) [73.07%]
       141,947,224 instructions              #    1.43  insns per cycle        
                                             #    0.36  stalled cycles per insn  ( +-  0.52% ) [88.72%]
        37,697,368 branches                  # 1064.092 M/sec                    ( +-  0.52% ) [88.75%]
            26,700 branch-misses             #    0.07% of all branches          ( +-  3.91% ) [83.64%]

       0.035943226 seconds time elapsed                                          ( +-  1.79% )



$ g++ -std=c++98 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 

 Performance counter stats for './a.out' (10 runs):

         35.510495 task-clock                #    0.988 CPUs utilized            ( +-  2.54% )
                 4 context-switches          #    0.101 K/sec                    ( +-  7.41% )
                 0 CPU-migrations            #    0.003 K/sec                    ( +-100.00% )
            19,801 page-faults               #    0.558 M/sec                    ( +-  0.00% )
        98,463,570 cycles                    #    2.773 GHz                      ( +-  1.09% ) [77.71%]
        50,079,978 stalled-cycles-frontend   #   50.86% frontend cycles idle     ( +-  2.20% ) [79.41%]
        26,270,699 stalled-cycles-backend    #   26.68% backend  cycles idle     ( +-  8.91% ) [74.43%]
       141,427,211 instructions              #    1.44  insns per cycle        
                                             #    0.35  stalled cycles per insn  ( +-  0.23% ) [87.66%]
        37,366,375 branches                  # 1052.263 M/sec                    ( +-  0.48% ) [88.61%]
            26,621 branch-misses             #    0.07% of all branches          ( +-  5.28% ) [83.26%]

       0.035953916 seconds time elapsed  
As for the reasons, one needs to look at the generated assembly code (g++ -std=c++11 -O3 -S regr.cpp). In C++11 mode the generated code is significantly more cluttered than in C++98 mode, and inlining the function

void std::vector<Item,std::allocator<Item>>::_M_emplace_back_aux<Item>(Item&&)

fails in C++11 mode with the default inline-limit. 

This failed inline has a domino effect. Not because this function is being called
(it is not even called!) but because the code has to be prepared for it: if it is called,
the function arguments (Item.a and Item.b) must already be in the right place. This leads to
pretty messy code.

Here is the relevant part of the generated code for the case where inlining succeeds:
.L42:
    testq   %rbx, %rbx  # container$D13376$_M_impl$_M_finish
    je  .L3 #,
    movl    $0, (%rbx)  #, container$D13376$_M_impl$_M_finish_136->a
    movl    $0, 4(%rbx) #, container$D13376$_M_impl$_M_finish_136->b
.L3:
    addq    $8, %rbx    #, container$D13376$_M_impl$_M_finish
    subq    $1, %rbp    #, ivtmp.106
    je  .L41    #,
.L14:
    cmpq    %rbx, %rdx  # container$D13376$_M_impl$_M_finish, container$D13376$_M_impl$_M_end_of_storage
    jne .L42    #,
This is a nice, compact for loop. Now, let's compare it to the failed-inline case:
.L49:
    testq   %rax, %rax  # D.15772
    je  .L26    #,
    movq    16(%rsp), %rdx  # D.13379, D.13379
    movq    %rdx, (%rax)    # D.13379, *D.15772_60
.L26:
    addq    $8, %rax    #, tmp75
    subq    $1, %rbx    #, ivtmp.117
    movq    %rax, 40(%rsp)  # tmp75, container.D.13376._M_impl._M_finish
    je  .L48    #,
.L28:
    movq    40(%rsp), %rax  # container.D.13376._M_impl._M_finish, D.15772
    cmpq    48(%rsp), %rax  # container.D.13376._M_impl._M_end_of_storage, D.15772
    movl    $0, 16(%rsp)    #, D.13379.a
    movl    $0, 20(%rsp)    #, D.13379.b
    jne .L49    #,
    leaq    16(%rsp), %rsi  #,
    leaq    32(%rsp), %rdi  #,
    call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #
This code is cluttered and there is a lot more going on in the loop than in the previous case. 
Before the function call (last line shown), the arguments must be placed appropriately:
leaq    16(%rsp), %rsi  #,
leaq    32(%rsp), %rdi  #,
call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #
Even though the call is never actually executed, the loop sets up its argument beforehand on every iteration:
movl    $0, 16(%rsp)    #, D.13379.a
movl    $0, 20(%rsp)    #, D.13379.b
This leads to the messy code. If there is no function call because inlining succeeds,
we have only 2 move instructions in the loop and no messing around with %rsp (the stack pointer). However, if the inlining fails, we get 6 moves and we mess a lot with %rsp.

Just to substantiate my theory (note the -finline-limit), both in C++11 mode:
 $ g++ -std=c++11 -O3 -finline-limit=105 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

         84.739057 task-clock                #    0.993 CPUs utilized            ( +-  1.34% )
                 8 context-switches          #    0.096 K/sec                    ( +-  2.22% )
                 1 CPU-migrations            #    0.009 K/sec                    ( +- 64.01% )
            19,801 page-faults               #    0.234 M/sec                  
       266,809,312 cycles                    #    3.149 GHz                      ( +-  0.58% ) [81.20%]
       206,804,948 stalled-cycles-frontend   #   77.51% frontend cycles idle     ( +-  0.91% ) [81.25%]
       129,078,683 stalled-cycles-backend    #   48.38% backend  cycles idle     ( +-  1.37% ) [69.49%]
       183,130,306 instructions              #    0.69  insns per cycle        
                                             #    1.13  stalled cycles per insn  ( +-  0.85% ) [85.35%]
        38,759,720 branches                  #  457.401 M/sec                    ( +-  0.29% ) [85.43%]
            24,527 branch-misses             #    0.06% of all branches          ( +-  2.66% ) [83.52%]

       0.085359326 seconds time elapsed                                          ( +-  1.31% )

 $ g++ -std=c++11 -O3 -finline-limit=106 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

         37.790325 task-clock                #    0.990 CPUs utilized            ( +-  2.06% )
                 4 context-switches          #    0.098 K/sec                    ( +-  5.77% )
                 0 CPU-migrations            #    0.011 K/sec                    ( +- 55.28% )
            19,801 page-faults               #    0.524 M/sec                  
       104,699,973 cycles                    #    2.771 GHz                      ( +-  2.04% ) [78.91%]
        58,023,151 stalled-cycles-frontend   #   55.42% frontend cycles idle     ( +-  4.03% ) [78.88%]
        30,572,036 stalled-cycles-backend    #   29.20% backend  cycles idle     ( +-  5.31% ) [71.40%]
       140,669,773 instructions              #    1.34  insns per cycle        
                                             #    0.41  stalled cycles per insn  ( +-  1.40% ) [88.14%]
        38,117,067 branches                  # 1008.646 M/sec                    ( +-  0.65% ) [89.38%]
            27,519 branch-misses             #    0.07% of all branches          ( +-  4.01% ) [86.16%]

       0.038187580 seconds time elapsed                                          ( +-  2.05% )
Indeed, if we ask the compiler to try just a little bit harder to inline that function, the difference in performance goes away.



So what is the takeaway from this story? Failed inlines can cost you a lot, and you should make full use of the compiler's capabilities: I can only recommend link time optimization. It gave a significant performance boost to my programs (up to 2.5x) and all I needed to do was pass the -flto flag. That's a pretty good deal! ;)

However, I do not recommend trashing your code with the inline keyword; let the compiler decide what to do. (The optimizer is allowed to treat the inline keyword as white space anyway.)



Great question, +1!
