std::vector performance regression when enabling C++11
Question
I have found an interesting performance regression in a small C++ snippet, when I enable C++11:
#include <vector>

struct Item
{
    int a;
    int b;
};

int main()
{
    const std::size_t num_items = 10000000;
    std::vector<Item> container;
    container.reserve(num_items);
    for (std::size_t i = 0; i < num_items; ++i) {
        container.push_back(Item());
    }
    return 0;
}
With g++ (GCC) 4.8.2 20131219 (prerelease) and C++03 I get:
milian:/tmp$ g++ -O3 main.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
35.206824 task-clock # 0.988 CPUs utilized ( +- 1.23% )
4 context-switches # 0.116 K/sec ( +- 4.38% )
0 cpu-migrations # 0.006 K/sec ( +- 66.67% )
849 page-faults # 0.024 M/sec ( +- 6.02% )
95,693,808 cycles # 2.718 GHz ( +- 1.14% ) [49.72%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
95,282,359 instructions # 1.00 insns per cycle ( +- 0.65% ) [75.27%]
30,104,021 branches # 855.062 M/sec ( +- 0.87% ) [77.46%]
6,038 branch-misses # 0.02% of all branches ( +- 25.73% ) [75.53%]
0.035648729 seconds time elapsed ( +- 1.22% )
With C++11 enabled on the other hand, the performance degrades significantly:
milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
86.485313 task-clock # 0.994 CPUs utilized ( +- 0.50% )
9 context-switches # 0.104 K/sec ( +- 1.66% )
2 cpu-migrations # 0.017 K/sec ( +- 26.76% )
798 page-faults # 0.009 M/sec ( +- 8.54% )
237,982,690 cycles # 2.752 GHz ( +- 0.41% ) [51.32%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
135,730,319 instructions # 0.57 insns per cycle ( +- 0.32% ) [75.77%]
30,880,156 branches # 357.057 M/sec ( +- 0.25% ) [75.76%]
4,188 branch-misses # 0.01% of all branches ( +- 7.59% ) [74.08%]
0.087016724 seconds time elapsed ( +- 0.50% )
Can someone explain this? So far my experience has been that the STL gets faster by enabling C++11, especially thanks to move semantics.
EDIT: As suggested, using container.emplace_back(); instead, the performance gets on par with the C++03 version. How can the C++03 version achieve the same for push_back?
milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
36.229348 task-clock # 0.988 CPUs utilized ( +- 0.81% )
4 context-switches # 0.116 K/sec ( +- 3.17% )
1 cpu-migrations # 0.017 K/sec ( +- 36.85% )
798 page-faults # 0.022 M/sec ( +- 8.54% )
94,488,818 cycles # 2.608 GHz ( +- 1.11% ) [50.44%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
94,851,411 instructions # 1.00 insns per cycle ( +- 0.98% ) [75.22%]
30,468,562 branches # 840.991 M/sec ( +- 1.07% ) [76.71%]
2,723 branch-misses # 0.01% of all branches ( +- 9.84% ) [74.81%]
0.036678068 seconds time elapsed ( +- 0.80% )
Solution
I can reproduce your results on my machine with the options you gave in your post.
However, if I also enable link-time optimization (by passing the -flto flag to gcc 4.7.2), the results are identical.
(I am compiling your original code, with container.push_back(Item());)
$ g++ -std=c++11 -O3 -flto regr.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
35.426793 task-clock # 0.986 CPUs utilized ( +- 1.75% )
4 context-switches # 0.116 K/sec ( +- 5.69% )
0 CPU-migrations # 0.006 K/sec ( +- 66.67% )
19,801 page-faults # 0.559 M/sec
99,028,466 cycles # 2.795 GHz ( +- 1.89% ) [77.53%]
50,721,061 stalled-cycles-frontend # 51.22% frontend cycles idle ( +- 3.74% ) [79.47%]
25,585,331 stalled-cycles-backend # 25.84% backend cycles idle ( +- 4.90% ) [73.07%]
141,947,224 instructions # 1.43 insns per cycle
# 0.36 stalled cycles per insn ( +- 0.52% ) [88.72%]
37,697,368 branches # 1064.092 M/sec ( +- 0.52% ) [88.75%]
26,700 branch-misses # 0.07% of all branches ( +- 3.91% ) [83.64%]
0.035943226 seconds time elapsed ( +- 1.79% )
$ g++ -std=c++98 -O3 -flto regr.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
35.510495 task-clock # 0.988 CPUs utilized ( +- 2.54% )
4 context-switches # 0.101 K/sec ( +- 7.41% )
0 CPU-migrations # 0.003 K/sec ( +-100.00% )
19,801 page-faults # 0.558 M/sec ( +- 0.00% )
98,463,570 cycles # 2.773 GHz ( +- 1.09% ) [77.71%]
50,079,978 stalled-cycles-frontend # 50.86% frontend cycles idle ( +- 2.20% ) [79.41%]
26,270,699 stalled-cycles-backend # 26.68% backend cycles idle ( +- 8.91% ) [74.43%]
141,427,211 instructions # 1.44 insns per cycle
# 0.35 stalled cycles per insn ( +- 0.23% ) [87.66%]
37,366,375 branches # 1052.263 M/sec ( +- 0.48% ) [88.61%]
26,621 branch-misses # 0.07% of all branches ( +- 5.28% ) [83.26%]
0.035953916 seconds time elapsed
As for the reasons, one needs to look at the generated assembly code (g++ -std=c++11 -O3 -S regr.cpp). In C++11 mode the generated code is significantly more cluttered than in C++98 mode, and inlining of the function
void std::vector<Item, std::allocator<Item>>::_M_emplace_back_aux<Item>(Item&&)
fails in C++11 mode with the default inline-limit.
This failed inline has a domino effect. Not because the function is being called (it is not even called!) but because we have to be prepared: if it is called, the function arguments (Item.a and Item.b) must already be in the right place. This leads to pretty messy code.
Here is the relevant part of the generated code for the case where inlining succeeds:
.L42:
testq %rbx, %rbx # container$D13376$_M_impl$_M_finish
je .L3 #,
movl $0, (%rbx) #, container$D13376$_M_impl$_M_finish_136->a
movl $0, 4(%rbx) #, container$D13376$_M_impl$_M_finish_136->b
.L3:
addq $8, %rbx #, container$D13376$_M_impl$_M_finish
subq $1, %rbp #, ivtmp.106
je .L41 #,
.L14:
cmpq %rbx, %rdx # container$D13376$_M_impl$_M_finish, container$D13376$_M_impl$_M_end_of_storage
jne .L42 #,
This is a nice and compact for loop. Now, let's compare this to that of the failed inline case:
.L49:
testq %rax, %rax # D.15772
je .L26 #,
movq 16(%rsp), %rdx # D.13379, D.13379
movq %rdx, (%rax) # D.13379, *D.15772_60
.L26:
addq $8, %rax #, tmp75
subq $1, %rbx #, ivtmp.117
movq %rax, 40(%rsp) # tmp75, container.D.13376._M_impl._M_finish
je .L48 #,
.L28:
movq 40(%rsp), %rax # container.D.13376._M_impl._M_finish, D.15772
cmpq 48(%rsp), %rax # container.D.13376._M_impl._M_end_of_storage, D.15772
movl $0, 16(%rsp) #, D.13379.a
movl $0, 20(%rsp) #, D.13379.b
jne .L49 #,
leaq 16(%rsp), %rsi #,
leaq 32(%rsp), %rdi #,
call _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_ #
This code is cluttered and there is a lot more going on in the loop than in the previous case.
Before the function call
(last line shown), the arguments must be placed appropriately:
leaq 16(%rsp), %rsi #,
leaq 32(%rsp), %rdi #,
call _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_ #
Even though this call is never actually executed, the loop arranges things beforehand:
movl $0, 16(%rsp) #, D.13379.a
movl $0, 20(%rsp) #, D.13379.b
This is what leads to the messy code. If there is no function call because inlining succeeds, the loop contains only 2 move instructions and there is no fiddling with %rsp (the stack pointer). However, if inlining fails, we get 6 moves and a lot of traffic through %rsp.
Just to substantiate my theory, here are two runs, both in C++11 mode (note the -finline-limit values):
$ g++ -std=c++11 -O3 -finline-limit=105 regr.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
84.739057 task-clock # 0.993 CPUs utilized ( +- 1.34% )
8 context-switches # 0.096 K/sec ( +- 2.22% )
1 CPU-migrations # 0.009 K/sec ( +- 64.01% )
19,801 page-faults # 0.234 M/sec
266,809,312 cycles # 3.149 GHz ( +- 0.58% ) [81.20%]
206,804,948 stalled-cycles-frontend # 77.51% frontend cycles idle ( +- 0.91% ) [81.25%]
129,078,683 stalled-cycles-backend # 48.38% backend cycles idle ( +- 1.37% ) [69.49%]
183,130,306 instructions # 0.69 insns per cycle
# 1.13 stalled cycles per insn ( +- 0.85% ) [85.35%]
38,759,720 branches # 457.401 M/sec ( +- 0.29% ) [85.43%]
24,527 branch-misses # 0.06% of all branches ( +- 2.66% ) [83.52%]
0.085359326 seconds time elapsed ( +- 1.31% )
$ g++ -std=c++11 -O3 -finline-limit=106 regr.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
37.790325 task-clock # 0.990 CPUs utilized ( +- 2.06% )
4 context-switches # 0.098 K/sec ( +- 5.77% )
0 CPU-migrations # 0.011 K/sec ( +- 55.28% )
19,801 page-faults # 0.524 M/sec
104,699,973 cycles # 2.771 GHz ( +- 2.04% ) [78.91%]
58,023,151 stalled-cycles-frontend # 55.42% frontend cycles idle ( +- 4.03% ) [78.88%]
30,572,036 stalled-cycles-backend # 29.20% backend cycles idle ( +- 5.31% ) [71.40%]
140,669,773 instructions # 1.34 insns per cycle
# 0.41 stalled cycles per insn ( +- 1.40% ) [88.14%]
38,117,067 branches # 1008.646 M/sec ( +- 0.65% ) [89.38%]
27,519 branch-misses # 0.07% of all branches ( +- 4.01% ) [86.16%]
0.038187580 seconds time elapsed ( +- 2.05% )
Indeed, if we ask the compiler to try just a little bit harder to inline that function, the difference in performance goes away.
So what is the takeaway from this story? Failed inlines can cost you a lot, and you should make full use of the compiler's capabilities: I can only recommend link-time optimization. It gave a significant performance boost to my programs (up to 2.5x), and all I needed to do was pass the -flto flag. That's a pretty good deal! ;)
However, I do not recommend trashing your code with the inline keyword; let the compiler decide what to do. (The optimizer is allowed to treat the inline keyword as whitespace anyway.)
Great question, +1!