为什么C ++ std :: max_element这么慢? [英] why is c++ std::max_element so slow?
问题描述
我需要在向量中找到max元素,所以我正在使用std::max_element
,但是我发现它是一个非常慢的函数,因此我编写了自己的版本并设法获得x3更好的性能,这是代码:
I need to find the max element in the vector so I'm using std::max_element
, but I've found that it's a very slow function, so I wrote my own version and manage to get x3 better performance, here is the code:
#include <string>
#include <iostream>
#include <vector>
#include <algorithm>
#include <sys/time.h>
double getRealTime()
{
struct timeval tv;
gettimeofday(&tv, 0);
return (double) tv.tv_sec + 1.0e-6 * (double) tv.tv_usec;
}
inline int my_max_element(const std::vector<int> &vec, int size)
{
auto it = vec.begin();
int max = *it++;
for (; it != vec.end(); it++)
{
if (*it > max)
{
max = *it;
}
}
return max;
}
int main()
{
const int size = 1 << 20;
std::vector<int> vec;
for (int i = 0; i < size; i++)
{
if (i == 59)
{
vec.push_back(1000000012);
}
else
{
vec.push_back(i);
}
}
double startTime = getRealTime();
int maxIter = *std::max_element(vec.begin(), vec.end());
double stopTime = getRealTime();
double totalIteratorTime = stopTime - startTime;
startTime = getRealTime();
int maxArray = my_max_element(vec, size);
stopTime = getRealTime();
double totalArrayTime = stopTime - startTime;
std::cout << "MaxIter = " << maxIter << std::endl;
std::cout << "MaxArray = " << maxArray << std::endl;
std::cout << "Total CPU time iterator = " << totalIteratorTime << std::endl;
std::cout << "Total CPU time array = " << totalArrayTime << std::endl;
std::cout << "iter/array ratio: = " << totalIteratorTime / totalArrayTime << std::endl;
return 0;
}
输出:
MaxIter = 1000000012
MaxArray = 1000000012
Total CPU time iterator = 0.000989199
Total CPU time array = 0.000293016
iter/array ratio: = 3.37592
平均std::max_element
花的时间比my_max_element
多x3.
那么,为什么我能这么容易地创建一个更快的std函数呢?我应该停止使用std并编写自己的函数,因为std太慢了吗?
on average std::max_element
takes x3 more time then my_max_element
.
So why am I able to create a much faster std function so easily? Should I stop using std and write my own functions since std is so slow?
注意:起初我是因为是因为我在for循环中使用了整数i
而不是迭代器,但这对现在来说已经无关紧要了.
Note: at first I though it was because I'm using and integer i
in a for loop instead of an iterator, but that seams to not matter now.
编译信息:
g ++(GCC)4.8.2
g++ (GCC) 4.8.2
g ++ -O3-墙-c -fmessage-length = 0 -std = c ++ 0x
g++ -O3 -Wall -c -fmessage-length=0 -std=c++0x
推荐答案
在对此答案进行投票之前,请在您的计算机上进行测试(并验证)并评论/添加结果.请注意,我在测试中使用的矢量大小为1000 * 1000 * 1000.目前,此答案有19个投票,但只有一个发布的结果,这些结果未显示出下面描述的效果(尽管使用不同的测试代码获得,请参阅注释).
Before voting on this answer, please test (and verify) this on your machine and comment/add the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes but only one posted results, and these results did not show the effect described below (though obtained with a different test code, see comments).
似乎存在优化程序错误/工件.比较以下时间:
There seems to be an optimizer bug/artifact. Compare the times of:
template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
if (__first == __last) return __first;
_ForwardIterator __result = __first;
while(++__first != __last)
if (__comp(__result, __first))
__result = __first;
return __result;
}
template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
if (__first == __last) return __first;
_ForwardIterator __result = __first;
++__first;
for(; __first != __last; ++__first)
if (__comp(__result, __first))
__result = __first;
return __result;
}
第一个是原始libstdc ++实现,第二个应该是对行为或要求没有任何更改的转换. Clang ++对于这两个函数产生的运行时间非常相似,而第二个版本的g ++ 4.8.2快四倍.
The first is the original libstdc++ implementation, the second one should be a transformation without any changes in behaviour or requirements. Clang++ produces very similar run times for those two functions, whereas g++4.8.2 is four times faster with the second version.
按照Maxim的建议,将向量从int
更改为int64_t
,更改后的版本不是4,而仅是原始版本(g ++ 4.8.2)的1.7倍.
Following Maxim's proposal, changing the vector from int
to int64_t
, the changed version is not 4, but only 1.7 times faster than the original version (g++4.8.2).
区别在于*result
的预测通用性,即存储当前max元素的值,这样就不必每次都从内存中重新加载它.这提供了一种更简洁的缓存访问模式:
The difference is in predictive commoning of *result
, that is, storing the value of the current max element so that it does not have to be reloaded from memory each time. This gives a far cleaner cache access pattern:
w/o commoning with commoning
* *
** *
** *
** *
* * *
* * *
* * *
以下是用于比较的asm(rdi
/rsi
分别包含第一个/最后一个迭代器):
Here's the asm for comparison (rdi
/rsi
contain the first/last iterators respectively):
使用while循环(2.88743毫秒; 要点):
With the while loop (2.88743 ms; gist):
movq %rdi, %rax
jmp .L49
.L51:
movl (%rdi), %edx
cmpl %edx, (%rax)
cmovl %rdi, %rax
.L49:
addq $4, %rdi
cmpq %rsi, %rdi
jne .L51
使用for循环(1235.55μs):
With the for loop (1235.55 μs):
leaq 4(%rdi), %rdx
movq %rdi, %rax
cmpq %rsi, %rdx
je .L53
movl (%rdi), %ecx
.L54:
movl (%rdx), %r8d
cmpl %r8d, %ecx
cmovl %rdx, %rax
cmovl %r8d, %ecx
addq $4, %rdx
cmpq %rdx, %rsi
jne .L54
.L53:
如果我通过在开始时将*result
明确地存储到变量prev
中并每次更新result
并在比较中使用prev
而不是*result
来强行实现通用,那么我得到的循环更快(377.601μs):
If I force commoning by explicitly storing *result
into a variable prev
at the start and whenever result
is updated, and using prev
instead of *result
in the comparison, I get an even faster loop (377.601 μs):
movl (%rdi), %ecx
movq %rdi, %rax
.L57:
addq $4, %rdi
cmpq %rsi, %rdi
je .L60
.L59:
movl (%rdi), %edx
cmpl %edx, %ecx
jge .L57
movq %rdi, %rax
addq $4, %rdi
movl %edx, %ecx
cmpq %rsi, %rdi
jne .L59
.L60:
之所以比for
循环快,是因为上述条件移动(cmovl
)很少执行( H n 次,这是可以忽略的比例(H n 对数增长,因此H n /n迅速接近0).条件移动代码只会在病理数据上更好,例如[1、0、3、2、5、4 ...].
The reason this is faster than the for
loop is that the conditional moves (cmovl
) in the above are a pessimisation as they are executed so rarely (Linus says that cmov is only a good idea if the branch is unpredictable). Note that for randomly distributed data the branch is expected to be taken Hn times, which is a negligible proportion (Hn grows logarithmically, so Hn/n rapidly approaches 0). The conditional-move code will only be better on pathological data e.g. [1, 0, 3, 2, 5, 4, ...].
这篇关于为什么C ++ std :: max_element这么慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!