为什么C ++ std :: max_element这么慢? [英] why is c++ std::max_element so slow?

查看：115 发布时间：2020/5/8 0:04:22 c++ gcc vector iterator max

本文介绍了为什么C ++ std :: max_element这么慢?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我需要在向量中找到max元素，所以我正在使用std::max_element，但是我发现它是一个非常慢的函数，因此我编写了自己的版本并设法获得x3更好的性能，这是代码:

I need to find the max element in the vector so I'm using std::max_element, but I've found that it's a very slow function, so I wrote my own version and manage to get x3 better performance, here is the code:

#include <string>
#include <iostream>
#include <vector>
#include <algorithm>

#include <sys/time.h>

double getRealTime()
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (double) tv.tv_sec + 1.0e-6 * (double) tv.tv_usec;
}

inline int my_max_element(const std::vector<int> &vec, int size)
{
    auto it = vec.begin();
    int max = *it++;
    for (; it != vec.end(); it++)
    {
        if (*it > max)
        {
            max = *it;
        }
    }
    return max;
}

int main()
{
    const int size = 1 << 20;
    std::vector<int> vec;
    for (int i = 0; i < size; i++)
    {
        if (i == 59)
        {
            vec.push_back(1000000012);
        }
        else
        {
            vec.push_back(i);
        }
    }

    double startTime = getRealTime();
    int maxIter = *std::max_element(vec.begin(), vec.end());
    double stopTime = getRealTime();
    double totalIteratorTime = stopTime - startTime;

    startTime = getRealTime();
    int maxArray = my_max_element(vec, size);
    stopTime = getRealTime();
    double totalArrayTime = stopTime - startTime;

    std::cout << "MaxIter = " << maxIter << std::endl;
    std::cout << "MaxArray = " << maxArray << std::endl;
    std::cout << "Total CPU time iterator = " << totalIteratorTime << std::endl;
    std::cout << "Total CPU time array = " << totalArrayTime << std::endl;
    std::cout << "iter/array ratio: = " << totalIteratorTime / totalArrayTime << std::endl;
    return 0;
}

输出:

MaxIter = 1000000012
MaxArray = 1000000012
Total CPU time iterator = 0.000989199
Total CPU time array = 0.000293016
iter/array ratio: = 3.37592

平均std::max_element花的时间比my_max_element多x3. 那么，为什么我能这么容易地创建一个更快的std函数呢?我应该停止使用std并编写自己的函数，因为std太慢了吗?

on average std::max_element takes x3 more time then my_max_element. So why am I able to create a much faster std function so easily? Should I stop using std and write my own functions since std is so slow?

注意:起初我是因为是因为我在for循环中使用了整数i而不是迭代器，但这对现在来说已经无关紧要了.

Note: at first I though it was because I'm using and integer i in a for loop instead of an iterator, but that seams to not matter now.

编译信息:

g ++(GCC)4.8.2

g++ (GCC) 4.8.2

g ++ -O3-墙-c -fmessage-length = 0 -std = c ++ 0x

g++ -O3 -Wall -c -fmessage-length=0 -std=c++0x

推荐答案

在对此答案进行投票之前，请在您的计算机上进行测试(并验证)并评论/添加结果.请注意，我在测试中使用的矢量大小为1000 * 1000 * 1000.目前，此答案有19个投票，但只有一个发布的结果，这些结果未显示出下面描述的效果(尽管使用不同的测试代码获得，请参阅注释).

Before voting on this answer, please test (and verify) this on your machine and comment/add the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes but only one posted results, and these results did not show the effect described below (though obtained with a different test code, see comments).

似乎存在优化程序错误/工件.比较以下时间:

There seems to be an optimizer bug/artifact. Compare the times of:

template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;

  while(++__first != __last)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}

template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;
  ++__first;

  for(; __first != __last; ++__first)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}

第一个是原始libstdc ++实现，第二个应该是对行为或要求没有任何更改的转换. Clang ++对于这两个函数产生的运行时间非常相似，而第二个版本的g ++ 4.8.2快四倍.

The first is the original libstdc++ implementation, the second one should be a transformation without any changes in behaviour or requirements. Clang++ produces very similar run times for those two functions, whereas g++4.8.2 is four times faster with the second version.

按照Maxim的建议，将向量从int更改为int64_t，更改后的版本不是4，而仅是原始版本(g ++ 4.8.2)的1.7倍.

Following Maxim's proposal, changing the vector from int to int64_t, the changed version is not 4, but only 1.7 times faster than the original version (g++4.8.2).

区别在于*result的预测通用性，即存储当前max元素的值，这样就不必每次都从内存中重新加载它.这提供了一种更简洁的缓存访问模式:

The difference is in predictive commoning of *result, that is, storing the value of the current max element so that it does not have to be reloaded from memory each time. This gives a far cleaner cache access pattern:

w/o commoning     with commoning
*                 *
**                 *
 **                 *
  **                 *
  * *                 *
  *  *                 *
  *   *                 *

以下是用于比较的asm(rdi/rsi分别包含第一个/最后一个迭代器):

Here's the asm for comparison (rdi/rsi contain the first/last iterators respectively):

使用while循环(2.88743毫秒；要点):

With the while loop (2.88743 ms; gist):

    movq    %rdi, %rax
    jmp .L49
.L51:
    movl    (%rdi), %edx
    cmpl    %edx, (%rax)
    cmovl   %rdi, %rax
.L49:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    jne .L51

使用for循环(1235.55μs):

With the for loop (1235.55 μs):

    leaq    4(%rdi), %rdx
    movq    %rdi, %rax
    cmpq    %rsi, %rdx
    je  .L53
    movl    (%rdi), %ecx
.L54:
    movl    (%rdx), %r8d
    cmpl    %r8d, %ecx
    cmovl   %rdx, %rax
    cmovl   %r8d, %ecx
    addq    $4, %rdx
    cmpq    %rdx, %rsi
    jne .L54
.L53:

如果我通过在开始时将*result明确地存储到变量prev中并每次更新result并在比较中使用prev而不是*result来强行实现通用，那么我得到的循环更快(377.601μs):

If I force commoning by explicitly storing *result into a variable prev at the start and whenever result is updated, and using prev instead of *result in the comparison, I get an even faster loop (377.601 μs):

    movl    (%rdi), %ecx
    movq    %rdi, %rax
.L57:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    je  .L60
.L59:
    movl    (%rdi), %edx
    cmpl    %edx, %ecx
    jge .L57
    movq    %rdi, %rax
    addq    $4, %rdi
    movl    %edx, %ecx
    cmpq    %rsi, %rdi
    jne .L59
.L60:

之所以比for循环快，是因为上述条件移动(cmovl)很少执行( H _n 次，这是可以忽略的比例(H _n对数增长，因此H _n/n迅速接近0).条件移动代码只会在病理数据上更好，例如[1、0、3、2、5、4 ...].

The reason this is faster than the for loop is that the conditional moves (cmovl) in the above are a pessimisation as they are executed so rarely (Linus says that cmov is only a good idea if the branch is unpredictable). Note that for randomly distributed data the branch is expected to be taken H_n times, which is a negligible proportion (H_n grows logarithmically, so H_n/n rapidly approaches 0). The conditional-move code will only be better on pathological data e.g. [1, 0, 3, 2, 5, 4, ...].

这篇关于为什么C ++ std :: max_element这么慢?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

为什么C ++ std :: max_element这么慢? [英] why is c++ std::max_element so slow?

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

为什么C ++ std :: max_element这么慢? [英] why is c++ std::max_element so slow?

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭