Why is vector always slower than C array, at least in this case?

Problem Description

I am trying to find all the primes not greater than n using the Sieve of Eratosthenes, and I have the following code, with the sieve implemented once with a vector and once with a C array. I have found that almost all of the time, the C array is faster.

Using a vector:

int countPrimes_vector(int n) {                  
    int res = 0; 
    vector<char>bitmap(n);
    memset(&bitmap[0], '1', bitmap.size() * sizeof( bitmap[0]));
    //vector<bool>bitmap(n, true); Using this one is even slower!!

    for (int i = 2; i<n; ++i){

        if(bitmap[i]=='1')++res;
        if(sqrt(n)>i)
        {
             for(int j = i*i; j < n; j += i) bitmap[j] = '0';
        }
    }

    return res;
} 

Using a C array:

int countPrimes_array(int n) {  

    int res = 0; 
    bool * bitmap = new bool[n];
    memset(bitmap, true, sizeof(bool) * n);
    for (int i = 2; i<n; ++i){

        if(bitmap[i])++res;
        if(sqrt(n)>i)
        {
             for(int j = i*i; j < n; j += i) bitmap[j] = false;
        }
    }
    delete []bitmap;
    return res;
}

Test code:

clock_t t;
t = clock();
int a;
for(int i=0; i<10; ++i)a = countPrimes_vector(8000000); 
t = clock() - t;
cout<<"time for vector = "<<t<<endl;

t = clock();
int b;
for(int i=0; i<10; ++i)b = countPrimes_array(8000000); 
t = clock() - t;
cout<<"time for array = "<<t<<endl;

Output:

 time for vector = 32460000
 time for array = 29840000

I have tested this many times, and the C array is always faster. What is the reason behind it?

I have often heard that the performance of vector and a C array is the same, and that vector should always be used since it is a standard container. Is this statement true, at least generally speaking? In what cases should a C array be preferred?

Edit:

As the comments below suggest, after turning on optimization with -O2 or -O3 (originally it was compiled with plain g++ test.cpp), the time difference between vector and the C array is no longer meaningful; on some occasions the vector is even faster than the C array.

Recommended Answer

Your comparison contains inconsistencies that would explain the differences, and another factor could be compiling without sufficient optimization. Some implementations have a lot of additional code in the debug builds of the STL; for instance, MSVC does bounds checking on vector element accesses, which produces a significant reduction in speed in debug builds.

The following code shows a MUCH closer performance between the two, and the remaining difference is probably just a lack of samples (ideone has a 5-second timeout limit).

#include <vector>
#include <cmath>
#include <cstring>

int countPrimes_vector(int n) {  
    int res = 0; 
    std::vector<bool> bitmap(n, true);
    for (int i = 2; i<n; ++i){
        if(bitmap[i])
          ++res;
        if(sqrt(n)>i)
        {
             for(int j = i*i; j < n; j += i) bitmap[j] = false;
        }
    }
    return res;
}

int countPrimes_carray(int n) {  
    int res = 0; 
    bool* bitmap = new bool[n];
    memset(bitmap, true, sizeof(bool) * n);
    for (int i = 2; i<n; ++i){

        if(bitmap[i])++res;
        if(sqrt(n)>i)
        {
             for(int j = i*i; j < n; j += i) bitmap[j] = false;
        }
    }
    delete []bitmap;
    return res;
}

#include <chrono>
#include <iostream>

using namespace std;

void test(const char* description, int (*fn)(int))
{
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::milliseconds;

    auto start = clock::now();

    int a = 0;
    for(int i=0; i<9; ++i)
        a = fn(8000000);   // call the function under test via the pointer passed in
    (void)a;               // keep the result from being flagged as unused

    auto end = clock::now();
    auto diff = std::chrono::duration_cast<ms>(end - start);

    std::cout << "time for " << description << " = " << diff.count() << "ms\n";
}

int main()
{
    test("carray", countPrimes_carray);
    test("vector", countPrimes_vector);
}

Live demo: http://ideone.com/0Y9gQx

time for carray = 2251ms
time for vector = 2254ms

On some runs the C array version was even 1-2 ms slower; again, that comes down to too few samples taken on a shared resource.
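
One hedged way to reduce that noise is to time each function several times and keep the fastest run. The following is only a sketch, assuming the countPrimes_vector / countPrimes_carray functions above are in scope; the run count of 5 is arbitrary.

#include <algorithm>
#include <chrono>
#include <iostream>

// Sketch: time fn(n) several times and keep the fastest run, which tends to be
// less sensitive to other load on a shared machine than a single measurement.
template <typename Fn>
long long best_of(Fn fn, int n, int runs = 5)
{
    using namespace std::chrono;
    long long best = -1;
    for (int r = 0; r < runs; ++r) {
        auto start = steady_clock::now();
        volatile int result = fn(n);   // volatile: keep the call from being optimized away
        (void)result;
        long long ms = duration_cast<milliseconds>(steady_clock::now() - start).count();
        best = (best < 0) ? ms : std::min(best, ms);
    }
    return best;
}

// Usage, assuming the countPrimes_* functions above are visible:
//   std::cout << "vector: " << best_of(countPrimes_vector, 8000000) << "ms\n";
//   std::cout << "carray: " << best_of(countPrimes_carray, 8000000) << "ms\n";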

--- Edit ---

In the comments you ask why optimization can make a difference. Consider these two declarations:

std::vector<int> v = { 1, 2, 3 };
int b[] = { 1, 2, 3 };

We have two "arrays" of 3 elements each, so consider the following:

v[10]; // illegal!
b[10]; // illegal!

Debug versions of the STL can often catch this at run time (and, in some scenarios, at compile time). The raw array access may simply produce bad data or crash.
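
As a side illustration of that checked access, std::vector also offers at(), which performs the same kind of bounds check unconditionally, in any build mode, and reports the error by throwing std::out_of_range. A minimal sketch:

#include <iostream>
#include <stdexcept>
#include <vector>

int main()
{
    std::vector<int> v = { 1, 2, 3 };

    try {
        std::cout << v.at(10) << '\n';   // at() is always bounds-checked
    } catch (const std::out_of_range& e) {
        std::cout << "caught: " << e.what() << '\n';
    }

    // v[10] is unchecked in a release build: undefined behaviour, no exception.
    // A raw array access like b[10] has no check at all in any build.
}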

Additionally, the STL is implemented using many small member-function calls to things like size(), and because vector is a class, [] is actually routed through a function call (operator[]).

The compiler can eliminate many of these calls, but that is optimization. If you don't optimize, then something like

std::vector<int> v;
v[10];

does roughly this:

int* data() { return M_.data_; }

v.operator[](size_t idx = 10) {
    if (idx >= this->size()) {
        raise exception("invalid [] access");
    }
    return *(data() + idx);
}

and even though data() is an "inlinable" function, to make debugging easier the unoptimized code leaves it as is. When you build with optimization, the compiler recognizes that the implementations of these functions are so trivial that it can substitute them directly into the call sites, and it quickly winds up simplifying all of the above into something much more like a plain array access.

For example, in the above case, it may first reduce operator[] to

v.operator[](size_t idx = 10) {
    if (idx >= this->size()) {
        raise exception("invalid [] access");
    }
    return *(M_.data_ + idx);
}

And since compiling without debugging probably removes the bounds check, it becomes

v.operator[](size_t idx = 10) {
    return *(M_.data_ + idx);
}

so now the inliner can reduce

x = v[1];

to

x = *(v.M_.data_ + 1); // comparable to v.M_.data_[1];

There is still a tiny penalty. The C array consists of the data block in memory plus a single local variable, fitting in a register, that points to that block; your accesses are made directly relative to it.

With a vector, though, you have the vector object itself, which holds a pointer to the data, a size, and a capacity:

vector<T>  // pseudo code
{
    T* ptr;
    size_t size;
    size_t capacity;
}

If you were counting machine instructions, the vector has 3 member variables to initialize and manage.
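
A quick way to see that extra bookkeeping is to compare object sizes. This is only a rough sketch: the exact layout and sizes are implementation-defined, and three pointer-sized members (typically 24 vs 8 bytes on a 64-bit implementation) is merely the common case.

#include <iostream>
#include <vector>

int main()
{
    // On a typical 64-bit implementation this prints 24 and 8: the vector object
    // carries a data pointer plus size/capacity bookkeeping, while the handle to a
    // heap-allocated C array is a single raw pointer.
    std::cout << "sizeof(std::vector<int>) = " << sizeof(std::vector<int>) << '\n';
    std::cout << "sizeof(int*)             = " << sizeof(int*) << '\n';
}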

When you write

x = v[1];

then, given the above approximation of a vector, you are saying something along the lines of:

T* ptr = v.data();
x = ptr[1];

but the compiler is usually smart enough, when building with optimization, to recognize that it can do the first line before the loop; however, this tends to cost a register.

T* ptr = v.data(); // in debug, function call, otherwise inlined.
for ... {
    x = ptr[1];
}

So you are probably looking at a handful of extra machine instructions per iteration of your test function, or, on a modern processor, maybe a nanosecond or two of extra wall-clock time.
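
To tie this back to the sieve in the question: with optimization the compiler effectively does that pointer hoisting for you; without it, you can do it by hand. Below is a sketch of a lightly adapted vector version (0/1 flags instead of the '0'/'1' characters, and sqrt(n) hoisted as well) that goes through a pointer obtained once from data().

#include <cmath>
#include <vector>

// Sketch only: the vector-based sieve with the element pointer hoisted out of the
// loops, which is roughly what an optimizing compiler does for bitmap[j] anyway.
int countPrimes_vector_hoisted(int n)
{
    int res = 0;
    std::vector<char> bitmap(n, 1);
    char* p = bitmap.data();                       // fetched once instead of via operator[]
    int limit = static_cast<int>(std::sqrt(n));    // hoisted out of the loop as well
    for (int i = 2; i < n; ++i) {
        if (p[i]) ++res;
        if (i <= limit) {
            for (int j = i * i; j < n; j += i) p[j] = 0;
        }
    }
    return res;
}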
