由于标准容器中的元素的默认初始化导致的性能下降 [英] Performance degradation due to default initialisation of elements in standard containers

查看:110
本文介绍了由于标准容器中的元素的默认初始化导致的性能下降的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

(是,我知道有问题几乎相同的标题,但答案不令人满意,见下文)



使用编译器优化。这是现在固定的,但为了避免琐细的优化和接近我的实际用例,测试已分为两个编译单元。



事实上,对于性能关键的应用程序, std :: vector<> 的构造函数具有线性复杂性是一个讨厌的事。考虑这个简单的代码

  //编译单元1:
void set_v0(type * x,size_t n)
{
for(size_t i = 0; i x [i] = simple_function(i);
}

//编译单元2:
std :: vector< type> x(n); //默认初始化是浪费
set_v0(x.data(),n); //通过构造<$ c $来浪费大量时间时,重写初始值



c> x 。传统的方式,如这个问题所探讨的,似乎只是储备存储并使用 push_back()填写数据:

  //编译单元1:
void set_v1(std :: vector< type>& x,size_t n)
{
x.reserve(n);
for(size_t i = 0; i x.push_back(simple_function(i));
}

//编译单位2:
std :: vector< type> X(); x.reserve(n); // no initialisation
set_v1(x,n); // using push_back()

但是,如我的意见所示, push_back()本质上很慢,使得第二种方法实际上比第一种方法更慢用于足够简单的可构造对象,例如 size_t s,when for

  simple_function = [](size_t i){return i; }; 

我得到以下时间(使用gcc 4.8与-O3; clang 3.2生成〜10% )

  timing vector :: vector(n)+ set_v0 
n = 10000时间:3.9e-05秒
n = 100000时间:0.00037秒
n = 1000000时间:0.003678秒
n = 10000000时间:0.03565秒
n = 100000000时间:0.373275 sec

定时矢量:: vector()+ vector :: reserve(n)+ set_v1();
n = 10000时间:1.9e-05秒
n = 100000时间:0.00018秒
n = 1000000时间:0.00177秒
n = 10000000时间:0.020829秒
n = 100000000时间:0.435393 sec

如果可以省略元素的默认结构,通过以下作弊版本

  //编译单元2 
std :: vector< type> X; x.reserve(n); // no initialisation
set_v0(x,n); //错误:写向量外的结束
//注意:vector :: size()== 0

当我们得到

  timing vector :: vector + vector :: reserve(n)+ set_v0 (CHEATING)
n = 10000 time:8e-06 sec
n = 100000 time:7.2e-05 sec
n = 1000000 time:0.000776 sec
n = 10000000 time:0.01119 sec
n = 100000000 time:0.298024 sec

因此,我的第一个问题:有没有任何法律方式使用标准库容器,将给这些后面的时间?或者,我必须自己管理内存吗?



现在,我真正想要的是使用多线程填充容器。天真的代码(为了简单起见,在这个例子中使用openMP,暂时不包括俚语)

  //编译单元1 
void set_v0(type * x,size_t n)
{
#pragma omp for //只与上面的set_v0()区别
for(size_t i = 0; i < ++ i)
x [i] = simple_function(i);
}

//编译单元2:
std :: vector< type> x(n); //默认初始化不是多线程的
#pragma omp parallel
set_v0(x,n); //并行重写初始值

现在受到以下事实的影响:所有元素的默认初始化不是多线程的,导致潜在的严重性能下降。这里是 set_omp_v0()和一个等效的欺骗方法(使用我的macbook的intel i7芯片与4核心,8个超线程)的时间:

  timing std :: vector :: vector(n)+ omp parallel set_v0()
n = 10000 time:0.000389 sec
n = 100000 time: 0.000226秒
n = 1000000时间:0.001406秒
n = 10000000时间:0.019833秒
n = 100000000时间:0.35531秒

时间矢量::矢量+矢量::保留(n)+ omp parallel set_v0(); (CHEATING)
n = 10000时间:0.000222秒
n = 100000时间:0.000243秒
n = 1000000时间:0.000793秒
n = 10000000时间:0.008952秒
n = 100000000时间:0.089619 sec

请注意,欺骗版本比串行欺骗版本快约3.3倍

:有没有合法的方式使用标准库容器在多线程的情况下给这些后面的时间?



PS。我发现这个问题,其中 std :: vector 通过为它提供一个 uninitialized_allocator 来进行默认初始化。
这不再是标准兼容,但对我的测试用例非常好(请参见我自己的答案和这个问题)。

解决方案



Q1 有任何法律方式使用标准库容器,时间?在一定程度上,如Mark和Evgeny的答案所示。向 std :: vector 的构造函数提供生成器的方法消除了默认构造。



Q2 没有,我没有使用标准库容器的任何合法方法(在多线程的情况下,也这样觉得。原因是,在构建 任何符合标准的容器必须初始化其元素 ,已经确保调用元素析构函数(在销毁或调整容器大小时)形成良好。由于std库容器不支持在构造元素中使用多线程,因此 Q1 的窍门不能在这里复制,所以我们不能省略默认的结构。



因此,如果我们想使用C ++进行高性能计算,在管理大量数据时,我们的选择有限。我们可以



1 声明一个容器对象,并在同一个编译单元中立即填充它(同时),当编译器希望优化在构建时初始化;



2 使用 new [] delete [] 或甚至 malloc()所有的内存管理,在后一种情况下,元素的构造是我们的责任,我们对C ++标准库的潜在使用非常有限。



3 使用自定义 unitialised_allocator 不能初始化其元素删除默认结构。在创意 Jared Hoberock 这样的分配器可能看起来像这样(另见这里):

  //基于设计Jared Hoberock 
//编辑(Walter)10-May-2013,2014年4月23日
模板< typename T,typename base_allocator>
struct uninitialised_allocator
:base_allocator
{
static_assert(std :: is_same< T,typename base_allocator :: value_type> :: value,
allocator :: value_type mismatch );

template< typename U>
using base_t =
typename std :: allocator_traits< base_allocator> :: template rebind_alloc< U>

//重新绑定到base_t< U>对于所有U!= T:我们不会离开其他类型未初始化!
template< typename U>
struct rebind
{
typedef typename
std :: conditional< std :: is_same< T,U> :: value,
uninitialised_allocator,base_t& > :: type other;
}

//只有T型对象的缺省构造
template< typename U>
typename std :: enable_if< std :: is_same< T,U> :: value&&
std :: is_trivially_default_constructible< U> :: value> :: type
construct(U *){}

// elide琐碎的默认破坏类型T的对象
template< typename U>
typename std :: enable_if< std :: is_same< T,U> :: value&&
std :: is_trivially_destructible< U> :: value> :: type
destroy(U *){}

//将一切转移到基础
base_allocator :: construct;
using base_allocator :: destroy;
};

然后, unitialised_vector<> 定义如下:

  template< typename T,typename base_allocator = std :: allocator< T> 
using uninitialised_vector = std :: vector< T,uninitialised_allocator< T,base_allocator>>

,我们仍然可以使用几乎所有的标准库的功能。虽然必须说, uninitialised_allocator ,因此暗示 unitialised_vector 不符合标准,因为它的元素不是默认构造的(例如向量< int> 在构造之后不会具有所有 0 )。



当我的小测试问题使用这个工具,我得到很好的结果:

  vector :: vector(n)+ set_v0(); 
n = 10000时间:3.7e-05秒
n = 100000时间:0.000334秒
n = 1000000时间:0.002926秒
n = 10000000时间:0.028649秒
n = 100000000时间:0.293433 sec

timing vector :: vector()+ vector :: reserve()+ set_v1();
n = 10000时间:2e-05秒
n = 100000时间:0.000178秒
n = 1000000时间:0.001781秒
n = 10000000时间:0.020922秒
n = 0.428243 sec

时间矢量:: vector()+ vector :: reserve()+ set_v0();
n = 10000时间:9e-06秒
n = 100000时间:7.3e-05秒
n = 1000000时间:0.000821秒
n = 10000000时间:0.011685秒
n = 100000000时间:0.291055秒

时间矢量:: vector(n)+ omp parllel set_v0();
n = 10000时间:0.00044秒
n = 100000时间:0.000183秒
n = 1000000时间:0.000793秒
n = 10000000时间:0.00892秒
n = 100000000时间:0.088051秒

timing vector :: vector()+ vector :: reserve()+ omp parallel set_v0();
n = 10000时间:0.000192秒
n = 100000时间:0.000202秒
n = 1000000时间:0.00067秒
n = 10000000时间:0.008596秒
n = 100000000时间:0.088045秒当作弊和合法版本之间没有任何差异时,


(Yes, I know there is a question with almost the same title, but the answer was not satisfactory, see below)

EDIT Sorry, the original question didn't use compiler optimization. This is now fixed, but to avoid trivial optimization and to come closer to my actual use case, the test has been split into two compilation units.

The fact that the constructor of std::vector<> has linear complexity is a nuisance when it comes to performance-critical applications. Consider this simple code

// compilation unit 1:
void set_v0(type*x, size_t n)
{
  for(size_t i=0; i<n; ++i)
    x[i] = simple_function(i);
}

// compilation unit 2:
std::vector<type> x(n);                     // default initialisation is wasteful
set_v0(x.data(),n);                         // over-writes initial values

when a significant amount of time is wasted by constructing x. The conventional way around this, as explored by this question, seems to be to merely reserve the storage and use push_back() to fill in the data:

// compilation unit 1:
void set_v1(std::vector<type>&x, size_t n)
{
  x.reserve(n);
  for(size_t i=0; i<n; ++i)
    x.push_back(simple_function(i));
}

// compilation unit 2:
std::vector<type> x(); x.reserve(n);        // no initialisation
set_v1(x,n);                                // using push_back()

However, as indicated by my comment, the push_back() is inherently slow, making this second approach actually slower than the first one for sufficiently simply constructible objects, such as size_ts, when for

simple_function = [](size_t i) { return i; };

I get the following timings (using gcc 4.8 with -O3; clang 3.2 produced ~10% slower code)

timing vector::vector(n) + set_v0();
n=10000 time: 3.9e-05 sec
n=100000 time: 0.00037 sec
n=1000000 time: 0.003678 sec
n=10000000 time: 0.03565 sec
n=100000000 time: 0.373275 sec

timing vector::vector() + vector::reserve(n) + set_v1();
n=10000 time: 1.9e-05 sec
n=100000 time: 0.00018 sec
n=1000000 time: 0.00177 sec
n=10000000 time: 0.020829 sec
n=100000000 time: 0.435393 sec

The speed-up actually possible if one could elide the default construction of elements can be estimated by the following cheating version

// compilation unit 2
std::vector<type> x; x.reserve(n);          // no initialisation
set_v0(x,n);                                // error: write beyond end of vector
                                            // note: vector::size() == 0

when we get

timing vector::vector + vector::reserve(n) + set_v0();          (CHEATING)
n=10000 time: 8e-06 sec
n=100000 time: 7.2e-05 sec
n=1000000 time: 0.000776 sec
n=10000000 time: 0.01119 sec
n=100000000 time: 0.298024 sec

So, my first question: Is there any legal way to use a standard library container which would give these latter timings? Or do I have to resort to manage the memory myself?

Now, what I really want, is to use multi-threading to fill in the container. The naive code (using openMP in this example for simplicity, which excludes clang for the moment)

// compilation unit 1
void set_v0(type*x, size_t n)
{
#pragma omp for                       // only difference to set_v0() from above 
  for(size_t i=0; i<n; ++i)
    x[i] = simple_function(i);
}

// compilation unit 2:
std::vector<type> x(n);               // default initialisation not mutli-threaded
#pragma omp parallel
set_v0(x,n);                          // over-writes initial values in parallel

now suffers from the fact that the default initialization of all elements is not multi-threaded, resulting in an potentially serious performance degradation. Here are the timings for set_omp_v0() and a equivalent cheating method (using my macbook's intel i7 chip with 4 cores, 8 hyperthreads):

timing std::vector::vector(n) + omp parallel set_v0()
n=10000 time: 0.000389 sec
n=100000 time: 0.000226 sec
n=1000000 time: 0.001406 sec
n=10000000 time: 0.019833 sec
n=100000000 time: 0.35531 sec

timing vector::vector + vector::reserve(n) + omp parallel set_v0(); (CHEATING)
n=10000 time: 0.000222 sec
n=100000 time: 0.000243 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.008952 sec
n=100000000 time: 0.089619 sec

Note that the cheat version is ~3.3 times faster than the serial cheat version, roughly as expected, but the standard version is not.

So, my second question: Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?

PS. I found this question, where std::vector is tricked into avoiding the default initialization by providing it with a uninitialized_allocator. This is no longer standard compliant, but works very well for my test case (see my own answer below and this question for details).

解决方案

Okay, here is what I've learned since asking this question.

Q1 (Is there any legal way to use a standard library container which would give these latter timings?) Yes to some degree, as shown in the answers by Mark and Evgeny. The method of providing a generator to the constructor of std::vector elides the default construction.

Q2 (Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?) No, I don't think so. The reason is that on construction any standard-compliant container must initialise its elements, already to ensure that the call to the element destructors (upon destruction or resizing of the container) is well-formed. As the std library containers do not support the usage of multi-threading in constructing their elements, the trick of Q1 cannot be replicated here, so we cannot elide the default construction.

Thus, if we want to use C++ for high-performance computing our options are somewhat limited when it comes to managing large amounts of data. We can

1 declare a container object and, in the same compilation unit, immediately fill it (concurrently), when the compiler hopefully optimizes the initialization at construction away;

2 resort to new[] and delete[] or even malloc() and free(), when all the memory management and, in the latter case, construction of elements is our responsibility and our potential usage of the C++ standard library very limited.

3 trick a std::vector to not initialise its elements using a custom unitialised_allocator that elides the default construction. Following the ideas of Jared Hoberock such an allocator could look like this (see also here):

// based on a design by Jared Hoberock
// edited (Walter) 10-May-2013, 23-Apr-2014
template<typename T, typename base_allocator >
struct uninitialised_allocator
  : base_allocator
{
  static_assert(std::is_same<T,typename base_allocator::value_type>::value,
                "allocator::value_type mismatch");

  template<typename U>
  using base_t =
    typename std::allocator_traits<base_allocator>::template rebind_alloc<U>;

  // rebind to base_t<U> for all U!=T: we won't leave other types uninitialised!
  template<typename U>
  struct rebind
  {
    typedef typename
    std::conditional<std::is_same<T,U>::value,
                     uninitialised_allocator, base_t<U> >::type other; 
  }

  // elide trivial default construction of objects of type T only
  template<typename U>
  typename std::enable_if<std::is_same<T,U>::value && 
                          std::is_trivially_default_constructible<U>::value>::type
  construct(U*) {}

  // elide trivial default destruction of objects of type T only
  template<typename U>
  typename std::enable_if<std::is_same<T,U>::value && 
                          std::is_trivially_destructible<U>::value>::type
  destroy(U*) {}

  // forward everything else to the base
  using base_allocator::construct;
  using base_allocator::destroy;
};

Then an unitialised_vector<> template could be defined like this:

template<typename T, typename base_allocator = std::allocator<T>>
using uninitialised_vector = std::vector<T,uninitialised_allocator<T,base_allocator>>;

and we can still use almost all of the standard library's functionality. Though it must be said that the uninitialised_allocator, and hence by implication the unitialised_vector are not standard compliant, because its elements are not default constructed (e.g. a vector<int> will not have all 0 after construction).

When using this tool for my little test problem, I get excellent results:

timing vector::vector(n) + set_v0();
n=10000 time: 3.7e-05 sec
n=100000 time: 0.000334 sec
n=1000000 time: 0.002926 sec
n=10000000 time: 0.028649 sec
n=100000000 time: 0.293433 sec

timing vector::vector() + vector::reserve() + set_v1();
n=10000 time: 2e-05 sec
n=100000 time: 0.000178 sec
n=1000000 time: 0.001781 sec
n=10000000 time: 0.020922 sec
n=100000000 time: 0.428243 sec

timing vector::vector() + vector::reserve() + set_v0();
n=10000 time: 9e-06 sec
n=100000 time: 7.3e-05 sec
n=1000000 time: 0.000821 sec
n=10000000 time: 0.011685 sec
n=100000000 time: 0.291055 sec

timing vector::vector(n) + omp parllel set_v0();
n=10000 time: 0.00044 sec
n=100000 time: 0.000183 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.00892 sec
n=100000000 time: 0.088051 sec

timing vector::vector() + vector::reserve() + omp parallel set_v0();
n=10000 time: 0.000192 sec
n=100000 time: 0.000202 sec
n=1000000 time: 0.00067 sec
n=10000000 time: 0.008596 sec
n=100000000 time: 0.088045 sec

when there is no difference any more between the cheating and "legal" versions.

这篇关于由于标准容器中的元素的默认初始化导致的性能下降的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆