Performance of resizing std::vector<std::unique_ptr<T>>


Question



The general conception seems to be that std::unique_ptr has no time overhead compared to properly used owning raw pointers, given sufficient optimization (cf. stackoverflow.com/questions/28896490).

But what about using std::unique_ptr in compound data structures, in particular std::vector<std::unique_ptr<T>>? For instance, resizing the underlying data of a vector, which can happen during push_back. To isolate the performance, I loop around pop_back, shrink_to_fit, emplace_back:

#include <chrono>
#include <vector>
#include <memory>
#include <iostream>

constexpr size_t size = 1000000;
constexpr size_t repeat = 1000;
using my_clock = std::chrono::high_resolution_clock;

template<class T>
auto test(std::vector<T>& v) {
    v.reserve(size);
    for (size_t i = 0; i < size; i++) {
        v.emplace_back(new int());
    }
    auto t0 = my_clock::now();
    for (int i = 0; i < repeat; i++) {
        auto back = std::move(v.back());
        v.pop_back();
        v.shrink_to_fit();
        if (back == nullptr) throw "don't optimize me away";
        v.emplace_back(std::move(back));
    }
    return my_clock::now() - t0;
}

int main() {
    std::vector<std::unique_ptr<int>> v_u;
    std::vector<int*> v_p;

    auto millis_p = std::chrono::duration_cast<std::chrono::milliseconds>(test(v_p));
    auto millis_u = std::chrono::duration_cast<std::chrono::milliseconds>(test(v_u));
    std::cout << "raw pointer: " << millis_p.count() << " ms, unique_ptr: " << millis_u.count() << " ms\n";
    for (auto p : v_p) delete p; // I don't like memory leaks ;-)
}

Compiling the code with -O3 -o -march=native -std=c++14 -g with gcc 7.1.0, clang 3.8.0, and icc 17.0.4 on Linux on an Intel Xeon E5-2690 v3 @ 2.6 GHz (no turbo):

raw pointer: 2746 ms, unique_ptr: 5140 ms  (gcc)
raw pointer: 2667 ms, unique_ptr: 5529 ms  (clang)
raw pointer: 1448 ms, unique_ptr: 5374 ms  (intel)

The raw pointer version spends all its time in an optimized memmove (Intel's seems to be much better than clang's and gcc's). The unique_ptr code seems to first copy the vector data from one memory block to the other and then assign zero to the original block, all in a horribly unoptimized loop. It then loops over the original block of data again to check whether any of the values that were just zeroed are nonzero and need to be deleted. The full gory detail can be seen on godbolt. The question is not how the compiled code differs; that is pretty clear. The question is why the compiler fails to optimize what is generally regarded as a no-extra-overhead abstraction.

Trying to understand how the compilers reason about handling std::unique_ptr, I was looking a bit more at isolated code. For instance:

void foo(std::unique_ptr<int>& a, std::unique_ptr<int>& b) {
  a.release();
  a = std::move(b);
}

or the similar

a.release();
a.reset(b.release());

None of the x86 compilers seems able to optimize away the pointless if (ptr) delete ptr;. The Intel compiler even gives the delete a 28% chance. Surprisingly, the delete check is consistently omitted for:

auto tmp = b.release();
a.release();
a.reset(tmp);
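
For reference, here is roughly what the first snippet boils down to if a typical unique_ptr implementation is inlined by hand (a_ptr and b_ptr stand for the internal pointer members reached through the references; this is just an illustration, not any real library's layout):

void foo_expanded(int*& a_ptr, int*& b_ptr) {
  // a.release(): give up ownership, discard the returned pointer
  a_ptr = nullptr;
  // a = std::move(b), i.e. reset(b.release()):
  int* incoming = b_ptr;
  b_ptr = nullptr;
  int* old = a_ptr;      // can only be the nullptr stored just above
  a_ptr = incoming;
  if (old) delete old;   // dead by construction, which is why I would expect
                         // the check to disappear in foo() as well
}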

These bits are not the main aspect of this question, but all of this makes me feel that I am missing something.

Why do various compilers fail to optimize reallocation within std::vector<std::unique_ptr<int>>? Is there anything in the standard that prevents generating code as efficient as with raw pointers? Is this an issue with the standard library implementation? Or are the compilers just not clever enough (yet)?

What can one do to avoid performance impact compared to using raw pointers?

Note: Assume that T is polymorphic and expensive to move, so std::vector<T> is not an option.

Solution

The claim that unique_ptr performs as well as a raw pointer after optimization mostly applies to the basic operations on a single pointer: creation, dereferencing, assignment, and deletion. Those operations are defined simply enough that an optimizing compiler can usually make the required transformations so that the resulting code is equivalent (or nearly so) in performance to the raw version0.
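
As a minimal illustration of that claim (my own toy functions; with a reasonably recent compiler at -O2 both typically compile to the same operator new call, a store, and the same operator delete call):

#include <memory>

void raw_roundtrip() {
    int* p = new int(42);
    *p = 7;
    delete p;
}

void unique_roundtrip() {
    auto p = std::make_unique<int>(42);
    *p = 7;
}   // p's destructor issues the same delete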

One place this falls apart is in the higher-level optimizations applied to array-based containers such as std::vector, as you have noted with your test. These containers usually use source-level optimizations which depend on type traits to determine at compile time whether a type can safely be copied with a byte-wise copy such as memcpy, and delegate to such a method if so, or otherwise fall back to an element-wise copy loop.

To be safely copyable with memcpy an object must be trivially copyable. std::unique_ptr is not trivially copyable, since it fails several of the requirements, such as having only trivial or deleted copy and move constructors. The exact mechanism depends on the standard library involved, but in general a quality std::vector implementation will end up calling a specialized form of something like std::uninitialized_copy for trivially copyable types, which just delegates to memmove.
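
A quick check with the standard trait confirms which side of that dispatch each type lands on:

#include <memory>
#include <type_traits>

// unique_ptr fails the trait that gates the memcpy/memmove path; a raw pointer passes it.
static_assert(!std::is_trivially_copyable<std::unique_ptr<int>>::value,
              "unique_ptr is not trivially copyable");
static_assert(std::is_trivially_copyable<int*>::value,
              "raw pointers are trivially copyable");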

The typical implementation details are quite tortured, but for libstdc++ (used by gcc) you can see the high-level divergence in std::uninitialized_copy:

 template<typename _InputIterator, typename _ForwardIterator>
 inline _ForwardIterator
 uninitialized_copy(_InputIterator __first, _InputIterator __last,
                    _ForwardIterator __result)
 {
        ...
   return std::__uninitialized_copy<__is_trivial(_ValueType1)
                                    && __is_trivial(_ValueType2)
                                    && __assignable>::
     __uninit_copy(__first, __last, __result);
 }

From there you can take my word that many of the std::vector "movement" methods end up here, and that __uninitialized_copy<true>::__uninit_copy(...) ultimately calls memmove while the <false> version doesn't, or you can trace through the code yourself (but you already saw the result in your benchmark).
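
In simplified form (my own rendering, not the literal libstdc++ source), the two specializations behave roughly like this:

#include <algorithm>
#include <iterator>
#include <memory>
#include <new>

// Non-trivial element types: a plain placement-new copy loop.
template<bool Trivial>
struct uninit_copy_sketch {
    template<class In, class Fwd>
    static Fwd run(In first, In last, Fwd result) {
        for (; first != last; ++first, ++result)
            ::new (static_cast<void*>(std::addressof(*result)))
                typename std::iterator_traits<Fwd>::value_type(*first);
        return result;
    }
};

// Trivial element types: defer to std::copy, which bottoms out in memmove.
template<>
struct uninit_copy_sketch<true> {
    template<class In, class Fwd>
    static Fwd run(In first, In last, Fwd result) {
        return std::copy(first, last, result);
    }
};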

Ultimately, then, you end up with several loops that perform the required copy steps for non-trivial objects, such as calling the move constructor for each destination object and subsequently calling the destructor of every source object. These are separate loops, and even modern compilers are pretty much unable to reason along the lines of "OK, in the first loop I moved from all the source objects, so their ptr members are null, so the second loop is a no-op". Finally, to equal the speed of raw pointers, compilers would not only need to optimize across these two loops, they would also need a transformation which recognizes that the whole thing can be replaced by memcpy or memmove2.

So one answer to your question is that compilers just aren't smart enough to do this optimization, but it's largely because the "raw" version has a lot of compile-time help to skip the need for this optimization entirely.

Loop Fusion

As mentioned the existing vector implementations implement a resize-type operation in two separate loops (in addition to non-loop work such as allocating the new storage and freeing the old storage):

  • Copying the source objects into the newly allocated destination array (conceptually using something like placement new calling the move constructor).
  • Destroying the source objects in the old region.

Conceptually you could imagine an alternative: doing it all in one loop, copying each element and then immediately destroying it. It is possible that a compiler could even notice that the two loops iterate over the same set of values and fuse them into one. Apparently, however, gcc doesn't do any loop fusion today (https://gcc.gnu.org/ml/gcc/2015-04/msg00291.html), and neither clang nor icc does, if you believe this test.

So then we are left trying to put the loops together explicitly at the source level.
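
For the unique_ptr case, a hand-fused relocation helper might look like the following (relocate_fused is my own name, not a library facility; dst is assumed to point at freshly allocated, non-overlapping raw storage):

#include <cstddef>
#include <memory>
#include <new>

template <class T>
void relocate_fused(std::unique_ptr<T>* dst, std::unique_ptr<T>* src, std::size_t n) {
    using U = std::unique_ptr<T>;
    for (std::size_t i = 0; i < n; ++i) {
        // move-construct the destination, which nulls out src[i] ...
        ::new (static_cast<void*>(dst + i)) U(std::move(src[i]));
        // ... and destroy the source immediately, so the optimizer sees that
        // the "if (ptr) delete ptr" inside the destructor is dead right here
        src[i].~U();
    }
}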

Now, the two-loop implementation helps preserve the exception-safety contract of the operation, since no source objects are destroyed until we know the construction part of the copy has completed, but it also helps to optimize the copy and the destruction when we have trivially copyable and trivially destructible objects, respectively. In particular, with simple trait-based selection we can replace the copy with a memmove, and the destruction loop can be elided entirely3.
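
For the destruction half, the trait-based elision is particularly simple; a sketch (illustrative only, not real library code):

#include <type_traits>

template <class T>
void destroy_range_sketch(T* first, T* last) {
    // For int* this condition is false at compile time and the loop disappears;
    // for unique_ptr<int> every element gets a destructor call.
    if (!std::is_trivially_destructible<T>::value) {
        for (; first != last; ++first)
            first->~T();
    }
}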

So the two-loop approach helps when those optimizations apply, but it actually hurts in the general case of objects which are neither trivially copyable nor trivially destructible. It means you need two passes over the objects, and you lose the opportunity to optimize and eliminate code between the copy of an object and its subsequent destruction. In the unique_ptr case the compiler cannot propagate the knowledge that the source unique_ptr will have a NULL internal ptr member and hence skip the if (ptr) delete ptr check entirely4.

Trivially Movable

Now one might ask whether we could apply the same type-traits compile-time optimization to the unique_ptr case. For example, one might look at the trivially-copyable requirements and conclude that they are perhaps too strict for the common move operations in std::vector. Sure, a unique_ptr is evidently not trivially copyable, since a bit-wise copy would leave both the source and the destination object owning the same pointer (and result in double deletion), but it seems that it should be bit-wise movable: if you move a unique_ptr from one area of memory to another, such that you no longer consider the source a live object (and hence won't call its destructor), it should "just work" for the typical unique_ptr implementation.

Unfortunately, no such "trivial move" concept exists, although you could try to roll your own. There seems to be an open debate about whether this is UB or not for objects that can be byte-wise copied and do not depend on their constructor or destructor behavior in the move scenario.

You could always implement your own "trivially movable" concept, which would be something like: (a) the object has a trivial move constructor, and (b) when used as the source argument of a move construction the object is left in a state where its destructor has no effect. Note that such a definition is currently mostly useless, since a "trivial move constructor" (basically an element-wise copy and nothing else) is not consistent with any modification of the source object. So, for example, a trivial move constructor cannot set the ptr member of the source unique_ptr to zero. You would therefore need to jump through some more hoops, such as introducing the concept of a destructive move operation which leaves the source object destroyed, rather than in a valid-but-unspecified state.
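
A rough sketch of such an opt-in trait (is_trivially_relocatable and relocate_n are my own names; whether the memcpy branch is even allowed for unique_ptr is exactly the open question above):

#include <cstddef>
#include <cstring>
#include <memory>
#include <new>
#include <type_traits>

// Default: only what the standard already lets us memcpy.
template <class T>
struct is_trivially_relocatable : std::is_trivially_copyable<T> {};

// Explicit opt-in: we promise that copying the bytes to new storage and then
// never touching the source again (in particular, never running its
// destructor) behaves like a move followed by destruction.
template <class T>
struct is_trivially_relocatable<std::unique_ptr<T>> : std::true_type {};

template <class T>
void relocate_n(T* dst, T* src, std::size_t n) {
    if (is_trivially_relocatable<T>::value) {
        std::memcpy(static_cast<void*>(dst), static_cast<const void*>(src),
                    n * sizeof(T));
        // the sources are now considered dead; their destructors must not run
    } else {
        for (std::size_t i = 0; i < n; ++i) {
            ::new (static_cast<void*>(dst + i)) T(std::move(src[i]));
            src[i].~T();
        }
    }
}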

You can find some more detailed discussion of this "trivially movable" idea in this thread on the ISO C++ usenet discussion group. In particular, in the linked reply, the exact issue of vectors of unique_ptr is addressed:

It turns out many smart pointers (unique_ptr and shared_ptr included) fall into all three of those categories and by applying them you can have vectors of smart pointers with essentially zero overhead over raw pointers even in non-optimized debug builds.

See also the relocator proposal.


0 Although the non-vector examples at the end of your question show that this isn't always the case. Here it is due to possible aliasing, as zneak explains in his answer. Raw pointers avoid many of these aliasing issues since they lack the indirection that unique_ptr has (e.g., you pass a raw pointer by value, rather than a structure containing a pointer by reference), and the if (ptr) delete ptr check can then often be omitted entirely.

2 This is actually harder than you might think, because memmove, for example, has subtly different semantics than an object copy loop when the source and destination overlap. Of course the high-level type-traits code that works for raw pointers knows (by contract) that there is no overlap, or that the behavior of memmove is consistent even if there is overlap, but proving the same thing at some later arbitrary optimization pass may be much harder.

3 It is important to note that these optimizations are more or less independent. For example, many objects that are trivially destructible are not trivially copyable.

4 Although in my test neither gcc nor clang was able to suppress the check, even with __restrict__ applied, apparently due to insufficiently powerful aliasing analysis, or perhaps because std::move strips the "restrict" qualifier somehow.
