Measuring performance of vector<unique_ptr> on VS2013?


Problem Description


TL;DR: Is the VS2013 optimizer confused, are my measurements wrong, or does the global dummy in fact need to be volatile to make the test valid, or __?

Disclaimer: This is mostly out of "academic" interest; I would not expect the differences I see to really affect any production code.


Intro: Some recent measurements of mine led me to this question because I saw significant differences between std::vector<std::unique_ptr<T>> and boost::ptr_vector on VS2013 (also see the comments there).

It would appear that for my specific test case, accessing elements in a boost::ptr_vector can be 50% faster than using a vector of unique_ptr!

My test code is here: http://coliru.stacked-crooked.com/a/27dc2f1b91380cca (I'll refrain from including all of it in this question; snippets appear below.)

  • gcc 4.8 doesn't report any differences, so this is a VS2013 thing.

    Start...
    The timings are as follows for accessing all (1000000) elements 200 times:
    * St6vectorISt10unique_ptrIjSt14default_deleteIjEESaIS3_EE: 1764 ms
    * N5boost10ptr_vectorIjNS_20heap_clone_allocatorESaIPvEEE: 1781 ms
    Dummy output: 500000
    

  • My timings for exactly the test code linked to are:

    Start...
    The timings are as follows for accessing all (1.000.000) elements 200 times:
    * class std::vector<....>: 344 ms
    * class boost::ptr_vector<unsigned int,....>: 216 ms
    Dummy output: 500.000
    

The test loop looks like this, I'll also keep the lengthy comment there which explains what I see:

template<typename C>
void RunContainerAccess(C& c) {
    for (size_t i = 0; i != loop_count; ++i) {
        for (auto& e : c) {
            // This is relevant: 
            // If the if-condition is present, VC++2013 will show 
            // approx. the same runtime for both cases. However,
            // if the only line in this loop is assigning the element
            // to the pGlobalDummy, *then* ptr_vector will run approx. 50%
            // faster than the unique_vector version!
            //
            // g++-4.8 does not show this behaviour
            //
            // Note: VS2013 command line: (release; /O2; no whole prg opt)
            //   /GS /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_LIB" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\simple.pch" 
            //
            // Note: http://coliru.stacked-crooked.com/ command line:
            //   g++-4.8 -std=c++11 -O2 -Wall -pedantic -pthread main.cpp && ./a.out

            // if (pGlobalDummy)
                pGlobalDummy = PtrShim(e);
        }
    }
}

If the only line in the loop is accessing the element (putting the ptr into a global dummy), then it would appear that the VS2013 optimizer does something weird. When the if (pGlobalDummy) is present, both cases are the same.

Can anyone shed some light on this?

Thanks to Howard's answer I did find that adding volatile to the global dummy makes a difference, i.e. when the global dummy is volatile like so:

extern MyType* volatile pGlobalDummy;
MyType* volatile pGlobalDummy = nullptr;

The loops run a bit slower, but both variants now take exactly the same time. Should volatile make a difference here? That is, is the test even valid without volatile?
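For what it's worth, here is a minimal, self-contained sketch (the names are illustrative, not taken from the linked test) of why the volatile qualifier matters for a global sink like this: without it, the stores are not observable side effects during the loop, so every store except the last is dead and the per-element accesses can be collapsed or hoisted, whereas a volatile sink forces the store to be performed on every iteration.

#include <cstddef>

int* pPlainDummy = nullptr;              // stores to this may be coalesced away
int* volatile pVolatileDummy = nullptr;  // every store is an observable side effect

void AccessAll(int** data, std::size_t n) {
    for (std::size_t i = 0; i != n; ++i)
        pPlainDummy = data[i];       // the optimizer may keep only the last store
    for (std::size_t i = 0; i != n; ++i)
        pVolatileDummy = data[i];    // each iteration's store must be performed
}

Under the as-if rule, the non-volatile loop may be "optimized" by different amounts for the two container types, which would explain the asymmetric timings.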

Solution

I found a bug in your test, which gives permission for optimizers to optimize in different and unpredictable ways. I don't know for sure that this is impacting your results. But it sure impacted mine.

I'm using tip-of-trunk clang + libc++ -O3.

When I run your code unmodified I get:

Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 0 ms
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 0 ms
Dummy output: 500,000

I changed the output units to nanoseconds and got:

Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 32 ns
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 32 ns
Dummy output: 500,000

Suspicious, I inserted volatile here:

extern MyType* volatile pGlobalDummy;
MyType* volatile pGlobalDummy = nullptr;

but no change.

Then I noted that time[2] isn't being initialized, so I wrote:

chron::nanoseconds time[2] = {};

That did it. Now setting units back to milliseconds I get:

Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 394 ms
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 406 ms
Dummy output: 500,000

So I'm curious: if you explicitly zero your time[2], which may require:

chron::nanoseconds time[2] = {chron::nanoseconds(0), chron::nanoseconds(0)};

does this impact the results you are seeing?
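To make the failure mode concrete, here is a sketch of the kind of accumulation a timing array like this is typically used for (the harness shape and names are my assumption; the real code is at the coliru link above). With a default-initialized array, the first += reads an indeterminate value, which is undefined behaviour and is what licenses results like the 0 ms / 32 ns runs above.

#include <chrono>
#include <iostream>

namespace chron = std::chrono;   // alias as used in the answer's snippets

int main()
{
    // chron::nanoseconds time[2];     // default-initialized: indeterminate reps (UB when read)
    chron::nanoseconds time[2] = {};   // value-initialized: both slots start at 0ns

    auto start = chron::steady_clock::now();
    // ... run the container-access loop here ...
    time[0] += chron::duration_cast<chron::nanoseconds>(
        chron::steady_clock::now() - start);

    std::cout << chron::duration_cast<chron::milliseconds>(time[0]).count()
              << " ms\n";
}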

Clarification

The std::chrono::duration default constructor is specified as:

constexpr duration() = default;

This will default-initialize the duration's rep if the client does not specify list-initialization, e.g.:

chrono::nanoseconds ns;  // default-initialized

When rep is an arithmetic type, no initialization is performed ([dcl.init]/p7/b3).

If the client list-initializes, e.g.:

chrono::nanoseconds ns{};  // list-initialized

Then rep is value-initialized ([dcl.init.list]/p3/b7), and for arithmetic types, value-initialization is the same thing as zero-initialization ([dcl.init]/p8/b4).

Full working example:

#include <iostream>
#include <chrono>

int
main()
{
    std::chrono::nanoseconds n1;
    std::chrono::nanoseconds n2{};
    std::chrono::nanoseconds n3 = {};
    std::cout << "n1 = " << n1.count() << "ns\n";
    std::cout << "n2 = " << n2.count() << "ns\n";
    std::cout << "n3 = " << n3.count() << "ns\n";
}

For me, when compiled with -O0 I get:

n1 = 0ns
n2 = 0ns
n3 = 0ns

But compiling the same thing with -O3, this changes to:

n1 = 32ns
n2 = 0ns
n3 = 0ns
