SSE-copy, AVX-copy and std::copy performance

Views: 326  Published: 2016/10/16 14:32:43  Tags: c++ performance sse simd avx


Question

I tried to improve the performance of a copy operation using SSE and AVX:

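    #include <algorithm>   // std::generate, std::copy
    #include <chrono>      // std::chrono timers
    #include <immintrin.h> // SSE/AVX intrinsics, _mm_malloc
    #include <iostream>    // std::cout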

    const int sz = 1024;
    // 32-byte alignment: _mm256_load_ps/_mm256_store_ps require it (16 is only enough for SSE)
    float *mas = (float *)_mm_malloc(sz*sizeof(float), 32);
    float *tar = (float *)_mm_malloc(sz*sizeof(float), 32);
    float a=0;
    std::generate(mas, mas+sz, [&](){return ++a;});

    const int nn = 1000;//Number of iteration in tester loops    
    std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3; 

    //std::copy testing
    start1 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
        std::copy(mas, mas+sz, tar);
    end1 = std::chrono::system_clock::now();
    float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();

    //SSE-copy testing
    start2 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=4, _tar+=4)
        {
           __m128 buffer = _mm_load_ps(_mas);
           _mm_store_ps(_tar, buffer);
        }
    }
    end2 = std::chrono::system_clock::now();
    float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();

    //AVX-copy testing
    start3 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=8, _tar+=8)
        {
           __m256 buffer = _mm256_load_ps(_mas);
           _mm256_store_ps(_tar, buffer);
        }
    }
    end3 = std::chrono::system_clock::now();
    float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();

    std::cout<<"serial - "<<elapsed1<<", SSE - "<<elapsed2<<", AVX - "<<elapsed3<<"\nSSE gain: "<<elapsed1/elapsed2<<"\nAVX gain: "<<elapsed1/elapsed3;

    _mm_free(mas);
    _mm_free(tar);

It works. However, as the number of iterations in the tester loops (nn) increases, the performance gain of the SIMD copy decreases:

nn=10: SSE-gain=3, AVX-gain=6;

nn=100: SSE-gain=0.75, AVX-gain=1.5;

nn=1000: SSE-gain=0.55, AVX-gain=1.1;

Can anybody explain the reason for this drop in performance gain, and is it advisable to vectorize a copy operation manually?

Solution

The problem is that your test does a poor job of mitigating some hardware factors that make benchmarking hard. To test this, I made my own test case. Something like this:

    for blah blah:
        sleep(500ms)
        std::copy
        sse
        avx

Output:

SSE: 1.11753x faster than std::copy
AVX: 1.81342x faster than std::copy

So in this case, AVX is quite a bit faster than std::copy. What happens when I change the test case to this?

    for blah blah:
        sleep(500ms)
        sse
        avx
        std::copy

Notice that absolutely nothing changed, except the order of the tests.

SSE: 0.797673x faster than std::copy
AVX: 0.809399x faster than std::copy

Whoa! How is that possible? A ratio below 1 means that both SIMD versions now look slower than std::copy, even though the code is identical. The CPU takes a while to ramp up to full speed, so tests that run later have an advantage. This question has 3 answers now, including an 'accepted' answer, but only the one with the fewest upvotes was on the right track.

This is one of the reasons why benchmarking is hard, and why you should never trust anyone's micro-benchmarks unless they include detailed information about their setup. It isn't just the code that can go wrong: power-saving features and weird drivers can completely mess up your benchmark. I once measured a factor-of-7 difference in performance by toggling a BIOS switch that fewer than 1% of notebooks offer.
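
One way to reduce the ordering and ramp-up effect described above is to run untimed warm-up passes first, rotate the order of the candidates between rounds, and report a median instead of a single total. The following is only a minimal sketch of that idea, not the harness used for the numbers above: the names `Candidate`, `median` and `bench` are hypothetical, and the use of `std::chrono::steady_clock` (instead of the question's `system_clock`) is an assumption about what suits interval timing.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: one candidate to measure (a label plus the code to run).
struct Candidate {
    std::string name;
    std::function<void()> run;
};

// Median of the samples (upper median for an even count, 0 for no samples).
inline double median(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}

// Returns the median time in microseconds of one timed batch (`inner` calls)
// per candidate, in the same order as `candidates`.
// - Every candidate first runs `warmup` untimed passes, so the CPU clock and
//   caches have had a chance to settle before the first timed sample.
// - The starting candidate rotates every round, so no candidate is always
//   measured first or always measured last.
inline std::vector<double> bench(const std::vector<Candidate>& candidates,
                                 int warmup, int rounds, int inner) {
    assert(warmup >= 0 && rounds > 0 && inner > 0);
    using clock = std::chrono::steady_clock;  // monotonic clock for intervals
    const std::size_t n = candidates.size();
    std::vector<std::vector<double>> samples(n);

    // Untimed warm-up passes.
    for (const auto& c : candidates)
        for (int i = 0; i < warmup; ++i) c.run();

    // Timed rounds with a rotating start position.
    for (int r = 0; r < rounds; ++r) {
        for (std::size_t k = 0; k < n; ++k) {
            const std::size_t idx = (k + static_cast<std::size_t>(r)) % n;
            const auto t0 = clock::now();
            for (int i = 0; i < inner; ++i) candidates[idx].run();
            const auto t1 = clock::now();
            samples[idx].push_back(
                std::chrono::duration<double, std::micro>(t1 - t0).count());
        }
    }

    std::vector<double> medians;
    medians.reserve(n);
    for (auto& s : samples) medians.push_back(median(std::move(s)));
    return medians;
}
```

To use it, wrap `std::copy` and the SSE and AVX loops from the question in lambdas, pass them as `Candidate` entries, and compare the returned medians. Warm-up and rotation only reduce the bias; power-saving settings can still skew the result, so the setup should still be reported alongside the numbers.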
