SSE Intrinsics and loop unrolling

Question

I am trying to optimise some loops. I have managed to do so, but I wonder whether I have only done it partially correctly. Say, for example, that I have this loop:

for(i=0;i<n;i++){
b[i] = a[i]*2;
}

Unrolling this by a factor of 4 produces this:

int unroll = (n/4)*4;
for(i=0;i<unroll;i+=4)
{
b[i] = a[i]*2;
b[i+1] = a[i+1]*2;
b[i+2] = a[i+2]*2;
b[i+3] = a[i+3]*2;
}
for(;i<n;i++)
{
b[i] = a[i]*2;
} 

Now, is the equivalent SSE translation this:

__m128 ai_v = _mm_loadu_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai2_v = _mm_mul_ps(ai_v, two_v);
_mm_storeu_ps(&b[i], ai2_v);

or is it this:

__m128 ai_v = _mm_loadu_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai2_v = _mm_mul_ps(ai_v, two_v);
_mm_storeu_ps(&b[i], ai2_v);

__m128 ai1_v = _mm_loadu_ps(&a[i+1]);
__m128 two1_v = _mm_set1_ps(2);
__m128 ai_1_2_v = _mm_mul_ps(ai1_v, two1_v);
_mm_storeu_ps(&b[i+1], ai_1_2_v);

__m128 a_i2_v = _mm_loadu_ps(&a[i+2]);
__m128 two2_v = _mm_set1_ps(2);
__m128 ai_2_2_v = _mm_mul_ps(a_i2_v, two2_v);
_mm_storeu_ps(&b[i+2], ai_2_2_v);

__m128 ai3_v = _mm_loadu_ps(&a[i+3]);
__m128 two3_v = _mm_set1_ps(2);
__m128 ai_3_2_v = _mm_mul_ps(ai3_v, two3_v);
_mm_storeu_ps(&b[i+3], ai_3_2_v);

I am slightly confused about this section of code:

for(;i<n;i++)
{
b[i] = a[i]*2;
}

What does this do? Is it just there to handle the leftover elements, for example when the loop count is not divisible by the factor you choose to unroll by? Thank you.

Recommended answer

The answer is the first block:

    __m128 ai_v = _mm_loadu_ps(&a[i]);
    __m128 two_v = _mm_set1_ps(2);
    __m128 ai2_v = _mm_mul_ps(ai_v,two_v);
    _mm_storeu_ps(&b[i],ai2_v);

A single __m128 holds four floats, so this block already processes four elements at a time.
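To see why the second variant in the question is not the right translation, note that _mm_loadu_ps(&a[i+1]) does not load one element; it loads the four floats a[i+1] through a[i+4]. The following is a minimal sketch of that behaviour. The helper name load_lanes is made up for illustration and does not appear in the original code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE: __m128, _mm_loadu_ps, _mm_storeu_ps

// Hypothetical helper: returns the four floats that a single unaligned
// SSE load starting at src[offset] picks up.
// Precondition: src[offset] .. src[offset + 3] must all be valid.
std::array<float, 4> load_lanes(const float* src, std::size_t offset)
{
    assert(src != nullptr);
    std::array<float, 4> lanes{};
    __m128 v = _mm_loadu_ps(src + offset); // reads 4 consecutive floats
    _mm_storeu_ps(lanes.data(), v);        // writes them back out in order
    return lanes;
}
```

With a = {1, 2, ..., 10}, load_lanes(a, 0) yields 1, 2, 3, 4 and load_lanes(a, 1) yields 2, 3, 4, 5. The extra loads at i+1, i+2 and i+3 in the second variant therefore redo overlapping work, and the load at i+3 reaches as far as a[i+6], which can run past the end of the array.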

Here is the full program, with the equivalent scalar code commented out:

#include <iostream>
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ...

int main()
{
    int i{0};
    float a[10] ={1,2,3,4,5,6,7,8,9,10};
    float b[10] ={0,0,0,0,0,0,0,0,0,0};

    int n = 10;
    int unroll = (n/4)*4;
    for (i=0; i<unroll; i+=4) {
        //b[i] = a[i]*2;
        //b[i+1] = a[i+1]*2;
        //b[i+2] = a[i+2]*2;
        //b[i+3] = a[i+3]*2;
        __m128 ai_v = _mm_loadu_ps(&a[i]);
        __m128 two_v = _mm_set1_ps(2);
        __m128 ai2_v = _mm_mul_ps(ai_v,two_v);
        _mm_storeu_ps(&b[i],ai2_v);
    }

    for (; i<n; i++) {
        b[i] = a[i]*2;
    }

    for (auto i : a) { std::cout << i << "\t"; }
    std::cout << "\n";
    for (auto i : b) { std::cout << i << "\t"; }
    std::cout << "\n";

    return 0;
}
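This also addresses the question about the final loop: the SSE loop only covers the first (n/4)*4 elements, and the trailing scalar loop handles the remaining n % 4 elements. Below is a sketch that wraps the same logic in a function so it can be checked for lengths that are not multiples of four. The function name scale_by_two is an assumption for illustration; the constant is also created once before the loop here, which an optimizing compiler would typically do on its own anyway.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics

// Illustrative wrapper (not from the original answer): b[k] = a[k] * 2
// for k in [0, n), four floats per iteration plus a scalar tail.
void scale_by_two(const float* a, float* b, int n)
{
    assert(n >= 0);
    const __m128 two_v = _mm_set1_ps(2.0f); // {2, 2, 2, 2}
    const int unroll = (n / 4) * 4;         // largest multiple of 4 <= n

    int i = 0;
    for (; i < unroll; i += 4) {
        __m128 ai_v = _mm_loadu_ps(&a[i]);             // a[i..i+3]
        _mm_storeu_ps(&b[i], _mm_mul_ps(ai_v, two_v)); // b[i..i+3]
    }
    // Remainder: at most 3 elements that did not fill a whole vector.
    for (; i < n; i++) {
        b[i] = a[i] * 2;
    }
}
```

For n = 10, the vector loop handles indices 0 to 7 and the scalar tail handles indices 8 and 9.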

As for efficiency: on my system the compiler generates movups (unaligned move) instructions for this code, whereas hand-written code can be made to use movaps (aligned move) when the data is 16-byte aligned, which should be faster.
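As a small sketch of what the aligned variant requires: _mm_load_ps and _mm_store_ps need addresses that are multiples of 16 bytes. The example uses standard C++11 alignas rather than the MSVC-specific __declspec(align(16)); the helper names is_aligned_16 and scale4_aligned are hypothetical. Whether the compiler then actually emits movaps remains its own decision.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper: true if p is a multiple of 16 bytes.
bool is_aligned_16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Doubles exactly four floats using the aligned load/store intrinsics.
// Both pointers must be 16-byte aligned.
void scale4_aligned(const float* a, float* b)
{
    assert(is_aligned_16(a) && is_aligned_16(b));
    __m128 v = _mm_load_ps(a); // aligned load of a[0..3]
    _mm_store_ps(b, _mm_mul_ps(v, _mm_set1_ps(2.0f)));
}
```

A caller would declare its buffers as, for example, alignas(16) float a[4] = {1, 2, 3, 4}; before calling scale4_aligned(a, b). Passing a + 1 instead would violate the precondition, because that address is only 4 bytes past a 16-byte boundary.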

I used the following program to do some benchmarks:

#include <iostream>
#include <cstddef>     // size_t
#include <xmmintrin.h> // SSE intrinsics
//#define NO_UNROLL
//#define UNROLL
//#define SSE_UNROLL
#define SSE_UNROLL_ALIGNED

int main()
{
    const size_t array_size = 100003;
#ifdef SSE_UNROLL_ALIGNED
    __declspec(align(16)) int i{0};
    __declspec(align(16)) float a[array_size] ={1,2,3,4,5,6,7,8,9,10};
    __declspec(align(16)) float b[array_size] ={0,0,0,0,0,0,0,0,0,0};
#endif
#ifndef SSE_UNROLL_ALIGNED
    int i{0};
    float a[array_size] ={1,2,3,4,5,6,7,8,9,10};
    float b[array_size] ={0,0,0,0,0,0,0,0,0,0};
#endif

    int n = array_size;
    int unroll = (n/4)*4;


    for (size_t j{0}; j < 100000; ++j) {
#ifdef NO_UNROLL
        for (i=0; i<n; i++) {
            b[i] = a[i]*2;
        }
#endif
#ifdef UNROLL
        for (i=0; i<unroll; i+=4) {
            b[i] = a[i]*2;
            b[i+1] = a[i+1]*2;
            b[i+2] = a[i+2]*2;
            b[i+3] = a[i+3]*2;
        }
#endif
#ifdef SSE_UNROLL
        for (i=0; i<unroll; i+=4) {
            __m128 ai_v = _mm_loadu_ps(&a[i]);
            __m128 two_v = _mm_set1_ps(2);
            __m128 ai2_v = _mm_mul_ps(ai_v,two_v);
            _mm_storeu_ps(&b[i],ai2_v);
        }
#endif
#ifdef SSE_UNROLL_ALIGNED
        for (i=0; i<unroll; i+=4) {
            __m128 ai_v = _mm_load_ps(&a[i]);
            __m128 two_v = _mm_set1_ps(2);
            __m128 ai2_v = _mm_mul_ps(ai_v,two_v);
            _mm_store_ps(&b[i],ai2_v);
        }
#endif
#ifndef NO_UNROLL
        for (; i<n; i++) {
            b[i] = a[i]*2;
        }
#endif
    }

    //for (auto i : a) { std::cout << i << "\t"; }
    //std::cout << "\n";
    //for (auto i : b) { std::cout << i << "\t"; }
    //std::cout << "\n";

    return 0;
}
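The listing above does not show how the times were measured. One possible way, assuming a std::chrono-based stopwatch, is sketched here; the helper time_seconds is hypothetical and not part of the original benchmark.

```cpp
#include <chrono>

// Hypothetical helper: runs work() once and returns the elapsed
// wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double time_seconds(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

The outer j loop of the benchmark could be wrapped in a lambda and passed to time_seconds. It is also worth using the contents of b afterwards (for instance by printing one element), since an optimizer may otherwise discard work whose result is never read.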

I got the following results (x86):

  • NO_UNROLL: 0.994 seconds, no SSE chosen by compiler
  • UNROLL: 3.511 seconds, uses movups
  • SSE_UNROLL: 3.315 seconds, uses movups
  • SSE_UNROLL_ALIGNED: 3.276 seconds, uses movaps

So it is clear that manually unrolling the loop has not helped in this case: the plain loop is more than three times faster. Even ensuring that we use the more efficient movaps doesn't help much.

But I got an even stranger result when compiling for 64-bit (x64):

  • NO_UNROLL: 1.138 seconds, no SSE chosen by compiler
  • UNROLL: 1.409 seconds, no SSE chosen by compiler
  • SSE_UNROLL: 1.420 seconds, still no SSE chosen by compiler!
  • SSE_UNROLL_ALIGNED: 1.476 seconds, still no SSE chosen by compiler!

It seems MSVC sees through the hand-written variants and generates better assembly regardless, albeit still slower than if we had not tried any hand optimization at all.
