Performance worsens when using SSE (simple addition of integer arrays)

Question

I'm trying to use SSE intrinsics to add two arrays of 32-bit signed integers, but I'm getting very poor performance compared to a plain scalar loop.

Platform - Intel Core i3 550, GCC 4.4.3, Ubuntu 10.04 (a bit old, yeah)

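#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_set_epi32, _mm_add_epi32 */
#include <stdint.h>

typedef int32_t sint32_t;  /* assumption: this typedef is not shown in the original post */

#define ITER 1000
typedef union sint4_u {
        __m128i v;
        sint32_t x[4];
} sint4;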

Functions:

void compute(sint32_t *a, sint32_t *b, sint32_t *c) {
        sint32_t len = 96000;
        sint32_t i, j;

        __m128i x __attribute__ ((aligned(16)));
        __m128i y __attribute__ ((aligned(16)));
        sint4 z;

        for(j = 0; j < ITER; j++) {
                for(i = 0; i < len; i += 4) {
                        x = _mm_set_epi32(a[i + 0], a[i + 1], a[i + 2], a[i + 3]);
                        y = _mm_set_epi32(b[i + 0], b[i + 1], b[i + 2], b[i + 3]);
                        z.v = _mm_add_epi32(x, y); 
                        c[i + 0] = z.x[3];
                        c[i + 1] = z.x[2];
                        c[i + 2] = z.x[1];
                        c[i + 3] = z.x[0];
                }   
        }   

        return;
}

void compute_s(sint32_t *a, sint32_t *b, sint32_t *c) {
        sint32_t len = 96000;
        sint32_t i, j;
        for(j = 0; j < ITER; j++) {
                for(i = 0; i < len; i++) {
                        c[i] = a[i] + b[i];
                }   
        }   
        return;
}

Results:

➜  C  gcc -msse4.2 simd.c
➜  C  ./a.out            
Time Elapsed (SSE): 612.520000 mS
Time Elapsed (Scalar): 401.713000 mS
➜  C  gcc -O3 -msse4.2 simd.c
➜  C  ./a.out                
Time Elapsed (SSE): 135.124000 mS
Time Elapsed (Scalar): 46.438000 mS

With -O3, the SSE version is about 3 times slower than the scalar one (!!). What am I doing wrong? Even if I skip storing the result back to c in compute, it still takes an extra 100 ms without any optimizations.

Edit - As suggested in the comments, I replaced _mm_set with _mm_load. Here are the updated timings:

➜  C    gcc audproc.c -msse4    
➜  C    ./a.out             
Time Elapsed (SSE): 303.931000 mS
Time Elapsed (Scalar): 413.701000 mS
➜  C    gcc -O3 audproc.c -msse4
➜  C    ./a.out                 
Time Elapsed (SSE): 82.532000 mS
Time Elapsed (Scalar): 48.104000 mS

Much better, but still nowhere close to the theoretical 4x gain. Also, why is my vectorized version slower than the scalar one at -O3? And how do I get rid of this warning? (I tried adding __vector__ to my declaration but got more warnings instead. :( )

audproc.c: In function ‘compute’:
audproc.c:54: warning: passing argument 1 of ‘_mm_load_si128’ from incompatible pointer type
/usr/lib/gcc/i486-linux-gnu/4.4.3/include/emmintrin.h:677: note: expected ‘const long long int __vector__ *’ but argument is of type ‘const sint32_t *’

Recommended answer

As already mentioned in the comments, to get the performance benefit of SIMD you should avoid scalar operations in your loop, i.e. get rid of the _mm_set_epi32 pseudo-intrinsics and the union used for storing SIMD results. Here is a fixed version of your function:

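void compute(const sint32_t *a, const sint32_t *b, sint32_t *c)
{
    sint32_t len = 96000;
    sint32_t i, j;

    for(j = 0; j < ITER; j++)
    {
        for(i = 0; i < len; i += 4)
        {
            /* Unaligned 128-bit loads: four 32-bit ints at a time, straight from memory.
               The casts keep const, since a and b are pointers to const data. */
            __m128i x = _mm_loadu_si128((const __m128i *)&a[i]);
            __m128i y = _mm_loadu_si128((const __m128i *)&b[i]);
            __m128i z = _mm_add_epi32(x, y);
            /* Store all four sums at once; no union and no per-element copies. */
            _mm_storeu_si128((__m128i *)&c[i], z);
        }
    }
}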
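
To check that a vectorized loop like this still produces the same results as the scalar one, a small self-contained sketch along the following lines may help. It is only an illustration under a few assumptions: the names add_scalar, add_sse and add_sse_self_check and the explicit len parameter are hypothetical (they do not appear in the original post), int32_t stands in for sint32_t, and a scalar tail loop is added so that lengths that are not a multiple of 4 are handled. The pointer casts match the parameter types of the load and store intrinsics, which should also avoid the incompatible-pointer warning shown in the question.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_add_epi32, _mm_storeu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Scalar reference implementation (hypothetical helper name). */
void add_scalar(const int32_t *a, const int32_t *b, int32_t *c, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        c[i] = a[i] + b[i];
}

/* SSE version: four 32-bit additions per iteration, then a scalar tail. */
void add_sse(const int32_t *a, const int32_t *b, int32_t *c, size_t len)
{
    size_t i = 0;

    for (; i + 4 <= len; i += 4) {
        /* Unaligned loads, so the arrays need no special alignment. */
        __m128i x = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i y = _mm_loadu_si128((const __m128i *)&b[i]);
        _mm_storeu_si128((__m128i *)&c[i], _mm_add_epi32(x, y));
    }

    /* Remaining 0..3 elements when len is not a multiple of 4. */
    for (; i < len; i++)
        c[i] = a[i] + b[i];
}

/* Compares both versions on 10 elements (deliberately not a multiple of 4). */
void add_sse_self_check(void)
{
    int32_t a[10], b[10], c_sse[10], c_ref[10];
    size_t i;

    for (i = 0; i < 10; i++) {
        a[i] = (int32_t)i - 5;
        b[i] = (int32_t)(1000 * i);
    }

    add_sse(a, b, c_sse, 10);
    add_scalar(a, b, c_ref, 10);

    for (i = 0; i < 10; i++)
        assert(c_sse[i] == c_ref[i]);
}
```

Calling add_sse_self_check() from main before running any timing is one way to make sure the benchmark compares equivalent code. As for the -O3 timings, keep in mind that GCC enables auto-vectorization (-ftree-vectorize) at that level, so the scalar loop may already be compiled to SSE instructions; if so, hand-written intrinsics would have little left to gain over it.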
