Manually vectorized code 10x slower than auto optimized - what did I do wrong?


Question

I'm trying to learn how to exploit vectorization with gcc. I followed this tutorial by Erik Holk (with source code here).

I just modified it to use double. I used this dot product to multiply randomly generated square 1200x1200 matrices of doubles (300x300 double4). I checked that the results are the same. What really surprised me is that the simple dot product was actually about 10x faster than my manually vectorized one.

Maybe double4 is too big for SSE (would it need AVX2?). But I would expect that even when gcc cannot find a suitable instruction for handling a double4 at once, it would still be able to exploit the explicit information that the data come in big chunks for auto-vectorization.

Details:

The results are:

dot_simple:
time elapsed 1.90000 [s] for 1.728000e+09 evaluations => 9.094737e+08 [ops/s]

dot_SSE:
time elapsed 15.78000 [s] for 1.728000e+09 evaluations => 1.095057e+08 [ops/s]

I used gcc 4.6.3 on an Intel® Core™ i5 CPU 750 @ 2.67GHz × 4 with the options -std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math, or with just -O2 (the result was the same).

I did it using python/scipy.weave() for convenience, but I hope that doesn't change anything.

Code:

double dot_simple(  int n, double *a, double *b ){
    double dot = 0;
    for (int i=0; i<n; i++){ 
        dot += a[i]*b[i];
    }
    return dot;
}

and the one explicitly using gcc vector extensions:

double dot_SSE(  int n, double *a, double *b ){
    const int VECTOR_SIZE = 4;
    typedef double double4 __attribute__ ((vector_size (sizeof(double) * VECTOR_SIZE)));
    double4 sum4 = {0};
    double4* a4 = (double4 *)a;
    double4* b4 = (double4 *)b;
    for (int i=0; i<n; i++){ 
        sum4 += *a4 * *b4 ;
        a4++; b4++;
        //sum4 += a4[i] * b4[i];
    }
    union {  double4 sum4_; double sum[VECTOR_SIZE]; };
    sum4_ = sum4;
    return sum[0]+sum[1]+sum[2]+sum[3];
}

Then I used it to multiply 300x300 random matrices (in units of double4) to measure performance:

void mmul( int n, double* A, double* B, double* C ){
    int n4 = n*4;
    for (int i=0; i<n4; i++){
        for (int j=0; j<n4; j++){
            double* Ai = A + n4*i;
            double* Bj = B + n4*j;
            C[ i*n4 + j ] =  dot_SSE( n, Ai, Bj );
            //C[ i*n4 + j ] =  dot_simple( n4, Ai, Bj );
            ijsum++;
        }
    }
}

Weave code:

def mmul_2(A, B, C, __force__=0 ):
    code = r'''     mmul( NA[0]/4, A, B, C );            '''
    weave_options = {
    'extra_compile_args': ['-std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math'],
    'compiler' : 'gcc', 'force' : __force__ }
    return weave.inline(code, ['A','B','C'], verbose=3, headers=['"vectortest.h"'],include_dirs=['.'], **weave_options )

Recommended answer

One of the main problems is that in your function dot_SSE you loop over n items when you should only loop over n/2 items with SSE, where each vector holds two doubles (or n/4 with AVX, where each vector holds four).

To fix this with GCC's vector extensions you can do this:

double dot_double2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/2; i++) {
        sum2 += a2[i]*b2[i];
    }
    double dot = sum2[0] + sum2[1];
    for(i*=2;i<n; i++) dot +=a[i]*b[i]; 
    return dot;
}

The other problem with your code is that it has a dependency chain: every addition into the accumulator has to wait for the previous one to finish. Your CPU can do an SSE addition and an SSE multiplication simultaneously, but only on independent data paths. To fix this you need to unroll the loop with separate accumulators. The following code unrolls the loop by 2 (but you probably need to unroll by three for the best results).

double dot_double2_unroll2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2_v1 = {};
    double2 sum2_v2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/4; i++) {       
        sum2_v1 += a2[2*i+0]*b2[2*i+0];
        sum2_v2 += a2[2*i+1]*b2[2*i+1];
    }
    double dot = sum2_v1[0] + sum2_v1[1] + sum2_v2[0] + sum2_v2[1];
    for(i*=4;i<n; i++) dot +=a[i]*b[i]; 
    return dot;
}
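
As a hedged sketch of the "unroll by three" suggestion, the variant below keeps three independent double2 accumulators. The name dot_double2_unroll3 is hypothetical and the const qualifiers are an addition; like the versions above, it assumes a and b are 16-byte aligned. Whether three chains is really the best choice depends on the CPU, so it is worth measuring.

```c
typedef double double2 __attribute__ ((vector_size (16)));

/* Hypothetical variant: three independent accumulator chains.
   Assumes a and b are 16-byte aligned, as in the versions above. */
double dot_double2_unroll3(int n, const double *a, const double *b) {
    double2 s0 = {0.0, 0.0};
    double2 s1 = {0.0, 0.0};
    double2 s2 = {0.0, 0.0};
    const double2 *a2 = (const double2 *)a;
    const double2 *b2 = (const double2 *)b;
    int i;
    /* Each iteration consumes 3 vectors = 6 doubles. */
    for (i = 0; i < n/6; i++) {
        s0 += a2[3*i+0] * b2[3*i+0];
        s1 += a2[3*i+1] * b2[3*i+1];
        s2 += a2[3*i+2] * b2[3*i+2];
    }
    /* Combine the chains, then sum the two lanes. */
    double2 s = s0 + s1 + s2;
    double dot = s[0] + s[1];
    /* Scalar tail for the remaining n % 6 elements. */
    for (i *= 6; i < n; i++) dot += a[i]*b[i];
    return dot;
}
```

The three sums do not depend on each other inside the loop, so the additions can overlap instead of waiting on a single running total.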

Here is a version using double4, which I think is really what you wanted with your original dot_SSE function. It's ideal for AVX (though it still needs to be unrolled), but it will work with SSE2 as well. In fact, with SSE it seems GCC breaks it into two chains, which effectively unrolls the loop by 2.

double dot_double4(int n, double *a, double *b ) {
    typedef double double4 __attribute__ ((vector_size (32)));
    double4 sum4 = {};
    int i;
    double4* a4 = (double4*)a;
    double4* b4 = (double4*)b;
    for(i=0; i<n/4; i++) {       
        sum4 += a4[i]*b4[i];
    }
    double dot = sum4[0] + sum4[1] + sum4[2] + sum4[3];
    for(i*=4;i<n; i++) dot +=a[i]*b[i]; 
    return dot;
}
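
One caveat with these casts: GCC vector types are aligned to their own size by default, so dereferencing a double4* assumes the data is 32-byte aligned. If that cannot be guaranteed, a possible workaround (a sketch, not part of the original answer) is to load each chunk with memcpy, which the compiler is free to turn into unaligned vector loads. The name dot_double4_unaligned is hypothetical.

```c
#include <string.h>

typedef double double4 __attribute__ ((vector_size (32)));

/* Hypothetical variant that makes no alignment assumption about a and b. */
double dot_double4_unaligned(int n, const double *a, const double *b) {
    double4 sum4 = {0.0, 0.0, 0.0, 0.0};
    int i;
    /* Process full chunks of 4 doubles. */
    for (i = 0; i + 4 <= n; i += 4) {
        double4 va, vb;
        memcpy(&va, a + i, sizeof va);  /* alignment-safe load */
        memcpy(&vb, b + i, sizeof vb);
        sum4 += va * vb;
    }
    double dot = sum4[0] + sum4[1] + sum4[2] + sum4[3];
    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; i++) dot += a[i]*b[i];
    return dot;
}
```

For example, calling it on a + 1 and b + 1 (pointers that are only 8-byte aligned) is fine here, whereas the cast-based version would rely on undefined behaviour.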

If you compile this with FMA enabled (for example with -mfma), it will generate FMA3 instructions. I tested all these functions here (you can edit and compile the code yourself as well): http://coliru.stacked-crooked.com/a/273268902c76b116

Note that using SSE/AVX for a single dot product in matrix multiplication is not the optimal use of SIMD. For double-precision floating point you should do two dot products at once with SSE (four with AVX).
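
A minimal sketch of what "two dot products at once" could look like with double2 follows. It is an illustration under assumptions, not the answerer's code: the name mmul_two_at_once is hypothetical, it computes the standard product C = A*B for row-major n x n matrices (unlike the question's mmul, which takes dot products of rows of A with rows of B), and it assumes n is even. Each SIMD lane accumulates a different element of C, so no horizontal sum is needed at the end.

```c
#include <stddef.h>
#include <string.h>

typedef double double2 __attribute__ ((vector_size (16)));

/* Hypothetical sketch: C = A * B, row-major n x n, n assumed even.
   Lane 0 accumulates C[i][j], lane 1 accumulates C[i][j+1]. */
void mmul_two_at_once(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 2) {
            double2 acc = {0.0, 0.0};
            for (int k = 0; k < n; k++) {
                double aik = A[(size_t)i*n + k];
                double2 a = {aik, aik};          /* broadcast A[i][k] */
                double2 b;
                /* Load B[k][j], B[k][j+1] without assuming alignment. */
                memcpy(&b, &B[(size_t)k*n + j], sizeof b);
                acc += a * b;
            }
            /* Store both results; no horizontal add required. */
            memcpy(&C[(size_t)i*n + j], &acc, sizeof acc);
        }
    }
}
```

The same pattern extends to four columns per step with a 32-byte vector type when AVX is available. The single accumulator here still forms one dependency chain, so the unrolling advice above applies to this loop as well.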
