Calculating matrix product is much slower with SSE than with the straightforward algorithm

Problem Description

I want to multiply two matrices, once using the straightforward algorithm:

template <typename T>
void multiplicate_straight(T ** A, T ** B, T ** C, int sizeX)
{
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    transpose_matrix(B, D,sizeX);
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            for(int g = 0; g < sizeX; g++)
            {
                C[i][j] += A[i][g]*D[j][g];
            }
        }
    }
    FreeDynamicArray2D<T>(D);
}

and once using SSE functions. For this I created two functions:

template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{

    __m128d a;
    __m128d b;
    __m128d c;
#ifdef linux
    double A2[2], B2[2], C[2] __attribute__ ((aligned(16)));
#endif
#ifdef _WIN32
    __declspec(align(16)) double A2[2], B2[2], C[2];
#endif
    for(int i = 0; i < size; i+=2)
    {
        //std::cout << "In SSE_vectormult: i is: " << i << '\n';
        A2[0] = A[i];
        B2[0] = B[i];
        A2[1] = A[i+1];
        B2[1] = B[i+1];
        //std::cout << "Values from A and B written to A2 and B2\n";
        a = _mm_load_pd(A2);
        b = _mm_load_pd(B2);
        //std::cout << "Values converted to a and b\n";
        c = _mm_mul_pd(a,b);
        _mm_store_pd(C, c);
        A[i] = C[0];
        A[i+1] = C[1];
    };
}

template <typename T>
void multiplicate_SSE(T ** A, T ** B, T ** C, int sizeX)
{
//    std::cout << "Entered SSE-Function\n";
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    T * tmp = AllocateDynamicArray1D<T>(sizeX);
    T * tmp2 = AllocateDynamicArray1D<T>(sizeX);
    //std::cout << "Matrices allocated\n";
    transpose_matrix<T>(B, D,sizeX);
    //std::cout << "Matrix B transposed\n";
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            extract_row<T>(A,tmp, i, sizeX);
//            std::cout << "Row from A extracted\n";
            //print_vector(tmp, sizeX);
            extract_row<T>(D, tmp2, j, sizeX);
//            std::cout << "Row from D extracted\n";
            //print_vector(tmp2, sizeX);
            SSE_vectormult<T>(tmp, tmp2, sizeX);
//            std::cout << "Vectors multiplicated\n";
            //print_vector(tmp, sizeX);
            C[i][j] = add_vector(tmp, sizeX);
//            std::cout << "Written value to C\n";
//            std::cout << "j is " << j << " and i is " << i << '\n';
        }
    }
//    std::cout << "Loop finished\n";
    FreeDynamicArray2D<T>(D);
    //std::cout << "Freed D\n";
    //FreeDynamicArray1D<T>(tmp);????
//    std::cout << "Freed tmp\n";
    FreeDynamicArray1D<T>(tmp2);
//    std::cout << "Everything freed, returning\n";
}

But then I get several problems: on the one hand, when I free the tmp array in multiplicate_SSE() (the line marked with several question marks), I get the error "_BLOCK_TYPE_IS_VALID". I suspected I was freeing the same memory twice, so I commented that line out (but I suppose this gives me a memory leak?). And when I compare the performance of the two functions on identical matrices, the SSE version takes about four times longer than the straightforward method for two 1024x1024 matrices.
How can I rewrite my SSE function to get better performance (I have never worked with SSE before), and how can I fix the memory leak?
Thank you!

Recommended Answer

You have the right idea by taking the transpose in the scalar code but you don't want exactly the transpose when using SSE.

Let's stick to float (SGEMM). What you want to do with SSE is compute four dot products at once. You want C = A*B. Let's look at an 8x8 matrix. Let's assume B is:

(0   1  2  3) ( 4  5  6  7)
(8   9 10 11) (12 13 14 15) 
(16 17 18 19) (20 21 22 23)
(24 25 26 27) (28 29 30 31)
(32 33 34 35) (36 37 38 39)
(40 41 42 43) (44 45 46 47)
(48 49 50 51) (52 53 54 55)
(56 57 58 59) (60 61 62 63)

So with SSE you do:

C[0][0] C[0][1] C[0][2] C[0][3] = 
A[0][0]*(0 1 2 3) + A[0][1]*(8 9 10 11) + A[0][2]*(16 17 18 19)...+ A[0][7]*(56 57 58 59)
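
In intrinsics, each term of that sum is a broadcast followed by a multiply and an add. A minimal sketch for one value of g (acc is a hypothetical __m128 accumulator holding the running C[0][0..3], and the rows of B are assumed 16-byte aligned):

__m128 a = _mm_set1_ps(A[0][g]);            //broadcast A[0][g] to all four lanes
__m128 b = _mm_load_ps(&B[g][0]);           //first four floats of row g of B
acc = _mm_add_ps(acc, _mm_mul_ps(a, b));    //running total for C[0][0..3]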

That gets you four dot products at once. The problem is that you have to move down a column of B, and those values are not in the same cache line. It would be better if each column of width four were contiguous in memory. So instead of taking the full transpose, you transpose strips with a width of 4, like this:

(0  1  2  3)( 8  9 10 11)(16 17 18 19)(24 25 26 27)(32 33 34 35)(40 41 42 43)(48 49 50 51)(56 57 58 59)
(4  5  6  7)(12 13 14 15)(20 21 22 23)(28 29 30 31)(36 37 38 39)(44 45 46 47)(52 53 54 55)(60 61 62 63)

If you think of each group of four values in parentheses as one unit, this is equivalent to transposing an 8x2 matrix to a 2x8 matrix. Notice that the columns of width four of B are now contiguous in memory. This is far more cache friendly. For an 8x8 matrix this is not really an issue, but for example with a 1024x1024 matrix it is. See the code below for how to do this. For AVX you transpose strips of width 8 (which means you have nothing to do for an 8x8 matrix). For double the strip width is two with SSE and four with AVX.
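
As a rough sketch of the AVX case (assuming D is the strip-transposed B as in the code below, allocated 32-byte aligned, and sizeX is a multiple of 8), all eight dot products fit in one register:

#include <immintrin.h> //AVX intrinsics

//hypothetical inner loop for one (i,j) position, strip_width = 8
__m256 out = _mm256_setzero_ps();
for(int g = 0; g < sizeX; g++) {
    __m256 a = _mm256_set1_ps(A[i*sizeX + g]);     //broadcast A[i][g]
    __m256 d = _mm256_load_ps(&D[j*sizeX + 8*g]);  //eight floats of one strip row
    out = _mm256_add_ps(out, _mm256_mul_ps(a, d)); //accumulate C[i][j..j+7]
}
_mm256_store_ps(&C[i*sizeX + j], out);             //C assumed 32-byte aligned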

This should be four times faster than the scalar code assuming the matrices fit in the cache. However, for large matrices this method is still going to be memory bound and so your SSE code may not be much faster than scalar code (but it should not be worse).

However, if you use loop tiling and rearrange the matrix in tiles that fit in the L2 cache, rather than operating on the whole matrix at once, matrix multiplication becomes compute bound instead of memory bound, even for very large matrices that don't fit in the L3 cache. That is another topic, but a rough sketch follows below.
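
A minimal (untested) sketch of that tiling idea, assuming sizeX is a multiple of the hypothetical tile size BLOCK and C is zeroed on entry:

void SGEMM_tiled(const float *A, const float *B, float *C, const int sizeX) {
    const int BLOCK = 64; //tile size, tune so three BLOCKxBLOCK tiles fit in L2
    for(int i0 = 0; i0 < sizeX; i0 += BLOCK)
    for(int j0 = 0; j0 < sizeX; j0 += BLOCK)
    for(int g0 = 0; g0 < sizeX; g0 += BLOCK)
        //multiply one tile of A with one tile of B; all three
        //tiles stay hot in cache for BLOCK^3 operations
        for(int i = i0; i < i0 + BLOCK; i++)
        for(int j = j0; j < j0 + BLOCK; j++) {
            float sum = C[i*sizeX + j];
            for(int g = g0; g < g0 + BLOCK; g++)
                sum += A[i*sizeX + g] * B[g*sizeX + j];
            C[i*sizeX + j] = sum;
        }
}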

Here is some (untested) code to compare with your scalar code, using SSE intrinsics directly. I unrolled the loop by 2.

#include <xmmintrin.h> //SSE intrinsics and _mm_malloc/_mm_free

void transpose_matrix_strip(const float* A, float* B, const int N, const int strip_width);

//assumes sizeX is a multiple of strip_width and C is 16-byte aligned
void SGEMM_SSE(const float *A, const float *B, float *C, const int sizeX) {
    const int simd_width = 4;
    const int unroll = 2;
    const int strip_width = simd_width*unroll;
    float *D = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
    transpose_matrix_strip(B, D, sizeX, strip_width); //transpose B in strips of width eight
    for(int i = 0; i < sizeX; i++) {
        for(int j = 0; j < sizeX; j += strip_width) {
            __m128 out_v1 = _mm_setzero_ps(); //accumulates C[i][j..j+3]
            __m128 out_v2 = _mm_setzero_ps(); //accumulates C[i][j+4..j+7]
            //now calculate eight dot products
            for(int g = 0; g < sizeX; g++) {
                __m128 a = _mm_set1_ps(A[i*sizeX + g]); //broadcast A[i][g]
                //load eight values from D into two SSE registers
                __m128 vec4_1 = _mm_load_ps(&D[j*sizeX + strip_width*g]);
                __m128 vec4_2 = _mm_load_ps(&D[j*sizeX + strip_width*g + simd_width]);
                out_v1 = _mm_add_ps(out_v1, _mm_mul_ps(a, vec4_1));
                out_v2 = _mm_add_ps(out_v2, _mm_mul_ps(a, vec4_2));
            }
            //store eight dot products into C
            _mm_store_ps(&C[i*sizeX + j], out_v1);
            _mm_store_ps(&C[i*sizeX + j + simd_width], out_v2);
        }
    }
    _mm_free(D);
}

void transpose_matrix_strip(const float* A, float* B, const int N, const int strip_width) {
    //#pragma omp parallel for
    for(int n = 0; n < N*N; n++) {
        int k = strip_width*(n/N/strip_width); //first column of the strip this element lands in
        int i = (n/strip_width)%N;             //source row in A
        int j = n%strip_width;                 //column offset within the strip
        B[n] = A[N*i + k + j];
    }
}
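
A minimal usage sketch (hypothetical: the matrices are flat arrays here, sizeX must be a multiple of strip_width, and the buffers come from _mm_malloc so the aligned loads and stores are valid):

int main() {
    const int sizeX = 1024;
    float *A = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
    float *B = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
    float *C = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
    for(int n = 0; n < sizeX*sizeX; n++) { A[n] = 1.0f; B[n] = 2.0f; C[n] = 0.0f; }
    SGEMM_SSE(A, B, C, sizeX);
    //every entry of C should now equal 2.0f*sizeX
    _mm_free(A); _mm_free(B); _mm_free(C);
    return 0;
}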

Notice that j now increments by eight. More unrolling may help. The intrinsics used are _mm_load_ps, _mm_store_ps, _mm_set1_ps (for the broadcast, e.g. _mm_set1_ps(A[i*sizeX + g])), _mm_add_ps, and _mm_mul_ps. That's it.
