Matrix-Vector and Matrix-Matrix multiplication using SSE


Question


I need to write matrix-vector and matrix-matrix multiplication functions but I cannot wrap my head around SSE commands.

The dimensions of matrices and vectors are always multiples of 4.

I managed to write the vector-vector multiplication function that looks like this:

void vector_multiplication_SSE(float* m, float* n, float* result, unsigned const int size)
{
    int i;

    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_n = (__m128*)n;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < size / 4; ++i)
        p_result[i] = _mm_mul_ps(p_m[i], p_n[i]);

    // print the result
    for (int i = 0; i < size; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
}

and now I'm trying to implement matrix-vector multiplication.

Here's what I have so far:

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    int i, j;

    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_v = (__m128*)v;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < vector_dims; i += 4)
    {
        __m128 tmp = _mm_load_ps(&result[i]);
        __m128 p_m_tmp = _mm_load_ps(&m[i]);

        tmp = _mm_add_ps(tmp, _mm_mul_ps(tmp, p_m_tmp));
        _mm_store_ps(&result[i], tmp);

        // another for loop here? 
    }

    // print the result
    for (int i = 0; i < vector_dims; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
}

This function looks completely wrong. Not only does it not work correctly, it also seems that I'm moving in the wrong direction.


Could anyone help me with implementing vector-matrix and matrix-matrix multiplication? I'd really appreciate some example code and a very detailed explanation.

Update

Here's my attempt number 2:

It fails with an access-violation exception on a read, but it still feels closer.

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    int i, j;

    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_v = (__m128*)v;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < vector_dims; ++i)
    {
        p_result[i] = _mm_mul_ps(_mm_load_ps(&m[i]), _mm_load_ps1(&v[i]));
    }

    // print the result
    for (int i = 0; i < vector_dims; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
}

Update 2

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    int i, j;
    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_v = (__m128*)v;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < vector_dims; ++i)
    {
        for (j = 0; j < vector_dims * vector_dims / 4; ++j)
        {
            p_result[i] = _mm_mul_ps(p_v[i], p_m[j]);
        }
    }

    for (int i = 0; i < vector_dims; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
    cout << endl;
}

Solution

Without any tricks or anything, a matrix-vector multiplication is just a bunch of dot products between the vector and a row of the matrix. Your code doesn't really have that structure. Writing it actually as dot products (not tested):

for (int row = 0; row < nrows; ++row) {
    __m128 acc = _mm_setzero_ps();
    // I'm just going to assume the number of columns is a multiple of 4
    for (int col = 0; col < ncols; col += 4) {
        __m128 vec = _mm_load_ps(&v[col]);
        // don't forget it's a matrix, do 2d addressing
        __m128 mat = _mm_load_ps(&m[col + ncols * row]);
        acc = _mm_add_ps(acc, _mm_mul_ps(mat, vec));
    }
    // now we have 4 floats in acc and they have to be summed
    // can use two horizontal adds for this, they kind of suck but this
    // isn't the inner loop anyway.
    acc = _mm_hadd_ps(acc, acc);
    acc = _mm_hadd_ps(acc, acc);
    // store result, which is a single float
    _mm_store_ss(&result[row], acc);
}
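For comparison, the same dot-product structure in plain scalar code (a hypothetical helper, handy for validating the SSE version against; results can differ in the last bits because the SSE version sums in a different order):

```cpp
#include <cassert>

// Scalar reference: result[row] = dot(m[row, :], v),
// with the same row-major 2D addressing as the SSE loop above.
void matvec_reference(const float* m, const float* v, float* result,
                      int nrows, int ncols)
{
    for (int row = 0; row < nrows; ++row) {
        float acc = 0.0f;
        for (int col = 0; col < ncols; ++col)
            acc += m[col + ncols * row] * v[col];
        result[row] = acc;
    }
}
```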

There are some obvious tricks, such as processing several rows at once, reusing the load from the vector, and creating several independent dependency chains so you can make better use of the throughput (see below). Also a really simple trick is using FMA for the mul/add combo, but support is not that widespread yet (it wasn't in 2015, but it is fairly widespread now in 2020).

You can build matrix-matrix multiplication from this (if you change the place the result goes), but that is not optimal (see further below).


Taking four rows at once (not tested):

for (int row = 0; row < nrows; row += 4) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    for (int col = 0; col < ncols; col += 4) {
        __m128 vec = _mm_load_ps(&v[col]);
        __m128 mat0 = _mm_load_ps(&m[col + ncols * row]);
        __m128 mat1 = _mm_load_ps(&m[col + ncols * (row + 1)]);
        __m128 mat2 = _mm_load_ps(&m[col + ncols * (row + 2)]);
        __m128 mat3 = _mm_load_ps(&m[col + ncols * (row + 3)]);
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(mat0, vec));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(mat1, vec));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(mat2, vec));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(mat3, vec));
    }
    acc0 = _mm_hadd_ps(acc0, acc1);
    acc2 = _mm_hadd_ps(acc2, acc3);
    acc0 = _mm_hadd_ps(acc0, acc2);
    _mm_store_ps(&result[row], acc0);
}

There are only 5 loads per 4 FMAs now, versus 2 loads per 1 FMA in the version that wasn't row-unrolled. Also there are 4 independent FMAs (or add/mul pairs without FMA contraction); either way, that increases the potential for pipelined/simultaneous execution. Actually you might want to unroll even more: for example, Skylake can start 2 independent FMAs per cycle and they take 4 cycles to complete, so to occupy both FMA units completely you need 8 independent FMAs. As a bonus, the 3 horizontal adds at the end combine the four accumulators into the final result fairly neatly.
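To see why the three horizontal adds produce exactly the four row sums, here is a scalar model of `_mm_hadd_ps` and of the reduction tree above (the helper names are made up for illustration):

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;

// Scalar model of _mm_hadd_ps(a, b): pairwise sums of a's lanes,
// then pairwise sums of b's lanes: { a0+a1, a2+a3, b0+b1, b2+b3 }.
Vec4 hadd(const Vec4& a, const Vec4& b)
{
    return { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
}

// The reduction tree from the loop above: three hadds turn four per-row
// accumulators into { sum(acc0), sum(acc1), sum(acc2), sum(acc3) }.
Vec4 reduce4(Vec4 acc0, Vec4 acc1, Vec4 acc2, Vec4 acc3)
{
    acc0 = hadd(acc0, acc1);
    acc2 = hadd(acc2, acc3);
    return hadd(acc0, acc2);
}
```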


The different data layout initially seems like a disadvantage: it's no longer possible to simply do vector loads from both the matrix and the vector and multiply them together (that would multiply a tiny row vector of the first matrix by a tiny row vector of the second matrix again, which is wrong). But full matrix-matrix multiplication can exploit the fact that it's essentially multiplying a matrix by lots of independent vectors, so it's full of independent work to be done. The horizontal sums can also be avoided easily. So it's actually even more convenient than matrix-vector multiplication.

The key is taking a little column vector from matrix A and a little row vector from matrix B and multiplying them out into a small matrix. That may sound reversed compared to what you're used to, but doing it this way works out better with SIMD because the computations stay independent and horizontal-operation-free the whole time.
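In scalar terms, the loop order being described looks like this (a sketch assuming row-major storage; the innermost `j` loop is what becomes a vector operation, and the broadcast of `A[i][k]` is what becomes `_mm_set1_ps`):

```cpp
#include <cassert>

// C (rows1 x cols2) = A (rows1 x cols1) * B (cols1 x cols2), row-major.
// For each k, the broadcast element A[i][k] scales the little row vector
// B[k][j..]; the lanes of C stay independent, so no horizontal ops are needed.
void matmul_broadcast_order(const float* A, const float* B, float* C,
                            int rows1, int cols1, int cols2)
{
    for (int i = 0; i < rows1 * cols2; ++i)
        C[i] = 0.0f;
    for (int i = 0; i < rows1; ++i)
        for (int k = 0; k < cols1; ++k) {
            float a = A[i * cols1 + k];        // becomes _mm_set1_ps
            for (int j = 0; j < cols2; ++j)    // becomes a vector of lanes
                C[i * cols2 + j] += a * B[k * cols2 + j];
        }
}
```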

For example (not tested; assumes the matrices have dimensions divisible by the unroll factors, and requires x64, otherwise it runs out of registers):

for (size_t i = 0; i < mat1rows; i += 4) {
    for (size_t j = 0; j < mat2cols; j += 8) {
        const float* mat1ptr = &mat1[i * mat1cols];
        // mat2 is row-major: element (k, j) lives at k * mat2cols + j
        size_t m2idx = j;
        __m128 sumA_1 = _mm_setzero_ps();
        __m128 sumB_1 = _mm_setzero_ps();
        __m128 sumA_2 = _mm_setzero_ps();
        __m128 sumB_2 = _mm_setzero_ps();
        __m128 sumA_3 = _mm_setzero_ps();
        __m128 sumB_3 = _mm_setzero_ps();
        __m128 sumA_4 = _mm_setzero_ps();
        __m128 sumB_4 = _mm_setzero_ps();

        for (size_t k = 0; k < mat2rows; ++k) {
            auto bc_mat1_1 = _mm_set1_ps(mat1ptr[0]);
            auto vecA_mat2 = _mm_load_ps(mat2 + m2idx);
            auto vecB_mat2 = _mm_load_ps(mat2 + m2idx + 4);
            sumA_1 = _mm_add_ps(_mm_mul_ps(bc_mat1_1, vecA_mat2), sumA_1);
            sumB_1 = _mm_add_ps(_mm_mul_ps(bc_mat1_1, vecB_mat2), sumB_1);
            auto bc_mat1_2 = _mm_set1_ps(mat1ptr[mat1cols]);
            sumA_2 = _mm_add_ps(_mm_mul_ps(bc_mat1_2, vecA_mat2), sumA_2);
            sumB_2 = _mm_add_ps(_mm_mul_ps(bc_mat1_2, vecB_mat2), sumB_2);
            auto bc_mat1_3 = _mm_set1_ps(mat1ptr[mat1cols * 2]);
            sumA_3 = _mm_add_ps(_mm_mul_ps(bc_mat1_3, vecA_mat2), sumA_3);
            sumB_3 = _mm_add_ps(_mm_mul_ps(bc_mat1_3, vecB_mat2), sumB_3);
            auto bc_mat1_4 = _mm_set1_ps(mat1ptr[mat1cols * 3]);
            sumA_4 = _mm_add_ps(_mm_mul_ps(bc_mat1_4, vecA_mat2), sumA_4);
            sumB_4 = _mm_add_ps(_mm_mul_ps(bc_mat1_4, vecB_mat2), sumB_4);
            m2idx += mat2cols;  // advance to the next row of mat2
            mat1ptr++;          // advance to the next column of mat1 within row i
        }
        _mm_store_ps(&result[i * mat2cols + j], sumA_1);
        _mm_store_ps(&result[i * mat2cols + j + 4], sumB_1);
        _mm_store_ps(&result[(i + 1) * mat2cols + j], sumA_2);
        _mm_store_ps(&result[(i + 1) * mat2cols + j + 4], sumB_2);
        _mm_store_ps(&result[(i + 2) * mat2cols + j], sumA_3);
        _mm_store_ps(&result[(i + 2) * mat2cols + j + 4], sumB_3);
        _mm_store_ps(&result[(i + 3) * mat2cols + j], sumA_4);
        _mm_store_ps(&result[(i + 3) * mat2cols + j + 4], sumB_4);
    }
}

The point of that code is that it's easy to arrange the computation to be very SIMD-friendly, with lots of independent arithmetic to saturate the floating-point units, while using relatively few loads (which could otherwise become a bottleneck just by there being too many of them, even putting aside that they might miss L1 cache).

You can even use this code, but it's not competitive with Intel MKL, especially for medium or big matrices, where tiling is extremely important. It's easy to upgrade this to AVX. It's not suitable for tiny matrices at all; for example, to multiply two 4x4 matrices, see Efficient 4x4 matrix multiplication.
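As a rough illustration of the tiling idea (a scalar sketch only; real implementations tune the tile size to the cache hierarchy and combine tiling with the SIMD kernel above, and the `TILE` constant here is an arbitrary placeholder):

```cpp
#include <algorithm>
#include <cassert>

// Blocked n x n matmul, C = A * B, row-major. Working through square
// TILE x TILE blocks keeps each block triple resident in cache instead of
// streaming whole rows/columns of B through it for every output row.
constexpr int TILE = 64;  // placeholder; tune to L1/L2 size

void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    for (int i = 0; i < n * n; ++i)
        C[i] = 0.0f;
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                // same broadcast loop order as before, restricted to a block
                for (int i = ii; i < std::min(ii + TILE, n); ++i)
                    for (int k = kk; k < std::min(kk + TILE, n); ++k) {
                        float a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + TILE, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```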
