SSE matrix-matrix multiplication


Problem Description


I'm having trouble doing matrix-matrix multiplication with SSE in C.

Here is what I got so far:

#include <smmintrin.h>  // SSE4.1 intrinsics (this code uses _mm_mul_epi32, _mm_hadd_epi32, _mm_extract_epi32)

#define N 1000

void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {
  int i, j, k;
  __m128i vA, vB, vR;

  for(i = 0; i < N; ++i) {
    for(j = 0; j < N; ++j) {
        vR = _mm_setzero_si128();
        for(k = 0; k < N; k += 4) {
            //result[i][j] += mat1[i][k] * mat2[k][j];
            vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
            vB = _mm_loadu_si128((__m128i*)&mat2[k][j]); //how well does the k += 4 work here? Should it be unrolled?
            vR = _mm_add_epi32(vR, _mm_mul_epi32(vA, vB));
        }
        vR = _mm_hadd_epi32(vR, vR);
        vR = _mm_hadd_epi32(vR, vR);
        result[i][j] += _mm_extract_epi32(vR, 0);
    }
  }
}

I can't seem to make it give the correct results. Am I missing something? And searching doesn't seem to help much - every result is either only doing 4x4 matrices, mat-vec, or some special magic that's not very readable and hard to understand...

Solution

You're right, your vB is the problem. You're loading 4 consecutive integers, but mat2[k+0..3][j] aren't contiguous. You're actually getting mat2[k][j+0..3].


I forget what works well for matmul. Sometimes it works well to produce 4 results in parallel, instead of doing a horizontal sum for every result.

Transposing one of your input matrices works, and costs O(N^2). It's worth it because it means the O(N^3) matmul can use sequential accesses, and your current loop structure becomes SIMD-friendly.
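
For example, here's a minimal sketch of the transpose approach, reusing the question's loop structure and its #define N 1000 (matmulSSE_transposed and mat2T are names I made up; it assumes N is a multiple of 4 and that SSE4.1 is available for _mm_mullo_epi32):

#include <smmintrin.h>  // SSE4.1: _mm_mullo_epi32 (also covers SSSE3 _mm_hadd_epi32)

void matmulSSE_transposed(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    static int mat2T[N][N];              // one-time O(N^2) transpose of mat2
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            mat2T[c][r] = mat2[r][c];

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            __m128i vR = _mm_setzero_si128();
            for (int k = 0; k < N; k += 4) {
                __m128i vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);   // mat1[i][k+0..3]
                __m128i vB = _mm_loadu_si128((__m128i*)&mat2T[j][k]);  // contiguous after the transpose
                vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));       // low 32 bits of 32x32
            }
            vR = _mm_hadd_epi32(vR, vR);   // sum the 4 partial sums
            vR = _mm_hadd_epi32(vR, vR);
            result[i][j] = _mm_cvtsi128_si32(vR);
        }
    }
}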

There are even better ways, such as transposing small blocks right before use, so they're still hot in L1 cache when you read them again. Or loop over a destination row, adding in one result at a time, instead of accumulating a full result for a single (or small set of) row*column dot products. Cache blocking, aka loop tiling, is one key to good matmul performance. See also What Every Programmer Should Know About Memory?, which has a cache-blocked SIMD FP matmul example (without a transpose) in an appendix.
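
To make the tiling idea concrete, here's a hypothetical scalar sketch (matmul_tiled and the tile size B are illustrative; it assumes result is zeroed first, and the inner bounds handle N not being a multiple of B). Vectorizing its innermost j loop works the same way as in the code further down:

void matmul_tiled(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    const int B = 64;  // tile size: tune so the tiles being worked on fit in L1/L2
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                // work on one B x B tile at a time so it stays hot in cache
                for (int i = ii; i < ii + B && i < N; ++i)
                    for (int k = kk; k < kk + B && k < N; ++k) {
                        int a = mat1[i][k];                  // reused across the whole j loop
                        for (int j = jj; j < jj + B && j < N; ++j)
                            result[i][j] += a * mat2[k][j];  // sequential access to a mat2 row
                    }
}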

Much has been written about optimizing matrix multiplies, with SIMD and with cache-blocking. I suggest you google it. Most of it is probably talking about FP, but it all applies to integer as well.

(Except that SSE/AVX only has FMA for FP, not for 32-bit integers, and the 8 and 16-bit input PMADD instructions do horizontal adds of pairs.)


Actually, I think you can produce 4 results in parallel here without transposing either input, by broadcasting one element of mat1 against four contiguous elements of a row of mat2:

#include <smmintrin.h>  // SSE4.1, needed for _mm_mullo_epi32

void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {

  for(int i = 0; i < N; ++i) {
    for(int j = 0; j < N; j+=4) {   // vectorize over this loop
        __m128i vR = _mm_setzero_si128();
        for(int k = 0; k < N; k++) {   // not this loop
            //result[i][j] += mat1[i][k] * mat2[k][j];
            __m128i vA = _mm_set1_epi32(mat1[i][k]);  // load+broadcast is much cheaper than MOVD + 3 inserts (or especially 4x insert, which your new code is doing)
            __m128i vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);  // mat2[k][j+0..3]
            vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
        }
        _mm_storeu_si128((__m128i*)&result[i][j], vR);  // result[i][j+0..3]
    }
  }
}
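
(Note that this version stores complete dot products, so result doesn't need to be zeroed or accumulated with += first. It also assumes N is a multiple of 4, which holds for N = 1000; otherwise the j loop would need a scalar cleanup for the last few columns.)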

A broadcast-load (or separate load+broadcast without AVX) is still much cheaper than a gather.

Your current code does the gather with 4 inserts, instead of breaking the dependency chain on the previous iteration's value by using a MOVD for the first element, so that's even worse. But even the best gather of 4 scattered elements is pretty bad compared to a load + PUNPCKLDQ. Not to mention that it makes your code need SSE4.1.
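
For reference, here's one way such a gather could be built from only SSE2 MOVD loads and PUNPCK shuffles (gather4 is a hypothetical helper, not code from the question). It avoids both the SSE4.1 requirement and any dependency on the previous iteration's vector, but it's still far more expensive than one contiguous load:

#include <emmintrin.h>  // SSE2

static inline __m128i gather4(const int *p0, const int *p1,
                              const int *p2, const int *p3) {
    __m128i v0 = _mm_cvtsi32_si128(*p0);      // MOVD load: no dep on any old vector
    __m128i v1 = _mm_cvtsi32_si128(*p1);
    __m128i v2 = _mm_cvtsi32_si128(*p2);
    __m128i v3 = _mm_cvtsi32_si128(*p3);
    __m128i lo = _mm_unpacklo_epi32(v0, v1);  // PUNPCKLDQ: { *p0, *p1, .. }
    __m128i hi = _mm_unpacklo_epi32(v2, v3);  // { *p2, *p3, .. }
    return _mm_unpacklo_epi64(lo, hi);        // PUNPCKLQDQ: { *p0, *p1, *p2, *p3 }
}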

Although it needs SSE4.1 anyway for _mm_mullo_epi32 instead of the widening PMULDQ (_mm_mul_epi32).

Note that integer multiply throughput is generally worse than FP multiply, especially on Haswell and later. FP FMA units only have 24-bit wide multipliers per 32-bit element (for FP mantissas) so using those for 32x32=>32-bit integer requires splitting into two uops.
