矩阵乘法:为什么非阻塞优于阻塞? [英] Matrix-Multiplication: Why non-blocked outperforms blocked?

查看：75 发布时间：2021/4/21 18:34:20 c caching matrix-multiplication cpu-architecture

本文介绍了矩阵乘法:为什么非阻塞优于阻塞?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试通过阻塞循环以提高缓存性能来加速矩阵乘法算法，但无论矩阵大小，块大小如何，非阻塞版本都保持明显更快的速度(我尝试了2至2之间的许多值200、2和其他的潜能)和优化级别.

I'm trying to speed up a matrix multiplication algorithm by blocking the loops to improve cache performance, yet the non-blocked version remains significantly faster regardless of matrix size, block size (I've tried lots of values between 2 and 200, potenses of 2 and others) and optimization level.

非阻止版本:

  for(size_t i = 0; i < n; ++i)
  {
    for(size_t k = 0; k < n; ++k)
    {
      int r = a[i][k];
      for(size_t j = 0; j < n; ++j)
      {
        c[i][j] += r * b[k][j];
      }
    }
  }

阻止版本:

  for(size_t kk = 0; kk < n; kk += BLOCK)
  {
    for(size_t jj = 0; jj < n; jj += BLOCK)
    {
      for(size_t i = 0; i < n; ++i)
      {
        for(size_t k = kk; k < kk + BLOCK; ++k)
        {
          int r = a[i][k];
          for(size_t j = jj; j < jj + BLOCK; ++j)
          {
            c[i][j] += r * b[k][j];
          }
        }
      }
    }
  }

我也有bijk版本和6循环bikj版本，但是它们都比非阻塞版本的表现要好，我不知道为什么会这样.我遇到的每篇论文和教程似乎都表明，被阻止的版本应该明显更快.如果这很重要，我将在Core i5上运行它.

I also have a bijk version and a 6-loops bikj version but they all gets outperformed by the non-blocked version and I don't get why this happens. Every paper and tutorial that I've come across seems to indicate that the the blocked version should be significantly faster. I'm running this on a Core i5 if that matters.

推荐答案

尝试仅在一个维度上进行阻止，而不是在两个维度上进行.

Try blocking in one dimension only, not in both dimensions.

矩阵乘法详尽地处理了两个矩阵中的元素.左矩阵上的每个行向量都被重复处理，并放入右矩阵的连续列中.

Matrix multiplication exhaustively processes elements from both matrices. Each row vector on the left matrix is repeatedly processed, taken into successive columns of the right matrix.

如果矩阵都不适合缓存，则某些数据将总是多次加载.

If the matrices do not both fit into the cache, some data will invariably end up loaded multiple times.

我们所能做的就是分解操作，以便一次处理大约一个缓存大小的数据.我们希望从左操作数开始的行向量被缓存，因为它被反复应用于多个列.但是，我们一次只能使用足够多的列以保持在缓存的限制内.例如，如果我们只能占用25％的列，则意味着我们将不得不越过行向量四次.我们最终从内存中加载了左矩阵四次，而右矩阵只加载了一次.

What we can do is break up the operation so that we work with about a cache-sized amount of data at one time. We want the row vector from the left operand to be cached, since it is repeatedly applied against multiple columns. But we should only take enough columns (at a time) to stay within the limit of the cache. For instance, if we can only take 25% of the columns, it means we will have to pass over the row vectors four times. We end up loading the left matrix from memory four times, and the right matrix only once.

(如果要多次加载任何东西，则应该是左侧的行向量，因为它们在内存中是平坦的，这得益于突发加载.许多缓存体系结构可以执行从内存到相邻缓存的突发加载.比随机访问负载更快的行速度.如果以列优先顺序存储正确的矩阵，那会更好:然后，我们在平面数组之间进行叉积运算，从而很好地预取到内存中.)

(If anything is to be loaded more than once, it should be the row vectors on the left, because they are flat in memory, which benefits from burst loading. Many cache architectures can perform a burst load from memory into adjacent cache lines faster than random access loads. If the right matrix were stored in column-major order, that would be even better: then we are doing cross-products between flat arrays, which prefetch into memory nicely.)

我们也不要忘记输出矩阵.输出矩阵也占用缓存中的空间.

Let's also not forget the output matrix. The output matrix occupies space in the cache also.

我怀疑2D阻止方法的一个缺陷是输出矩阵的每个元素都取决于两个输入:左矩阵中的整个整行和右矩阵中的整个列.如果以块为单位访问矩阵，则意味着要多次访问每个目标元素以累积部分结果.

I suspect one flaw in the 2D blocked approach is that each element of the output matrix depends on two inputs: its entire entire row in the left matrix, and the entire column in the right matrix. If the matrices are visited in blocks, that means that each target element is visited multiple times to accumulate the partial result.

如果我们要完成一个完整的行列点积，则不必多次访问 c [i] [j] ；一旦将列 j 放入行 i ，我们就完成了该 c [i] [j] .

If we do a complete row-column dot product, we don't have to visit the c[i][j] more than once; once we take column j into row i, we are done with that c[i][j].

这篇关于矩阵乘法:为什么非阻塞优于阻塞?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

矩阵乘法:为什么非阻塞优于阻塞? [英] Matrix-Multiplication: Why non-blocked outperforms blocked?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

矩阵乘法:为什么非阻塞优于阻塞? [英] Matrix-Multiplication: Why non-blocked outperforms blocked?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭