A Cache Efficient Matrix Transpose Program?


Problem Description

So the obvious way to transpose a matrix is to use:

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        destination[j + i*n] = source[i + j*n];

but I want something that will take advantage of locality and cache blocking. I was looking it up and can't find code that would do this, but I'm told it should be a very simple modification to the original. Any ideas?

I have a 2000x2000 matrix, and I want to know how I can change the code using two for loops, basically splitting the matrix into blocks that I transpose individually, say 2x2 blocks, or 40x40 blocks, and see which block size is most efficient.

Edit 2: The matrices are stored in column-major order, that is to say for a matrix

a1 a2    
a3 a4

is stored as a1 a3 a2 a4.

Recommended Answer

You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:

for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < n; j += blocksize) {
        // transpose the block beginning at [i,j]
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*n];
            }
        }
    }
}
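
Since the question also asks which block size is most efficient, one way to find out empirically is to wrap the loop above in a function and time it for a few candidate sizes. Below is a minimal sketch of such a harness; the function name blocked_transpose, the candidate sizes, and the use of std::chrono are illustrative choices rather than part of the answer, and it assumes (as the answer does) that each block size divides n evenly:

#include <chrono>
#include <cstdio>
#include <vector>

// The blocked transpose from the answer, wrapped in a function.
void blocked_transpose(double* dst, const double* src, int n, int blocksize) {
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l*n] = src[l + k*n];
}

int main() {
    const int n = 2000;
    std::vector<double> src(n * n, 1.0), dst(n * n);
    // Candidate block sizes; all of these divide 2000 evenly.
    for (int blocksize : {8, 16, 40, 100, 250}) {
        auto start = std::chrono::steady_clock::now();
        blocked_transpose(dst.data(), src.data(), n, blocksize);
        auto stop = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        std::printf("blocksize %4d: %8.2f ms\n", blocksize, ms);
    }
}

In practice you would run each size several times and take the best or median time, since a single measurement is noisy.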

An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment tweaking any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.
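
For completeness, here is a sketch of what that recursive, cache-oblivious transpose might look like. This is an illustration of the idea described in the Wikipedia article, not code from the answer; the function name transpose_rec and the base-case cutoff of 16 are assumptions, and the indexing follows the same column-major convention as above:

// Cache-oblivious transpose sketch (column-major, n x n matrices).
// Copies the rows x cols block of src whose top-left element is at
// (row r0, column c0) into its transposed position in dst.
void transpose_rec(double* dst, const double* src, int n,
                   int r0, int c0, int rows, int cols) {
    if (rows <= 16 && cols <= 16) {
        // Base case: the block is small, transpose it directly.
        for (int c = c0; c < c0 + cols; ++c)
            for (int r = r0; r < r0 + rows; ++r)
                dst[c + r*n] = src[r + c*n];
    } else if (rows >= cols) {
        // Split the taller dimension in half and recurse on each half.
        transpose_rec(dst, src, n, r0, c0, rows / 2, cols);
        transpose_rec(dst, src, n, r0 + rows / 2, c0, rows - rows / 2, cols);
    } else {
        // Otherwise split the wider dimension.
        transpose_rec(dst, src, n, r0, c0, rows, cols / 2);
        transpose_rec(dst, src, n, r0, c0 + cols / 2, rows, cols - cols / 2);
    }
}

// Usage: transpose_rec(dst, src, n, 0, 0, n, n);

The recursion keeps halving until a block fits comfortably in cache at some level of the hierarchy, whichever level that turns out to be, which is why no block-size parameter is needed.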

Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.
