A Cache Efficient Matrix Transpose Program?


Question

So the obvious way to transpose a matrix is to use:

  for( int i = 0; i < n; i++ )
    for( int j = 0; j < n; j++ )
      destination[j+i*n] = source[i+j*n];

but I want something that will take advantage of locality and cache blocking. I was looking it up and can't find code that would do this, but I'm told it should be a very simple modification to the original. Any ideas?

EDIT: I have a 2000x2000 matrix, and I want to know how I can change the code using two for loops, basically splitting the matrix into blocks that I transpose individually, say 2x2 blocks or 40x40 blocks, and see which block size is most efficient.

EDIT2: The matrices are stored in column-major order, that is to say, for a matrix

a1 a2    
a3 a4

it is stored as a1 a3 a2 a4
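In code terms (a one-line helper of my own, only to pin down the indexing convention the loops below rely on, not code from the question):

// Column-major n x n matrix: element (row r, col c) lives at index r + c*n,
// so the 2x2 example above sits in memory as { a1, a3, a2, a4 }.
inline double& elem(double* m, int n, int r, int c) { return m[r + c*n]; }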

Answer

You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:

for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < n; j += blocksize) {
        // transpose the block beginning at [i,j]
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*n];
            }
        }
    }
}
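To answer the block-size question from the edit, a minimal timing harness along these lines should do (the wrapper name transpose_blocked, the use of std::chrono, and the particular block sizes tried are my additions rather than part of the original answer; it assumes, as above, that blocksize divides n):

#include <chrono>
#include <cstdio>
#include <vector>

// Blocked transpose of a column-major n x n matrix; assumes blocksize divides n.
static void transpose_blocked(double* dst, const double* src, int n, int blocksize) {
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l*n] = src[l + k*n];
}

int main() {
    const int n = 2000;
    std::vector<double> src(n * n), dst(n * n);
    for (int i = 0; i < n * n; ++i) src[i] = i;        // arbitrary fill

    for (int blocksize : {2, 8, 16, 40, 100}) {        // all of these divide 2000
        auto t0 = std::chrono::steady_clock::now();
        transpose_blocked(dst.data(), src.data(), n, blocksize);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("blocksize %3d: %.2f ms\n", blocksize, ms);
    }
}

Run each size a few times and keep the best measurement; a single run is noisy, but it is enough to see which block sizes stop thrashing the cache.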

An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment tweaking any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.

Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.
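For completeness, a sketch of what that recursion might look like for the column-major layout above (the function name, the small-tile cutoff of 16, and the choice to split the longer dimension first are my guesses at a reasonable implementation, not code from the answer):

// Cache-oblivious transpose: copy the rows x cols tile of src starting at
// (r0, c0) into its transposed position in dst; both are column-major with
// leading dimension n.
void transpose_rec(double* dst, const double* src, int n,
                   int r0, int c0, int rows, int cols) {
    if (rows <= 16 && cols <= 16) {                    // small tile: plain copy
        for (int r = r0; r < r0 + rows; ++r)
            for (int c = c0; c < c0 + cols; ++c)
                dst[c + r*n] = src[r + c*n];
    } else if (rows >= cols) {                         // split the longer dimension
        int half = rows / 2;
        transpose_rec(dst, src, n, r0, c0, half, cols);
        transpose_rec(dst, src, n, r0 + half, c0, rows - half, cols);
    } else {
        int half = cols / 2;
        transpose_rec(dst, src, n, r0, c0, rows, half);
        transpose_rec(dst, src, n, r0, c0 + half, rows, cols - half);
    }
}

Call it as transpose_rec(dst, src, n, 0, 0, n, n). The 16x16 cutoff only amortizes the recursion overhead; the cache behaviour itself does not depend on tuning it, which is the point of the cache-oblivious approach.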

