Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?


Question

After conducting some experiments on square matrices of different sizes, a pattern came up. Invariably, transposing a matrix of size 2^n is slower than transposing one of size 2^n+1. For small values of n, the difference is not major.

However, big differences occur above a size of 512 (at least for me).

Disclaimer: I know the function doesn't actually transpose the matrix because of the double swap of elements, but it makes no difference.

Here is the code:

#define SAMPLES 1000
#define MATSIZE 512

#include <time.h>
#include <iostream>
int mat[MATSIZE][MATSIZE];

void transpose()
{
   //swaps every pair (i,j)/(j,i) twice, so the matrix ends up unchanged;
   //only the memory access pattern matters for this benchmark
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
   {
       int aux = mat[i][j];
       mat[i][j] = mat[j][i];
       mat[j][i] = aux;
   }
}

int main()
{
   //initialize matrix with some arbitrary values
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
       mat[i][j] = i+j;

   clock_t t = clock();
   for ( int i = 0 ; i < SAMPLES ; i++ )
       transpose();
   clock_t elapsed = clock() - t;

   //clock() counts ticks; on MSVC, CLOCKS_PER_SEC is 1000, so this prints ms
   std::cout << "Average for a matrix of " << MATSIZE << ": " << elapsed / SAMPLES;
}

Changing MATSIZE lets us alter the size (duh!). I posted two versions on ideone, one for each size.

In my environment (MSVS 2010, full optimizations), the difference is similar:

  • size 512 - average 2.19 ms
  • size 513 - average 0.57 ms

Why is this happening?

Solution

The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.

For terms and detailed info, see the wiki entry on caching; I'll narrow it down here.

A cache is organized in sets and lines. At a time, only one set is used, out of which any of the lines it contains can be used. The memory a line can mirror times the number of lines gives us the cache size (for example, 32 sets × 4 lines per set × 64 bytes per line = 8 KB, the cache used in the example below).

For a particular memory address, we can calculate which set it should be mirrored in with the formula:

set = ( address / lineSize ) % numberOfSets

This sort of formula gives an ideally uniform distribution across the sets, because each memory address is equally likely to be read (I said ideally).

It's clear that overlaps can occur. In case of a cache miss, the memory is read in the cache and the old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.

I'll try to somewhat follow the example from Agner:

Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. And then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).
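As a sanity check of that arithmetic, here is a minimal sketch of the set-mapping formula in code. The geometry (64-byte lines, 32 sets, i.e. the 8 KB, 4-way cache used in the matrix example further down) is an assumption for illustration, not a query of real hardware:

#include <cstdint>
#include <iostream>

// Assumed geometry: 64-byte lines, 32 sets (8 KB with 4 lines per set).
const std::uint64_t lineSize = 64;
const std::uint64_t numberOfSets = 32;

// set = ( address / lineSize ) % numberOfSets
std::uint64_t cacheSet( std::uint64_t address )
{
   return ( address / lineSize ) % numberOfSets;
}

int main()
{
   for ( std::uint64_t a : { 0x2710, 0x2F00, 0x3700, 0x3F00, 0x4700 } )
      std::cout << std::hex << "0x" << a << " -> set "
                << std::dec << cacheSet( a ) << "\n"; //all five print set 28
}

All five addresses land in set 28, and consecutive ones are 0x800 bytes apart, matching the example.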

The critical stride can also be calculated:

criticalStride = numberOfSets * lineSize

Variables spaced criticalStride or a multiple apart contend for the same cache lines.
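Under the same assumed geometry (32 sets of 64-byte lines), this works out to:

criticalStride = numberOfSets * lineSize = 32 * 64 = 2048 bytes = 0x800

which is exactly the spacing between the conflicting addresses in the example above.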

This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):

Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8 KB cache, 4 lines per set and a line size of 64 bytes. Each line can hold 8 of the elements in the matrix (64-bit int).

The critical stride would be 2048 bytes, which corresponds to 4 rows of the matrix (a row is 64 elements × 8 bytes = 512 bytes, and rows are contiguous in memory).

Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they'll go into 8 different cache lines in column 28. Remember, the critical stride is 4 rows, so column elements 4 apart contend for the same set.

When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble), the line holding element 0 will be evicted from the cache. When we reach the end of the column, all previous cache lines will have been lost and will need reloading on access to the next element (the whole line is overwritten).

Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster: we're no longer dealing with elements that are a critical stride apart on the vertical, so the number of cache reloads is severely reduced.
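To put numbers on that, the sketch below computes, under the same assumed geometry (64-byte lines, 32 sets, 4 lines per set, 8-byte elements), which set each of the first 20 elements of column 28 falls into, for a matrix width of 64 and of 65. It illustrates the argument; it is not a measurement:

#include <cstdint>
#include <iostream>

int main()
{
   const std::uint64_t lineSize = 64;     //assumed, as in the example
   const std::uint64_t numberOfSets = 32; //8 KB / (4 lines * 64 bytes)
   const std::uint64_t elemSize = 8;      //64-bit int, as in the example

   for ( std::uint64_t width : { 64, 65 } )
   {
      std::cout << "width " << width << ", column 28 -> sets:";
      for ( std::uint64_t row = 0 ; row < 20 ; row++ )
      {
         //offset of mat[row][28] from mat[0][0]; the absolute set depends
         //on the base address, but the conflict pattern does not
         std::uint64_t offset = ( row * width + 28 ) * elemSize;
         std::cout << " " << ( ( offset / lineSize ) % numberOfSets );
      }
      std::cout << "\n";
   }
}

With width 64 the sets repeat as 3 11 19 27 3 11 19 27 ...: rows 0, 4, 8 and 12 all land in the same set, filling its 4 lines, so row 16 evicts row 0's line, exactly as described above. With width 65 the indices drift (3 11 19 27 4 12 20 28 4 12 20 28 5 ...), so no set collects more than two column elements in quick succession and its 4 lines are never exhausted.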

Another disclaimer - I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)
