为什么 2048x2048 与 2047x2047 阵列乘法相比,性能会受到巨大影响? [英] Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?

查看:23
本文介绍了为什么 2048x2048 与 2047x2047 阵列乘法相比,性能会受到巨大影响?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在做一些矩阵乘法基准测试,正如前面提到的为什么 MATLAB 的矩阵乘法速度如此之快?

I am making some matrix multiplication benchmarking, as previously mentioned in Why is MATLAB so fast in matrix multiplication?

现在我有另一个问题,当两个 2048x2048 矩阵相乘时,C# 和其他的有很大的不同.当我尝试只乘以 2047x2047 矩阵时,这似乎很正常.也添加了一些其他的进行比较.

Now I've got another issue, when multiplying two 2048x2048 matrices, there is a big difference between C# and others. When I try multiply only 2047x2047 matrices, it seems normal. Added some others for comparsion too.

1024x1024 - 10 秒.

1024x1024 - 10 seconds.

1027x1027 - 10 秒.

1027x1027 - 10 seconds.

2047x2047 - 90 秒.

2047x2047 - 90 seconds.

2048x2048 - 300 秒.

2048x2048 - 300 seconds.

2049x2049 - 91 秒.(更新)

2049x2049 - 91 seconds. (update)

2500x2500 - 166 秒

2500x2500 - 166 seconds

对于 2k x 2k 的情况,这是三分半钟的差异.

That is three and a half minute difference for the 2k by 2k case.

使用二维数组

//Array init like this
int rozmer = 2048;
float[,] matice = new float[rozmer, rozmer];

//Main multiply code
for(int j = 0; j < rozmer; j++)
{
   for (int k = 0; k < rozmer; k++)
   {
     float temp = 0;
     for (int m = 0; m < rozmer; m++)
     {
       temp = temp + matice1[j,m] * matice2[m,k];
     }
     matice3[j, k] = temp;
   }
 }

推荐答案

这可能与 L2 缓存中的冲突有关.

This probably has do with conflicts in your L2 cache.

matice1 上的缓存未命中不是问题,因为它们是按顺序访问的.但是,对于 matice2,如果整个列适合 L2(即当您访问 matice2[0, 0]、matice2[1, 0]、matice2[2, 0] ... 等时,没有任何问题被驱逐)比没有问题使用 matice2 缓存未命中.

Cache misses on matice1 are not the problem because they are accessed sequentially. However for matice2 if a full column fits in L2 (i.e when you access matice2[0, 0], matice2[1, 0], matice2[2, 0] ... etc, nothing gets evicted) than there is no problem with cache misses with matice2 either.

现在更深入地了解缓存的工作原理,如果变量的字节地址是 X,那么它的缓存行将是 (X >> 6) &(L - 1).其中 L 是缓存中的缓存行总数.L 总是 2 的幂.这六个来自于 2^6 == 64 字节是缓存行的标准大小.

Now to go deeper in how caches works, if byte address of your variable is X, than the cache line for it would be (X >> 6) & (L - 1). Where L is total number of cache lines in your cache. L is always power of 2. The six comes from fact that 2^6 == 64 bytes is standard size of cache line.

现在这是什么意思?那么这意味着如果我有地址 X 和地址 Y 并且(X >> 6) - (Y >> 6) 可被 L 整除(即 2 的某个大幂),它们将存储在同一缓存行中.

Now what does this mean? Well it means that if I have address X and address Y and (X >> 6) - (Y >> 6) is divisible by L (i.e. some large power of 2), they will be stored in the same cacheline.

现在回到你的问题,2048 和 2049 有什么区别,

Now to go back to your problem what is the difference between 2048 and 2049,

当 2048 是您的尺寸时:

when 2048 is your size:

如果你取 &matice2[x, k] 和 &matice2[y, k] 的区别 (&matice2[x, k] >> 6) - (&matice2[y,k] >>6) 将被 2048 * 4(浮点大小)整除.所以是 2 的大幂.

if you take &matice2[x, k] and &matice2[y, k] the difference (&matice2[x, k] >> 6) - (&matice2[y,k] >> 6) will be divisible by 2048 * 4 (size of float). So a large power of 2.

因此,根据 L2 的大小,您将有很多缓存行冲突,并且仅使用 L2 的一小部分来存储列,因此您实际上无法在缓存中存储完整列,因此您将表现不佳.

Thus depending on size of your L2 you will have a lot of cache line conflicts, and only utilize small portion of your L2 to store a column, thus you wont actually be able to store full column in your cache, thus you will get bad performance.

当大小为 2049 时,差异为 2049 * 4,它不是 2 的幂,因此您将减少冲突并且您的列将安全地放入您的缓存中.

When size is 2049, then the difference is 2049 * 4 which is not power of 2 thus you will have less conflicts and your column will safely fit into your cache.

现在要测试这个理论,你可以做几件事:

Now to test this theory there are couple things you can do:

像这样 matice2 [razmor, 4096] 分配你的数组 matice2 数组,并使用 razmor = 1024、1025 或任何大小运行,你应该看到与之前相比非常糟糕的性能.这是因为您强行对齐所有列以相互冲突.

Allocate your array matice2 array like this matice2 [razmor, 4096], and run with razmor = 1024, 1025 or any size, and you should see very bad performance compared to what you had before. This is because you forcefully align all columns to conflict with each other.

然后尝试 matice2 [razmor, 4097] 并以任何大小运行它,您应该会看到更好的性能.

Then try matice2 [razmor, 4097] and run it with any size and you should see much better performance.

这篇关于为什么 2048x2048 与 2047x2047 阵列乘法相比,性能会受到巨大影响?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆