Why is the performance of these matrix multiplications so different?


Question

I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T, where T is the transpose of A.

Let's say we have a square matrix M and we want the product M.mult(M). Call the product P.

When M is a Mat1 instance, the algorithm used is the straightforward one:

P[i][j] += M.A[i][k] * M.A[k][j]
    for k in range(0, M.A.length)

In the case where M is a Mat2, I used:

P[i][j] += M.A[i][k] * M.T[j][k]

which is the same algorithm because T[j][k] == A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it so much faster?
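
For reference, here is a minimal sketch in Java of the two inner loops as I understand them. This is my reconstruction, not the actual Mat1/Mat2 classes from the question; the class name MatMulSketch and the field names A and T just follow the description above.

class MatMulSketch {
    // Mat1-style product: P[i][j] = sum over k of A[i][k] * A[k][j].
    // Reading A[k][j] jumps to a different row of A on every step of k.
    static double[][] multMat1(double[][] A) {
        int n = A.length;
        double[][] P = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    P[i][j] += A[i][k] * A[k][j];
        return P;
    }

    // Mat2-style product: same result, but the second operand is read from
    // the transpose T, so both A[i] and T[j] are walked left to right.
    static double[][] multMat2(double[][] A, double[][] T) {
        int n = A.length;
        double[][] P = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    P[i][j] += A[i][k] * T[j][k]; // T[j][k] == A[k][j]
        return P;
    }
}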

My only guess is that the second one makes better use of the CPU caches, since data is pulled into the caches in chunks larger than one word. The second algorithm benefits from this by traversing only rows, while the first one ignores the data just pulled into the cache by immediately jumping to the row below (which is ~1000 words away in memory, because arrays are stored in row-major order), none of which is cached.

I asked someone, and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I hadn't thought of this at all, but I can sort of see how it would result in fewer TLB faults.

So, which is it? Or is there some other reason for the performance difference?

Answer

This is because of the locality of your data.

In RAM a matrix, although two-dimensional from your point of view, is of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is calculated by combining the two indices that you use.

This means that if you access the element at position x,y it will calculate x*row_length + y, and this will be the offset used to reference the element at the specified position.
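
As a small illustration of that offset formula, here is a self-contained sketch (the class name, sizes, and indices are made up for illustration; note also that Java's double[][] is really an array of separate row arrays, each contiguous on its own, so the same reasoning applies row by row):

class RowMajorOffset {
    public static void main(String[] args) {
        int rows = 1000, cols = 1000;
        double[] flat = new double[rows * cols]; // row-major: row x starts at x * cols

        int x = 3, y = 7;                        // element at row 3, column 7
        double v = flat[x * cols + y];           // offset = x * row_length + y = 3007

        System.out.println("offset = " + (x * cols + y) + ", value = " + v);
    }
}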

What happens is that a big matrix isn't stored in just one page of memory (this is how your OS manages RAM, by splitting it into chunks), so it has to load the correct page into the CPU cache if you try to access an element that is not already present.

As long as you traverse contiguously while doing your multiplication, you don't create any problems, since you mostly use all the coefficients of a page and then switch to the next one. But if you invert the indices, every single element may be located in a different memory page, so it has to ask RAM for a different page almost every single multiplication you do; this is why the difference is so pronounced.
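
To see this effect in isolation, here is a toy sketch of my own (not from the original answer) that traverses the same array row-by-row and then column-by-column. The absolute timings are machine-dependent and this is not a rigorous benchmark, but the row-wise pass is typically much faster:

class TraversalOrderDemo {
    public static void main(String[] args) {
        int n = 1000;
        double[][] a = new double[n][n];

        long t0 = System.nanoTime();
        double rowSum = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                rowSum += a[i][j];   // unit stride: each cache line is fully used

        long t1 = System.nanoTime();
        double colSum = 0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                colSum += a[i][j];   // jumps to a different row (often a different page) each step

        long t2 = System.nanoTime();
        System.out.println("row-wise:    " + (t1 - t0) / 1e6 + " ms (sum " + rowSum + ")");
        System.out.println("column-wise: " + (t2 - t1) / 1e6 + " ms (sum " + colSum + ")");
    }
}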

(I have rather simplified the whole explanation; this is just to give you the basic idea behind the problem.)

In any case I don't think this is caused by the JVM itself. It may be related to how your OS manages the memory of the Java process.

