How do cache size and array size affect the performance of mathematical operations on an array?


Question

I am trying to learn how caches work. From some sample experiments, I see that the time taken by a program that iterates through an array and performs some operations on its elements suddenly increases sharply once the array size grows beyond a particular value. Can anyone explain in simple terms how the cache size and array size affect the performance of mathematical operations on an array?

Recommended answer

If the cache cannot hold the whole array, references to elements that are not currently cached result in cache misses. The way you access the array elements also makes a difference: on every miss, the processor brings a whole block of data (a cache line) into the cache on the expectation that neighboring data will be needed soon, which helps avoid future cache misses.
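The effect described in the question can be observed with a small timing experiment. The following is a minimal sketch, not the asker's original program: the helper names `sum_passes`, `ns_per_element` and `report` are hypothetical, and the size range (16 KiB to 64 MiB) is an assumption chosen to straddle typical cache sizes. The sizes at which the cost per element changes, and by how much, depend on the machine, the compiler flags and the hardware prefetcher.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum all n elements `passes` times; returning the total keeps the
   compiler from discarding the loop. */
double sum_passes(const double *a, size_t n, int passes)
{
    double total = 0.0;
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            total += a[i];
    return total;
}

/* Hypothetical helper: average CPU nanoseconds per element access for an
   array of n doubles. Returns a negative value if allocation fails. */
double ns_per_element(size_t n, int passes)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        a[i] = 1.0;

    clock_t start = clock();
    volatile double sink = sum_passes(a, n, passes);
    clock_t end = clock();
    (void) sink;

    free(a);
    double seconds = (double) (end - start) / CLOCKS_PER_SEC;
    return seconds * 1e9 / ((double) n * passes);
}

/* Print one line per array size, from 16 KiB up to 64 MiB (assumed range). */
void report(void)
{
    for (size_t bytes = 16 * 1024; bytes <= 64u * 1024 * 1024; bytes *= 4) {
        size_t n = bytes / sizeof(double);
        /* Fewer passes for larger arrays keeps the total work constant. */
        int passes = (int) ((256u * 1024 * 1024) / bytes);
        printf("%8zu KiB: %.3f ns/element\n",
               bytes / 1024, ns_per_element(n, passes));
    }
}
```

Calling `report()` from `main` prints the cost per element for each size. Because every size performs the same total number of additions, any difference between the lines comes from where the data has to be fetched from.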

Example:

If you operate on elements at consecutive locations, performance improves, because on the first cache miss the processor fetches a block of memory whose size is the cache line size, so the following accesses to neighboring elements hit in the cache.
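To make the point about consecutive locations concrete, here is a minimal sketch of two traversals that add up exactly the same elements in a different order. The function names are hypothetical, and the remark about 8 doubles per line assumes a 64-byte cache line, which is common but not universal.

```c
#include <stddef.h>

/* Visit elements in address order: every fetched cache line is used in
   full before the next one is needed. */
double sum_sequential(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Visit the same elements with a fixed stride: 0, stride, 2*stride, ...,
   then 1, 1 + stride, and so on. With a stride of 8 or more doubles
   (assuming 64-byte lines), consecutive accesses land on different
   cache lines. */
double sum_strided(const double *a, size_t n, size_t stride)
{
    double total = 0.0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            total += a[i];
    return total;
}
```

Both functions touch each element exactly once, so on an array much larger than the cache any timing difference between them is due to the access order alone.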

For example, take matrix multiplication, which is commonly written in the following way.

Assume: the matrices are too large to fit in the cache.

 /* Fill A and B (N x N, row-major) with random values. */
 for (i = 0; i < N; i++)
     for (j = 0; j < N; j++)
         A[i*N + j] = (double) random() / SOME_NUMBER;

 for (i = 0; i < N; i++)
     for (j = 0; j < N; j++)
         B[i*N + j] = (double) random() / SOME_NUMBER;

 /* C = A * B; this assumes C was zeroed beforehand. */
 for (i = 0; i < N; i++)
     for (j = 0; j < N; j++)
         for (k = 0; k < N; k++)
             C[i*N + j] = C[i*N + j] + A[i*N + k]*B[k*N + j];

Here, while computing each row of the result, we access the second matrix column by column. In row-major storage, the first column of B is stored in B[0], B[N], B[2N], and so on, which are not consecutive memory locations, so there will be a lot of cache misses. If we restructure the solution so that it works on consecutive memory locations, we can gain some performance. Instead of storing the second matrix in row-major form, we can store it in column-major form, i.e. transposed, so that all the elements of a column sit in consecutive locations.

A will be accessed row-wise and B column-wise, so each matrix has the elements it needs next in consecutive memory locations.

So when the processor experiences the first miss, it fetches a block of memory, and the next miss(es) are avoided. Here we have exploited spatial locality. Cache blocking (loop tiling) is a related but separate technique that splits the loops into tiles small enough to stay in the cache.
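The answer mentions cache blocking. In its usual sense that term means loop tiling, which goes a step further than transposing. The following is a minimal sketch of it, not part of the original answer: the function name `matmul_blocked` is hypothetical, and the tile edge `TILE` of 32 is an assumed value that would need tuning to the actual cache size.

```c
#include <stddef.h>

/* Assumed tile edge; a real value would be tuned to the cache size. */
#define TILE 32

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n row-major matrices, processed tile by tile so that
   the pieces of A, B and C being worked on stay small enough to be reused
   from the cache. The caller is expected to zero C first. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < min_size(ii + TILE, n); i++)
                    for (size_t k = kk; k < min_size(kk + TILE, n); k++) {
                        double a = A[i * n + k];
                        /* Innermost loop walks B and C with stride 1. */
                        for (size_t j = jj; j < min_size(jj + TILE, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The three outer loops pick a tile and the three inner loops do an ordinary multiplication restricted to it, so each loaded block is reused many times before it is evicted.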

Now, if we change the code as follows, there will be a significant improvement in performance.

Store B in transposed form:

   B[j*N + i] = (double) random() / SOME_NUMBER;

You will also have to access the transposed array in that order:

  C[i*N + j] = C[i*N + j] + A[i*N + k]*B[j*N + k];
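Putting the fragments together, the two variants can be written as complete functions and checked against each other. This is a sketch under stated assumptions: the function names and the explicit `transpose` helper are hypothetical, and a local accumulator replaces the repeated update of `C[i*N + j]`, which also removes the need to zero C beforehand.

```c
#include <stddef.h>

/* C = A * B, all n x n and row-major. The inner loop reads B with a
   stride of n elements, i.e. down a column. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Build Bt so that Bt[j*n + i] == B[i*n + j]. */
void transpose(size_t n, const double *B, double *Bt)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            Bt[j * n + i] = B[i * n + j];
}

/* C = A * B, given Bt (B transposed). The inner loop now reads both A and
   Bt at consecutive addresses. */
void matmul_transposed(size_t n, const double *A, const double *Bt, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
}
```

Both functions add the same products in the same order, so their results match exactly. If B is generated directly in transposed form, as in the fragment above, the separate `transpose` pass is not needed.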
