Cache thrashing, general help in understanding


Question

I am trying to understand cache thrashing. Is the following text correct?

Take the code below as an example.

long max = 1024*1024;
long a(max), b(max), c(max), d(max), e(max); 
for(i = 1; i < max; i++) 
    a(i) = b(i)*c(i) + d(i)*e(i);

The ARM Cortex A9 is four-way set associative, each cache line is 32 bytes, and the total cache size is 32 KB, so there are 1024 cache lines in total. In order to carry out the above calculation, one cache line must be displaced. When a(i) is to be calculated, b(i) will be thrown out. Then, as the loop iterates, b(i) is needed again and so another vector is displaced. In the example above, there is no cache reuse.

To solve this problem, you can introduce padding between the vectors in order to space out their starting addresses. Ideally, each pad should be at least the size of a full cache line.

The problem above can be solved like this:

long a(max), pad1(256), b(max), pad2(256), c(max), pad3(256), d(max), pad4(256), e(max) 

For multidimensional arrays, it is enough to make the leading dimension an odd number.

Any help on whether the above is true, or where I have made an error, would be appreciated.

Thanks.

Recommended answer

Each vector needs 8 MB of memory (1024 * 1024 * 8 B, assuming 8 B for long). So if these vectors are contiguously allocated, then a(i), b(i), c(i), d(i) and e(i) will all map to the same cache set (though not always to the same cache line, since the cache is set associative). Nevertheless, only as many of them as there are ways can sit in that set at a time: four with the four-way cache described in the question. So bringing in the cache lines for all five vectors in one iteration always evicts at least one line that the loop will need again shortly.

If you are sure that these vectors are contiguously allocated, you can just pad them with one cache line, i.e. 32 B. That will do the trick: a(i), b(i), c(i), d(i) and e(i) will then fall on consecutive cache sets. Each cache line then serves four consecutive elements of its vector before the loop moves on to the next line, because each cache line holds four long variables (a(0), a(1), a(2), a(3) will be on the same cache line, as will a(4), a(5), a(6), a(7)).

So you would declare the vectors as:

long a(max),pad1(32),b(max),pad2(32),c(max),pad3(32),d(max),pad4(32),e(max)

For related discussion, you can follow this link:

why-is-one-loop-so-much-slower-than-two-loops
