How does cache associativity impact performance


Problem description



I am reading "Pro .NET Benchmarking" by Andrey Akinshin, and one thing puzzles me (p. 536): the explanation of how cache associativity impacts performance. In a test, the author used three square arrays of ints, 1023x1023, 1024x1024, and 1025x1025, and observed that accessing the first column was slower in the 1024x1024 case.

The author explained (background info: the CPU is an Intel with a 32 KB L1 cache, which is 8-way associative):

When N=1024, this difference is exactly 4096 bytes; it equals the critical stride value. This means that all elements from the first column match the same eight cache lines of L1. We don’t really have performance benefits from the cache because we can’t use it efficiently: we have only 512 bytes (8 cache lines * 64-byte cache line size) instead of the original 32 kilobytes. When we iterate the first column in a loop, the corresponding elements pop each other from the cache. When N=1023 and N=1025, we don’t have problems with the critical stride anymore: all elements can be kept in the cache, which is much more efficient.

So it looks like the penalty comes from somehow shrinking the cache, just because main memory cannot be mapped onto the full cache.

It strikes me as odd; after reading the wiki page I would say the performance penalty comes from resolving address conflicts. Since each row can potentially be mapped into the same cache line, it is conflict after conflict, and the CPU has to resolve those, which takes time.

Thus my question: what is the real nature of the performance problem here? Is the accessible size of the cache effectively smaller, or is the entire cache available but the CPU spends more time resolving mapping conflicts? Or is there some other reason?
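
For concreteness, here is a minimal C sketch of the access pattern in question. The book's test is a .NET benchmark; this is only a rough equivalent, it assumes 4-byte ints and POSIX clock_gettime, and absolute timings will vary by CPU:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Walk the first column of an N x N int matrix repeatedly and report
     * the elapsed wall-clock time, so the three sizes can be compared. */
    static double walk_first_column(int n, int repeats)
    {
        int *a = calloc((size_t)n * n, sizeof *a);
        if (!a) { perror("calloc"); exit(1); }

        volatile long sum = 0;               /* keep the loads from being optimized away */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < repeats; r++)
            for (int i = 0; i < n; i++)
                sum += a[(size_t)i * n];     /* element (i, 0): stride of n ints */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sum;

        free(a);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        const int sizes[] = { 1023, 1024, 1025 };
        for (int k = 0; k < 3; k++)
            printf("N = %d: %.3f s\n", sizes[k], walk_first_column(sizes[k], 100000));
        return 0;
    }

On an 8-way L1 like the one described, the N = 1024 run is the one expected to be slow, since every load in the inner loop maps to the same set.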

Solution

Caching is a layer between two other layers. In your case, between the CPU and RAM. At its best, the CPU rarely has to wait for something to be fetched from RAM. At its worst, the CPU usually has to wait.

The 1024 example hits a bad case. For that entire column, all the words requested from RAM land in the same cell of the cache (or the same 2 cells, if using a 2-way associative cache, etc.).

Meanwhile, the CPU does not care -- it asks the cache for a word from memory; the cache either has it (fast access) or needs to reach into RAM (slow access) to get it. And RAM does not care -- it responds to requests, whenever they come.

Back to 1024. Look at the layout of that array in memory. The cells of a row occupy consecutive words of RAM; when one row ends, the next row starts. With a little thought, you can see that consecutive cells in a column have addresses differing by 1024 times the cell size (4 or 8 bytes, or whatever it happens to be). That is a power of 2.
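
A quick sketch to check that stride (assuming 4-byte ints and the row-major layout described above):

    #include <stdio.h>
    #include <stddef.h>
    #include <stdlib.h>

    int main(void)
    {
        enum { N = 1024 };
        int (*a)[N] = malloc(sizeof(int[N][N]));       /* a[i][j], row-major */
        if (!a) return 1;

        /* Vertically adjacent cells of one column are exactly one row apart. */
        ptrdiff_t stride = (char *)&a[1][0] - (char *)&a[0][0];
        printf("column stride = %td bytes\n", stride); /* prints 4096 = 1024 * sizeof(int) */

        free(a);
        return 0;
    }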

Now let's look at the relatively trivial architecture of a cache. (It is 'trivial' because it needs to be fast and easy to implement.) It simply takes several bits out of the address to form the address in the cache's "memory".

Because of the power of 2, those bits will always be the same -- hence the same slot is accessed. (I left out a few details, like how many bits are needed, hence the size of the cache, 2-way associativity, etc.)
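
As an illustration of "taking several bits out of the address", here is a simplified model of the set-index calculation for a cache shaped like the one in the question (32 KB, 8-way, 64-byte lines, hence 64 sets). It is not the exact circuitry of any particular CPU:

    #include <stdio.h>
    #include <stdint.h>

    /* 32768 / (8 * 64) = 64 sets, so the critical stride is
     * 64 sets * 64 bytes = 4096 bytes -- exactly one row of 1024 ints. */
    #define LINE_SIZE 64u
    #define NUM_SETS  64u

    static unsigned set_index(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_SIZE) % NUM_SETS);  /* bits 6..11 of the address */
    }

    int main(void)
    {
        uintptr_t base = 0x10000000u;                      /* hypothetical array base */
        for (int i = 0; i < 8; i++) {
            uintptr_t elem = base + (uintptr_t)i * 1024 * sizeof(int); /* &a[i][0], N = 1024 */
            printf("row %d -> set %u\n", i, set_index(elem));          /* always the same set */
        }
        return 0;
    }

With N = 1023 the stride is 4092 bytes instead of 4096, so successive rows fall into different sets and the column spreads across the whole cache.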

A cache is useful when the process above it (CPU) fetches an item (word) more than once before that item gets bumped out of cache by some other item needing the space.

Note: This is talking about the CPU->RAM cache, not disk controller caching, database caches, web site page caches, etc, etc; they use more sophisticated algorithms (often hashing) instead of "picking a few bits out of an address".

Back to your Question...

So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.

There are conceptual problems with that quote.

  • Main memory is not "mapped to a cache"; see virtual versus real addresses.
  • The penalty comes when the cache does not have the desired word.
  • "shrinking the cache" -- The cache is a fixed size, based on the hardware involved.

Definition: In this context, a "word" is a consecutive string of bytes from RAM. It is always(?) a power-of-2 number of bytes and positioned at some multiple of that size in the real address space. The "word" used for caching depends on the vintage of the CPU, which level of cache, etc. 4-, 8-, 16-byte words can probably be found today. Again, the power-of-2 size and positioning at a multiple of it are simple optimizations.
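
On Linux with glibc, the relevant sizes can be queried at run time through a non-portable sysconf extension; treat this as a sketch, since other platforms need other APIs and some systems simply report 0:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* glibc extension; may return 0 or -1 where the values are not reported. */
        long size  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
        long assoc = sysconf(_SC_LEVEL1_DCACHE_ASSOC);
        long line  = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        printf("L1d: %ld bytes, %ld-way, %ld-byte lines\n", size, assoc, line);
        return 0;
    }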

Back to your 1K*1K array of, say, 4-byte numbers. That adds up to 4MB, plus or minus (for 1023, 1025). If you have 8MB of cache, the entire array will eventually get loaded, and further actions on the array will be faster due to being in the cache. But if you have, say, 1MB of cache, some of the array will get in the cache, then be bumped out -- repeatedly. It might not be much better than if you had no cache.
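
Spelling out that footprint arithmetic (using the sizes from the question and from this paragraph):

    #include <stdio.h>

    int main(void)
    {
        long footprint = 1024L * 1024L * 4L;                 /* 1024 x 1024 ints = 4 MiB */
        long caches[]  = { 8L << 20, 1L << 20, 32L << 10 };  /* 8 MiB, 1 MiB, 32 KiB */
        const char *names[] = { "8 MiB", "1 MiB", "32 KiB L1" };
        for (int i = 0; i < 3; i++)
            printf("fits in %-9s cache: %s\n", names[i],
                   footprint <= caches[i] ? "yes" : "no");
        return 0;
    }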
