Why do L1 and L2 Cache waste space saving the same data?


Question


I don't know why L1 Cache and L2 Cache save the same data.

For example, let's say we want to access Memory[x] for the first time. Memory[x] is brought into the L2 cache first, then the same data is copied into the L1 cache, which is where the CPU registers actually load it from.

But we have duplicated data stored on both L1 and L2 cache, isn't it a problem or at least a waste of storage space?

Solution

I edited your question to ask about why CPUs waste cache space storing the same data in multiple levels of cache, because I think that's what you're asking.

Not all caches are like that. The Cache Inclusion Policy for an outer cache can be Inclusive, Exclusive, or Not-Inclusive / Not-Exclusive.
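
As a rough mental model (not any real CPU's bookkeeping), the three policies can be stated as invariants over the set of lines each level currently holds. A minimal Python sketch, with made-up l1 / l2 sets of line addresses:

```python
# Minimal sketch: the three inclusion policies as invariants over the sets of
# line addresses each level holds (hypothetical model, not a real CPU).

def is_inclusive(l1: set, l2: set) -> bool:
    # Inclusive: everything in L1 is guaranteed to also be in L2.
    return l1 <= l2

def is_exclusive(l1: set, l2: set) -> bool:
    # Exclusive: a line is in L1 or L2 but never both, so capacities add.
    return not (l1 & l2)

# NINE = not-inclusive, not-exclusive: neither invariant is enforced, but
# fetch-through fills still leave L2 holding copies of most L1 lines.

l1 = {0x1000, 0x2000}
l2 = {0x1000, 0x3000}   # 0x1000 duplicated, 0x2000 lives only in L1
print(is_inclusive(l1, l2), is_exclusive(l1, l2))   # False False -> NINE
```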

NINE is the "normal" case, not maintaining either special property, but L2 does tend to have copies of most lines in L1 for the reason you describe in the question. If L2 is less associative than L1 (like in Skylake-client) and the access pattern creates a lot of conflict misses in L2 (unlikely), you could get a decent amount of data that's only in L1. And maybe in other ways, e.g. via hardware prefetch, or from L2 evictions of data due to code-fetch, because real CPUs use split L1i / L1d caches.


For the outer caches to be useful, you need some way for data to enter them so you can get an L2 hit sometime after the line was evicted from the smaller L1. Having inner caches like L1d fetch through outer caches gives you that for free, and has some advantages. You can put hardware prefetch logic in an outer or middle level of cache, which doesn't have to be as high-performance as L1. (e.g. Intel CPUs have most of their prefetch logic in the private per-core L2, but also some prefetch logic in L1d).
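
As a toy illustration of why an outer level is a convenient place for that logic, here's a hypothetical next-line prefetcher hanging off a tiny L2 model (the class and constants are invented for the sketch; real prefetchers such as Intel's L2 streamer are far more elaborate):

```python
# Toy model (invented): a next-line prefetcher attached to L2. It only sees
# L2 traffic, so it doesn't need to keep pace with L1d's hit bandwidth.

LINE = 64   # assumed line size in bytes

class ToyL2:
    def __init__(self):
        self.lines = set()

    def access(self, addr: int) -> bool:
        line = addr // LINE
        hit = line in self.lines
        if not hit:
            self.lines.add(line)      # demand fill from L3 / memory
        self.lines.add(line + 1)      # next-line prefetch, triggered at L2
        return hit

l2 = ToyL2()
l2.access(0x000)          # cold miss, but the next line is prefetched
print(l2.access(0x040))   # True: the prefetch turned this into an L2 hit
```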

The other main option is for the outer cache to be a victim cache, i.e. lines enter it only when they're evicted from L1. So you can loop over an array of L1 + L2 size and probably still get L2 hits. The extra logic to implement this is useful if you want a relatively large L1 compared to L2, so the total size is more than a little larger than L2 alone.
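
Here's a small sketch of that victim-cache fill path, using a made-up fully-associative LRU model (the 4-line and 8-line sizes are arbitrary): L2 is filled only by L1 evictions, so a loop over 12 distinct lines still hits everywhere on the second pass.

```python
# Hypothetical fully-associative LRU model: L2 as a victim cache that is
# filled only by L1 evictions, so L1 + L2 behave like one 12-line cache.

from collections import OrderedDict

class LRUCache:
    def __init__(self, nlines: int):
        self.nlines, self.lines = nlines, OrderedDict()
    def lookup(self, line) -> bool:
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        return False
    def insert(self, line):
        self.lines[line] = True
        if len(self.lines) > self.nlines:
            return self.lines.popitem(last=False)[0]   # LRU victim
        return None

l1, l2 = LRUCache(4), LRUCache(8)

def access(line) -> str:
    if l1.lookup(line):
        return "L1 hit"
    if l2.lookup(line):
        result = "L2 hit"
        del l2.lines[line]        # line moves up into L1, leaving L2
    else:
        result = "miss"           # filled straight from memory into L1
    victim = l1.insert(line)
    if victim is not None:
        l2.insert(victim)         # victim cache: only L1 evictions enter L2
    return result

for line in range(12):            # first pass: warm up 12 = 4 + 8 lines
    access(line)
print([access(line) for line in range(12)])   # second pass: no misses
```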

With an exclusive L2, an L1 miss / L2 hit can just exchange lines between L1d and L2 if L1d needs to evict something from that set.
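
A minimal sketch of that exchange, assuming a strictly exclusive L1d/L2 pair and ignoring set indexing and dirty-bit handling (the names are invented):

```python
# Minimal sketch (invented model): on an L1d miss that hits in an exclusive
# L2, the requested line and the L1d victim simply trade places.

def l1_miss_l2_hit(l1: set, l2: set, wanted: int, l1_victim: int) -> None:
    l2.discard(wanted)       # line leaves L2 ...
    l1.add(wanted)           # ... and moves up into L1d
    l1.discard(l1_victim)    # the line L1d had to evict ...
    l2.add(l1_victim)        # ... drops down into L2: a swap, no duplication

l1, l2 = {0xA, 0xB}, {0xC, 0xD}
l1_miss_l2_hit(l1, l2, wanted=0xC, l1_victim=0xA)
print(l1, l2)    # {0xB, 0xC} and {0xD, 0xA}: still disjoint, capacities add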

Some CPUs do in fact use an L2 that's exclusive of L1d (e.g. AMD K10 / Barcelona). Both of those caches are private per-core caches, not shared, so it's like the simple L1 / L2 situation for a single core CPU you're talking about.


Things get more complicated with multi-core CPUs and shared caches!

Barcelona's shared L3 cache is also mostly exclusive of the inner caches, but not strictly. David Kanter explains:

First, it is mostly exclusive, but not entirely so. When a line is sent from the L3 cache to an L1D cache, if the cache line is shared, or is likely to be shared, then it will remain in the L3 – leading to duplication which would never happen in a totally exclusive hierarchy. A fetched cache line is likely to be shared if it contains code, or if the data has been previously shared (sharing history is tracked). Second, the eviction policy for the L3 has been changed. In the K8, when a cache line is brought in from memory, a pseudo-least recently used algorithm would evict the oldest line in the cache. However, in Barcelona’s L3, the replacement algorithm has been changed to also take into account sharing, and it prefers evicting unshared lines.
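
A rough sketch of that share-aware victim selection (structure invented, not AMD's actual implementation): walk the set from the least-recently-used end and prefer the first unshared line, falling back to plain LRU if every candidate is shared.

```python
# Rough sketch (invented, not AMD's RTL): prefer evicting an unshared line;
# fall back to the plain (pseudo-)LRU victim if everything is shared.

def pick_l3_victim(lru_order: list, shared: set):
    """lru_order: candidate lines, least-recently-used first.
    shared: lines marked shared / likely-to-be-shared (e.g. code, history)."""
    for line in lru_order:
        if line not in shared:
            return line          # oldest line nobody else is believed to want
    return lru_order[0]          # all shared: evict the plain LRU victim

print(hex(pick_l3_victim([0x100, 0x200, 0x300], shared={0x100})))   # 0x200
```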

AMD's successor to K10/Barcelona is Bulldozer. https://www.realworldtech.com/bulldozer/3/ points out that Bulldozer's shared L3 is also a victim cache, and thus mostly exclusive of L2. It's probably like Barcelona's L3.

But Bulldozer's L1d is a small write-through cache with an even smaller (4k) write-combining buffer, so it's mostly inclusive of L2. Bulldozer's write-through L1d is generally considered a mistake in the CPU design world, and Ryzen went back to a normal 32kiB write-back L1d like Intel has been using all along (with great results). A pair of weak integer cores form a "cluster" that shares an FPU/SIMD unit, and shares a big L2 that's "mostly inclusive". (i.e. probably a standard NINE). This cluster thing is Bulldozer's alternative to SMT / Hyperthreading, which AMD also ditched for Ryzen in favour of normal SMT with a massively wide out-of-order core.

Ryzen also has some exclusivity between core clusters (CCX), apparently, but I haven't looked into the details.


I've been talking about AMD first because they have used exclusive caches in recent designs, and seem to have a preference for victim caches. Intel hasn't tried as many different things, because they hit on a good design with Nehalem and stuck with it until Skylake-AVX512.

Intel Nehalem and later use a large shared tag-inclusive L3 cache. For lines that are modified / exclusive (MESI) in a private per-core L1d or L2 (NINE) cache, the L3 tags still indicate which cores (might) have a copy of a line, so requests from one core for exclusive access to a line don't have to be broadcast to all cores, only to cores that might still have it cached. (i.e. it's a snoop filter for coherency traffic, which lets CPUs scale up to dozens of cores per chip without flooding each other with requests when they're not even sharing memory.)

i.e. L3 tags hold info about where a line is (or might be) cached in an L2 or L1 somewhere, so it knows where to send invalidation messages instead of broadcasting messages from every core to all other cores.
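
A toy sketch of that snoop-filter role (the data structures and function names are invented, not Intel's tag format): the inclusive L3 tags keep, per line, a set of cores that might hold a copy, so a request for exclusive ownership only probes those cores.

```python
# Toy sketch (invented structures): inclusive L3 tags as a snoop filter,
# tracking which cores *might* hold each line.

presence = {}        # line address -> set of core ids that may have a copy

def on_fill(core: int, line: int) -> None:
    presence.setdefault(line, set()).add(core)

def send_invalidate(core: int, line: int) -> None:
    print(f"invalidate {line:#x} in core {core}")   # stand-in for a real probe

def request_exclusive(core: int, line: int) -> None:
    # Probe only the cores that might still cache the line, instead of
    # broadcasting the invalidation to every core on the chip.
    for other in presence.get(line, set()) - {core}:
        send_invalidate(other, line)
    presence[line] = {core}      # now this core is the only possible holder

on_fill(0, 0x1000); on_fill(3, 0x1000)
request_exclusive(0, 0x1000)     # probes core 3 only, not cores 1, 2, ...
```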

With Skylake-X (Skylake-server / SKX / SKL-SP), Intel dropped that and made L3 NINE and only a bit bigger than the total per-core L2 size. But there's still a snoop filter, it just doesn't have data. I don't know what Intel's planning to do for future (dual?)/quad/hex-core laptop / desktop chips (e.g. Cannonlake / Icelake). That's small enough that their classic ring bus would still be great, so they could keep doing that in mobile/desktop parts and only use a mesh in high-end / server parts, like they are in Skylake.


Realworldtech forum discussions of inclusive vs. exclusive vs. non-inclusive:

CPU architecture experts spend time discussing what makes for a good design on that forum. While searching for stuff about exclusive caches, I found this thread, where some disadvantages of strictly inclusive last-level caches are presented. e.g. they force private per-core L2 caches to be small (otherwise you waste too much space with duplication between L3 and L2).

Also, L2 caches filter requests to L3, so when its LRU algorithm needs to drop a line, the one it's seen least-recently can easily be one that stays permanently hot in L2 / L1 of a core. But when an inclusive L3 decides to drop a line, it has to evict it from all inner caches that have it, too!
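
A small sketch of that back-invalidation cost (hypothetical model): when the inclusive L3 drops a line, it has to yank it out of every inner cache too, even if some core's L1/L2 was hitting on it constantly and those hits never reached L3 to refresh its replacement state.

```python
# Hypothetical model: evicting from an inclusive L3 forces back-invalidation
# of the same line in every inner (L1/L2) cache that still holds it.

l3 = {0x40, 0x80}
inner = {0: {"l1": {0x40}, "l2": {0x40}},   # core 0 hits 0x40 in L1/L2,
         1: {"l1": set(),  "l2": {0x80}}}   # so L3 never sees those hits

def inclusive_l3_evict(line: int) -> None:
    l3.discard(line)
    for caches in inner.values():
        caches["l1"].discard(line)          # forced back-invalidation
        caches["l2"].discard(line)

inclusive_l3_evict(0x40)   # L3's LRU thinks 0x40 is cold; core 0 loses it anyway
print(inner[0])            # {'l1': set(), 'l2': set()}
```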

David Kanter replied with an interesting list of advantages for inclusive outer caches. I think he's comparing to exclusive caches, rather than to NINE. e.g. his point about data sharing being easier only applies vs. exclusive caches, where I think he's suggesting that a strictly exclusive cache hierarchy might cause evictions when multiple cores want the same line even in a shared/read-only manner.
