Which cache mapping technique is used in the Intel Core i7 processor?


Problem description

I have learned about different cache mapping techniques like direct mapping and fully associative or set-associative mapping, and the trade-offs between them. (Wikipedia)

But I am curious which one is used in Intel Core i7 or AMD processors nowadays?

How have the techniques evolved? And what are the things that need to be improved?

Recommended answer

Direct-mapped caches are basically never used in modern high-performance CPUs. The power savings are outweighed by the large advantage in hit rate for a set-associative cache of the same size, with only a bit more complexity in the control logic. Transistor budgets are very large these days.

It's very common for software to have at least a couple of arrays that are a multiple of 4 KiB apart from each other, which would create conflict misses in a direct-mapped cache. (Tuning code with more than a couple of arrays can involve skewing them to reduce conflict misses, if a loop needs to iterate through all of them at once.)
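
To make the conflict concrete, here is a minimal C sketch, assuming a hypothetical 4 KiB direct-mapped cache with 64-byte lines (real L1 caches are larger and set-associative); it just computes the set index for two arrays placed 4 KiB apart and shows that every corresponding element lands in the same set:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical direct-mapped cache: 4 KiB total, 64-byte lines -> 64 sets.
     * With only one way per set, two addresses that share an index evict each
     * other on every alternating access. */
    #define LINE_SIZE  64u
    #define CACHE_SIZE (4u * 1024u)
    #define NUM_SETS   (CACHE_SIZE / LINE_SIZE)   /* 64 sets */

    static unsigned set_index(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
    }

    int main(void)
    {
        /* Two arrays whose base addresses are a multiple of 4 KiB apart
         * (illustrative addresses only). */
        uintptr_t a = 0x10000000;
        uintptr_t b = a + 4096;

        for (uintptr_t off = 0; off < 256; off += LINE_SIZE) {
            /* a and b always land in the same set: guaranteed conflict misses
             * in a direct-mapped cache, but the two lines could coexist in any
             * 2-way (or higher) set-associative cache. */
            printf("offset %3lu: set(a)=%2u set(b)=%2u\n",
                   (unsigned long)off, set_index(a + off), set_index(b + off));
        }
        return 0;
    }

Even 2-way associativity would let the two conflicting lines coexist in the same set, which is why the hit-rate advantage mentioned above is so large for this kind of access pattern.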

Modern CPUs are so fast that DRAM latency is over 200 core clock cycles, which is too big even for powerful out-of-order execution CPUs to hide very well on a cache miss.
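
A common way to see this directly is a dependent pointer chase in random order, which defeats out-of-order execution and hardware prefetching so each load's full latency is exposed. The sketch below is a rough, assumption-laden microbenchmark, not a precise tool: buffer sizes, timer resolution, and TLB effects all change the numbers.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Rough latency probe: chase pointers through one cache-line-sized node per
     * 64 bytes, linked in random order.  Working sets bigger than the last-level
     * cache approach DRAM latency (hundreds of core clock cycles per load). */
    #define LINE 64
    struct node { struct node *next; char pad[LINE - sizeof(void *)]; };

    static double chase_ns(size_t bytes, long steps)
    {
        size_t n = bytes / sizeof(struct node);
        struct node *nodes = aligned_alloc(LINE, n * sizeof(struct node));
        size_t *order = malloc(n * sizeof(size_t));

        for (size_t i = 0; i < n; i++) order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {           /* random permutation */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i < n; i++)                 /* link into one cycle */
            nodes[order[i]].next = &nodes[order[(i + 1) % n]];

        struct node *p = &nodes[order[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long s = 0; s < steps; s++) p = p->next;  /* serialized loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (p == NULL) puts("unreachable");            /* keep the chain live */

        free(order); free(nodes);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
    }

    int main(void)
    {
        printf("32 KiB  (~L1):   %.1f ns per load\n", chase_ns(32u << 10, 20000000));
        printf("256 KiB (~L2):   %.1f ns per load\n", chase_ns(256u << 10, 20000000));
        printf("64 MiB  (~DRAM): %.1f ns per load\n", chase_ns(64u << 20, 20000000));
        return 0;
    }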

Multi-level caches are essential (and used in all high-performance CPUs) to give the low latency (~4 cycles) / high throughput for the hottest data (e.g. up to 2 loads and 1 store per clock, with a 128-, 256- or even 512-bit path between the L1D cache and the vector load/store execution units), while still being large enough to cache a reasonably sized working set. It's physically impossible to build one very large / very fast / highly-associative cache that performs as well as current multi-level caches for typical workloads; speed-of-light delays when data has to physically travel far are a problem. The power cost would be prohibitive as well. (In fact, power / power density is a major limiting factor for modern CPUs; see Modern Microprocessors: A 90-Minute Guide!.)

All levels of cache (except the uop cache) are physically indexed / physically tagged in all the x86 CPUs I'm aware of. L1D caches in most designs take their index bits from below the page offset, and thus are also VIPT, allowing the TLB lookup to happen in parallel with the tag fetch, but without any aliasing problems. Thus, caches don't need to be flushed on context switches or anything. (See this answer for more about multi-level caches in general, the VIPT speed trick, and some cache parameters of some actual x86 CPUs.)
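
As a concrete example (parameters assumed, but typical of many recent Intel cores: 32 KiB, 8-way, 64-byte lines, 4 KiB pages), the 6 index bits plus 6 line-offset bits all fall inside the 12-bit page offset, which is identical in the virtual and physical address:

    #include <stdio.h>

    static int log2i(int x) { int b = 0; while ((1 << b) < x) b++; return b; }

    int main(void)
    {
        /* Assumed L1D geometry, typical of many recent Intel cores: */
        const int cache_size = 32 * 1024;   /* 32 KiB       */
        const int ways       = 8;
        const int line_size  = 64;
        const int page_size  = 4096;        /* 4 KiB pages  */

        int sets        = cache_size / (ways * line_size);   /* 64  */
        int offset_bits = log2i(line_size);                  /* 6   */
        int index_bits  = log2i(sets);                        /* 6   */
        int page_bits   = log2i(page_size);                   /* 12  */

        printf("sets=%d, line-offset bits [%d:0], index bits [%d:%d]\n",
               sets, offset_bits - 1, offset_bits + index_bits - 1, offset_bits);

        /* Bits [11:0] are the page offset and are the same in the virtual and
         * physical address.  Because index + line offset only need bits [11:0],
         * the set can be chosen from the virtual address while the TLB
         * translates the upper bits for the tag compare: VIPT with no
         * aliasing, behaving like PIPT. */
        printf("index %s inside the %d-bit page offset\n",
               (offset_bits + index_bits <= page_bits) ? "fits" : "does NOT fit",
               page_bits);
        return 0;
    }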

The private (per-core) L1D / L1I and L2 caches are traditional set-associative caches, often 8-way or 4-way for the small/fast caches. Cache line size is 64 bytes on all modern x86 CPUs. The data caches are write-back. (Except on the AMD Bulldozer family, where L1D is write-through with a small 4 KiB write-combining buffer.)
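
If you want to check these parameters on your own machine, CPUID leaf 4 (Intel's "deterministic cache parameters"; AMD exposes the same layout at leaf 0x8000001D) reports ways, sets, and line size per cache level. A small sketch using GCC/Clang's <cpuid.h> wrapper:

    #include <stdio.h>
    #include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

    int main(void)
    {
        /* CPUID leaf 4 enumerates one cache per subleaf until the type field
         * reads 0 (Intel; AMD uses leaf 0x8000001D with the same layout). */
        for (unsigned sub = 0; ; sub++) {
            unsigned eax, ebx, ecx, edx;
            if (!__get_cpuid_count(4, sub, &eax, &ebx, &ecx, &edx))
                break;
            unsigned type = eax & 0x1F;          /* 1=data, 2=instr, 3=unified */
            if (type == 0)
                break;
            unsigned level     = (eax >> 5) & 0x7;
            unsigned line_size = (ebx & 0xFFF) + 1;
            unsigned parts     = ((ebx >> 12) & 0x3FF) + 1;
            unsigned ways      = ((ebx >> 22) & 0x3FF) + 1;
            unsigned sets      = ecx + 1;
            printf("L%u %-7s: %u-way, %u sets, %u B lines, %u KiB\n",
                   level,
                   type == 1 ? "data" : type == 2 ? "instr" : "unified",
                   ways, sets, line_size,
                   ways * parts * line_size * sets / 1024);
        }
        return 0;
    }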

http://www.7-cpu.com/ has good cache organization / latency numbers, plus bandwidth and TLB organization / performance numbers, for various microarchitectures, including many x86 ones, like Haswell.

L0"英特尔 Sandybridge 系列中的解码 uop 缓存是集合关联的,并且是虚拟寻址的.多达 6 个 uop 的多达 3 个块可以缓存来自 32 字节机器代码块中的指令的解码结果.相关:涉及微循环的分支对齐- Intel SnB 系列 CPU 上的编码指令.(uop 缓存是 x86 的一大进步:x86 指令长度可变,难以快速/并行解码,因此缓存内部解码结果以及机器码(L1I$)具有显着的功率和吞吐量优势.强大的仍然需要解码器,因为 uop 缓存并不大;它在循环(包括中到大循环)中最有效.这避免了 Pentium4 错误(或当时基于传输器大小的限制)具有弱解码器和依赖跟踪缓存.)

The "L0" decoded-uop cache in Intel Sandybridge-family is set-associative and virtually addressed. Up to 3 blocks of up to 6 uops can cache decode results from instructions in a 32-byte block of machine code. Related: Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs. (A uop cache is a big advance for x86: x86 instructions are variable-length and hard to decode fast / in parallel, so caching the internal decode results as well as the machine code (L1I$) has significant power and throughput advantages. Powerful decoders are still needed, because the uop cache isn't large; it's most effective in loops (including medium to large loops). This avoids the Pentium4 mistake (or limitation based on transitor size at the time) of having weak decoders and relying on the trace cache.)

Modern Intel (and, I assume, AMD) L3 aka LLC aka last-level caches use an indexing function that isn't just a range of address bits: it's a hash function that distributes things better, to reduce collisions from fixed strides. According to Intel my cache should be 24-way associative though its 12-way, how is that?.
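
Intel does not document the actual L3 hash, so the following is only an illustrative sketch of the idea: folding higher address bits into the index (here with XOR) spreads out accesses whose addresses differ by a large power-of-two stride, which a plain bit-range index would pile into the same set.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustration only: Intel's real L3 slice/set hash is undocumented. */
    #define LINE_SIZE 64u
    #define NUM_SETS  2048u            /* hypothetical number of L3 sets */

    static unsigned plain_index(uint64_t addr)
    {
        return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
    }

    static unsigned hashed_index(uint64_t addr)
    {
        uint64_t line = addr / LINE_SIZE;
        /* Fold two higher bit-fields into the low index bits. */
        return (unsigned)((line ^ (line >> 11) ^ (line >> 23)) % NUM_SETS);
    }

    int main(void)
    {
        /* Walk addresses 128 KiB apart (a fixed power-of-two stride):
         * the plain index maps them all to the same set, the hashed
         * index spreads them out. */
        for (int i = 0; i < 8; i++) {
            uint64_t addr = (uint64_t)i * 128 * 1024;
            printf("addr %5llu KiB: plain set %4u, hashed set %4u\n",
                   (unsigned long long)(addr / 1024),
                   plain_index(addr), hashed_index(addr));
        }
        return 0;
    }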

From Nehalem onwards, Intel has used a large inclusive shared L3 cache, which filters coherency traffic between cores. I.e. when one core reads data that is in Modified state in the L1d of another core, the L3 tags say which core, so an RFO (Read For Ownership) can be sent only to that core, instead of being broadcast. How are the modern Intel CPU L3 caches organized?. The inclusive property is important, because it means no private L2 or L1 cache can have a copy of a cache line without L3 knowing about it. If a line is in Exclusive or Modified state in a private cache, L3 will have Invalid data for that line, but the tags will still say which core might have a copy. Cores that definitely don't have a copy don't need to be sent a message about it, saving power and bandwidth on the internal links between the cores and L3. See Why On-Chip Cache Coherence Is Here to Stay for more details about on-chip cache coherency in Intel "i7" (i.e. Nehalem and the Sandybridge family, which are different architectures but use the same cache hierarchy).

Core2Duo had a shared last-level cache (L2), but was slow at generating RFO (Read-For-Ownership) requests on L2 misses. So bandwidth between cores with a small buffer that fits in L1d is as slow as with a large buffer that doesn't fit in L2 (i.e. DRAM speed). There's a fast range of sizes where the buffer fits in L2 but not L1d, because the writing core evicts its own data to L2, where the other core's loads can hit without generating an RFO request. (See Figure 3.27: Core 2 Bandwidth with 2 Threads in Ulrich Drepper's "What Every Programmer Should Know About Memory". Full version here.)

Skylake-AVX512 has a larger per-core L2 (1 MiB instead of 256 kiB), and smaller L3 (LLC) slices per core. It's no longer inclusive. It uses a mesh network instead of a ring bus to connect cores to each other. See this AnandTech article (but it has some inaccuracies in the microarchitectural details on other pages; see the comment I left).

From the Intel® Xeon® Processor Scalable Family Technical Overview:

Due to the non-inclusive nature of LLC, the absence of a cache line in LLC does not indicate that the line is not present in private caches of any of the cores. Therefore, a snoop filter is used to keep track of the location of cache lines in the L1 or MLC of cores when it is not allocated in the LLC. On the previous-generation CPUs, the shared LLC itself took care of this task.

This "snoop filter" is only useful if it can't have false negatives. It's OK to send an invalidate or RFO (MESI) to a core that doesn't have a copy of a line; it's not OK to let a core keep a copy of a line when another core is requesting exclusive access to it. So it may be a tag-inclusive tracker that knows which cores might have copies of which line, but doesn't cache any data.
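
One way to picture such a tag-inclusive tracker is a directory entry holding a tag plus a per-core presence bit-vector and no data. This toy sketch (not Intel's actual structure) shows how an RFO then only needs to snoop the cores whose bits are set, and why false positives are harmless while false negatives are not:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Toy directory entry: tracks which cores *might* hold a line, but stores
     * no data.  A false positive (a bit left set for a core that silently
     * evicted the line) only costs an unnecessary snoop; a false negative
     * would break coherency, so a core may never gain a copy without its bit
     * being set first. */
    #define NUM_CORES 8

    struct dir_entry {
        uint64_t line_addr;      /* which cache line this entry tracks    */
        uint8_t  presence;       /* bit i set => core i may have a copy   */
        bool     valid;
    };

    /* On a read-for-ownership from `requester`, snoop only the cores whose
     * presence bit is set, instead of broadcasting to all of them. */
    static void handle_rfo(struct dir_entry *e, int requester)
    {
        for (int c = 0; c < NUM_CORES; c++) {
            if (c != requester && (e->presence & (1u << c)))
                printf("  snoop/invalidate core %d\n", c);
        }
        e->presence = (uint8_t)(1u << requester);  /* now exclusive to requester */
    }

    int main(void)
    {
        /* presence 0x06: cores 1 and 2 may have copies */
        struct dir_entry e = { .line_addr = 0x1234, .presence = 0x06, .valid = true };
        printf("core 0 requests ownership of line 0x%llx:\n",
               (unsigned long long)e.line_addr);
        handle_rfo(&e, 0);       /* only cores 1 and 2 get snooped */
        return 0;
    }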

Or maybe the snoop filter can still be useful without being strictly inclusive of all L2 / L1 tags. I'm not an expert on multi-core / multi-socket snoop protocols. I think the same snoop filter may also help filter snoop requests between sockets. (In Broadwell and earlier, only quad-socket and higher Xeons have a snoop filter for inter-core traffic; dual-socket-only Broadwell Xeons and earlier don't filter snoop requests between the two sockets.)

AMD Ryzen uses separate L3 caches for clusters of cores, so data shared across many cores has to be duplicated in the L3 of each cluster. Also importantly, writes from a core in one cluster take longer to become visible to a core in another cluster, because the coherency requests have to go over the interconnect between clusters. (Similar to between sockets in a multi-socket Intel system, where each CPU package has its own L3.)

So this gives us NUCA (Non-Uniform Cache Access), analogous to the usual NUMA (Non-Uniform Memory Access) that you get in a multi-socket system where each processor has a built-in memory controller, and accessing local memory is faster than accessing memory attached to another socket.

Recent Intel multi-socket systems have configurable snoop modes, so in theory you can tune the NUMA mechanism to work best for the workload you're running. See Intel's page about Broadwell-Xeon for a table and description of the available snoop modes.

Another advance / evolution is an adaptive replacement policy in the L3 on IvyBridge and later. This can reduce pollution when some data has temporal locality but other parts of the working set are much larger. (I.e. looping over a giant array with standard LRU replacement would evict everything, leaving the L3 caching only data from the array that won't be touched again soon. Adaptive replacement tries to mitigate that problem.)
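
The kind of access pattern this targets looks like the following sketch (sizes are assumptions for illustration): a small hot buffer with temporal locality interleaved with a streaming pass over an array far bigger than L3. Under strict LRU the stream evicts the hot buffer every iteration; an adaptive policy can insert the streaming lines at low priority so the hot buffer stays resident.

    #include <stdio.h>
    #include <stdlib.h>

    /* Access pattern that punishes pure LRU: a small "hot" buffer reused every
     * iteration, interleaved with a streaming pass over a huge array that is
     * touched once per line and not again soon. */
    #define HOT_BYTES    (64u * 1024u)           /* fits easily in L3        */
    #define STREAM_BYTES (256u * 1024u * 1024u)  /* much larger than any L3  */

    int main(void)
    {
        unsigned char *hot    = calloc(HOT_BYTES, 1);
        unsigned char *stream = calloc(STREAM_BYTES, 1);
        unsigned long long sum = 0;

        for (int iter = 0; iter < 10; iter++) {
            for (size_t i = 0; i < HOT_BYTES; i++)        /* reused: temporal locality */
                sum += hot[i];
            for (size_t i = 0; i < STREAM_BYTES; i += 64) /* one-shot streaming pass   */
                sum += stream[i];
        }
        printf("checksum: %llu\n", sum);
        free(hot); free(stream);
        return 0;
    }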

Further reading:

  • What Every Programmer Should Know About Memory?
  • Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? (Single-threaded memory bandwidth on many-core Xeon CPUs is limited by max_concurrency / latency, not DRAM bandwidth).
  • http://users.atw.hu/instlatx64/ for memory-performance timing results
  • http://www.7-cpu.com/ for cache / TLB organization and latency numbers.
  • http://agner.org/optimize/ for microarchitectural details (mostly about the execution pipeline, not memory), and asm / C++ optimization guides.
  • Stack Overflow's x86 tag wiki has a performance section, with links to those and more.
