Is the TLB shared between multiple cores?


Question

I've heard that the TLB is maintained by the MMU, not the CPU cache.
Does a single TLB exist on the CPU, shared between all processors, or does each processor have its own TLB cache?

Could anyone please explain the relationship between the MMU and the L1 and L2 caches?

Answer

The TLB caches the translations listed in the page table. Each CPU core can be running in a different context, with different page tables. This is what you'd call the MMU, if it was a separate "unit", so each core has its own MMU. Any shared caches are always physically-indexed / physically tagged, so they cache based on post-MMU physical address.
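
To make the relationship concrete, here is a toy software model of that lookup order. All the names (tlb_entry, core, page_walk and so on) are made up for illustration, not a real CPU or OS interface, and real hardware overlaps these steps rather than doing them sequentially: each core consults its own private TLB, falls back to a page walk on a miss, and the resulting physical address is what a physically-indexed / physically-tagged shared cache would be indexed with.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT  12
    #define PAGE_OFFSET 0xfffULL
    #define TLB_ENTRIES 64            /* direct-mapped, just to keep the model small */

    struct tlb_entry {
        bool     valid;
        uint64_t vpn;                 /* virtual page number */
        uint64_t pfn;                 /* physical frame number */
    };

    struct core {                     /* each core has its own private TLB / MMU state */
        struct tlb_entry tlb[TLB_ENTRIES];
    };

    /* Stand-in for the hardware page walker: reads this core's current page
     * tables in memory. Identity-maps here so the sketch is self-contained. */
    static uint64_t page_walk(struct core *c, uint64_t vpn)
    {
        (void)c;
        return vpn;
    }

    static uint64_t translate(struct core *c, uint64_t vaddr)
    {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        struct tlb_entry *e = &c->tlb[vpn % TLB_ENTRIES];

        if (!(e->valid && e->vpn == vpn)) {   /* TLB miss: walk and fill */
            e->pfn   = page_walk(c, vpn);
            e->vpn   = vpn;
            e->valid = true;
        }
        /* The *physical* address is what a shared, physically-tagged cache
         * (e.g. L3) is indexed and tagged with. */
        return (e->pfn << PAGE_SHIFT) | (vaddr & PAGE_OFFSET);
    }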

The TLB is a cache (of PTEs), so technically it's just an implementation detail that could vary by microarchitecture (between different implementations of the x86 architecture).

In practice, all that really varies is the size. 2-level TLBs are common now, to keep full TLB misses to a minimum while still being fast enough to allow 3 translations per clock cycle.

It's much faster to just re-walk the page tables (which can be hot in local L1 data or L2 cache) to rebuild a TLB entry than to try to share TLB entries across cores. This is what sets the lower bound on what extremes are worth going to in avoiding TLB misses, unlike with data caches which are the last line of defence before you have to go off-core to shared L3 cache, or off-chip to DRAM on an L3 miss.
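
For reference, this is roughly what a re-walk looks like on x86-64, written as a hedged software model rather than real hardware or kernel code: it ignores huge pages, permission checks and 5-level paging, and phys_to_virt() is an assumed helper for reading a page-table page, not a real API. The point is that each of the four loads can hit in the local L1d or L2 cache, which is what makes re-walking cheap.

    #include <stdint.h>

    #define PTE_PRESENT 0x1ULL
    #define ADDR_MASK   0x000ffffffffff000ULL   /* bits 51:12: next level's physical address */

    /* Assumed helper: gives the model a pointer it can use to read one
     * page-table page that lives at a given physical address. */
    uint64_t *phys_to_virt(uint64_t paddr);

    /* Returns the physical address for vaddr, or 0 where real hardware
     * would raise a page fault. cr3 holds the PML4's physical address. */
    uint64_t walk(uint64_t cr3, uint64_t vaddr)
    {
        uint64_t table = cr3 & ADDR_MASK;

        /* PML4 -> PDPT -> PD -> PT: 9 index bits per level, starting at bit 39.
         * Each iteration is one load that can be hot in L1d/L2. */
        for (int shift = 39; shift >= 12; shift -= 9) {
            uint64_t entry = phys_to_virt(table)[(vaddr >> shift) & 0x1ff];
            if (!(entry & PTE_PRESENT))
                return 0;
            table = entry & ADDR_MASK;
        }
        return table | (vaddr & 0xfff);         /* frame + 12-bit page offset */
    }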

For example, Skylake added a 2nd page-walk unit (to each core). Good page-walking is essential for workloads where cores can't usefully share TLB entries (threads from different processes, or not touching many shared virtual pages).

A shared TLB would mean that invlpg to invalidate cached translations when you do change a page table would always have to go off-core. (Although in practice an OS needs to make sure other cores running other threads of a multi-threaded process have their private TLB entries "shot down" during something like munmap, using software methods for inter-core communication like an IPI (inter-processor interrupt).)
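
As a rough sketch of that software shootdown (what an OS does during something like munmap), the sequence below is illustrative only: send_ipi() and cores_sharing_these_tables() are hypothetical placeholders rather than a real kernel API, and only the privileged invlpg instruction itself is real.

    #include <stdint.h>

    /* Invalidate this core's own TLB entry for one page (privileged). */
    static inline void local_invlpg(void *vaddr)
    {
        __asm__ volatile("invlpg (%0)" :: "r"(vaddr) : "memory");
    }

    /* Hypothetical kernel plumbing, assumed for the sketch. */
    void send_ipi(int cpu, void (*fn)(void *), void *arg);
    int  cores_sharing_these_tables(int *cpus, int max);

    static void shootdown_handler(void *vaddr)
    {
        local_invlpg(vaddr);              /* each remote core flushes only its own TLB */
    }

    void unmap_one_page(volatile uint64_t *pte, void *vaddr)
    {
        *pte = 0;                         /* 1. clear the PTE in the shared page table */
        local_invlpg(vaddr);              /* 2. invalidate the initiating core's entry */

        int cpus[64];
        int n = cores_sharing_these_tables(cpus, 64);
        for (int i = 0; i < n; i++)       /* 3. IPI every other core using these tables,
                                                so stale private entries can't survive */
            send_ipi(cpus[i], shootdown_handler, vaddr);

        /* A real kernel also waits for all handlers to finish before reusing the page. */
    }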

But with private TLBs, a context switch to a new process can just set a new CR3 (top-level page-directory pointer) and invalidate this core's whole TLB without having to bother other cores or track anything globally.
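
A minimal sketch of that step (privileged code, with a made-up process struct), shown only to illustrate that the write to CR3 is purely core-local:

    #include <stdint.h>

    struct process { uint64_t pml4_phys; };   /* hypothetical: physical address of top-level table */

    static inline void load_cr3(uint64_t pml4_phys)
    {
        /* Writing CR3 switches page tables and implicitly flushes this core's
         * non-global TLB entries; no other core is involved. */
        __asm__ volatile("mov %0, %%cr3" :: "r"(pml4_phys) : "memory");
    }

    void switch_address_space(struct process *next)
    {
        load_cr3(next->pml4_phys);
    }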

There is a PCID (process context ID) feature that lets TLB entries be tagged with one of 16 or so IDs so entries from different process's page tables can be hot in the TLB instead of needing to be flushed on context switch. For a shared TLB you'd need to beef this up.
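
With PCIDs, the same context switch tags the CR3 value with a per-process ID instead of flushing. The bit layout below (a 12-bit PCID in CR3 bits 11:0, and bit 63 of the value written meaning "don't flush entries tagged with the new PCID") follows the Intel manuals, but pcid_for() and the process struct are assumptions for the sketch, not a real OS policy:

    #include <stdint.h>

    #define CR3_NOFLUSH (1ULL << 63)          /* keep TLB entries tagged with the new PCID */
    #define PCID_MASK   0xfffULL              /* 12-bit PCID field in CR3[11:0] */

    struct process { uint64_t pml4_phys; };   /* hypothetical, as in the previous sketch */

    uint16_t pcid_for(struct process *p);     /* assumed OS policy: process -> PCID */

    static inline void load_cr3(uint64_t value)
    {
        __asm__ volatile("mov %0, %%cr3" :: "r"(value) : "memory");
    }

    void switch_address_space_pcid(struct process *next)
    {
        uint64_t cr3 = (next->pml4_phys & ~PCID_MASK)   /* table pointer (page-aligned anyway) */
                     | (pcid_for(next) & PCID_MASK)     /* tag this core's TLB fills */
                     | CR3_NOFLUSH;                     /* reuse whatever is still hot */
        load_cr3(cr3);
    }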

Another complication is that TLB entries need to track "dirty" and "accessed" bits in the PTE. They're not necessarily just a read-only cache of PTEs.
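
A small sketch of what that means in practice: on a walk (or on the first store to a page), the walker doesn't just read the PTE, it also sets the Accessed bit, and the Dirty bit for a write, in the PTE in memory; the TLB entry then remembers that this has been done. The bit positions follow the x86-64 PTE format, and the C11 atomic is a stand-in for the locked read-modify-write the hardware performs.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_ACCESSED (1ULL << 5)
    #define PTE_DIRTY    (1ULL << 6)

    /* Model of the A/D update: the authoritative copy lives in the PTE in
     * memory, so the update goes there, not into a writeback from the TLB. */
    void mark_pte_used(_Atomic uint64_t *pte, bool is_write)
    {
        uint64_t set = PTE_ACCESSED | (is_write ? PTE_DIRTY : 0);
        atomic_fetch_or(pte, set);
    }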

For an example of how the pieces fit together in a real CPU, see David Kanter's writeup of Intel's Sandybridge design. Note that the diagrams are for a single SnB core. The only shared-between-cores cache in most CPUs is the last-level data cache.

Intel's SnB-family designs all use a 2MiB-per-core modular L3 cache on a ring bus. So adding more cores adds more L3 to the total pool, as well as adding new cores (each with their own L2/L1D/L1I/uop-cache, and two-level TLB.)
