What happens after a L2 TLB miss?

Question

I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer result in misses?

I am unsure whether "page walking" occurs in special hardware circuitry, or whether the page tables are stored in the L2/L3 cache, or whether they only reside in main memory.

Answer

(Some of this is x86 and Intel-specific. Most of the key points apply to any CPU that does hardware page walks. I also discuss ISAs like MIPS that handle TLB misses with software.)

Modern x86 microarchitectures have dedicated page-walk hardware. They can even speculatively do page-walks to load TLB entries before a TLB miss actually happens. And to support hardware virtualization, the page-walkers can handle guest page tables inside a host VM. (Guest physical memory = host virtual memory, more or less. VMWare published a paper with a summary of EPT, and benchmarks on Nehalem).
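
To get a feel for why caching matters so much under virtualization: with nested paging, every guest-physical address produced during the guest's walk must itself be translated through the host (EPT) tables. As a rough back-of-the-envelope figure (assuming 4-level guest and 4-level host tables, and ignoring the paging-structure caches that make the common case much cheaper), a single miss can cost up to (4+1) * (4+1) - 1 = 24 memory accesses in the worst case.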

Skylake can even have two page walks in flight at once, see Section 2.1.3 of Intel's optimization manual. (Intel also lowered the page-split load penalty from ~100 to ~5 or 10 extra cycles of latency, about the same as a cache-line split but worse throughput. This may be related, or maybe adding a 2nd page-walk unit was a separate response to discovering that page split accesses (and TLB misses?) were more important than they had previously estimated in real workloads).

Some microarchitectures protect you from speculative page-walks by treating it as mis-speculation when an un-cached PTE is speculatively loaded but then modified with a store to the page table before the first real use of the entry. i.e. snoop for stores to the page table entries for speculative-only TLB entries that haven't been architecturally referenced by any earlier instructions.

(Win9x depended on this, and not breaking important existing code is something CPU vendors care about. When Win9x was written, the current TLB-invalidation rules didn't exist yet so it wasn't even a bug; see Andy Glew's comments quoted below). AMD Bulldozer-family violates this assumption, giving you only what the x86 manuals say on paper.

The page-table loads generated by the page-walk hardware can hit in L1, L2, or L3 caches. Broadwell perf counters, for example, can count page-walk hits in your choice of L1, L2, L3, or memory (i.e. cache miss). The event name is PAGE_WALKER_LOADS.DTLB_L1, for "Number of DTLB page walker hits in the L1+FB", and there are others for ITLB and the other levels of cache.

Since modern page tables use a radix-tree format with page directory entries pointing to the tables of page table entries, higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. This means you need to flush the TLB in cases where you might think you didn't need to. Intel and AMD actually do this, according to this paper (section 3). So does ARM, with their Intermediate table walk cache.
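
To make the radix-tree format concrete, here is a minimal sketch in C of what a 4-level x86-64 walk does. The hardware page-walker performs the equivalent of this loop itself; read_phys is a hypothetical helper standing in for the walker's physical-memory loads, and huge pages, permission bits, and the accessed/dirty flags are ignored:

    #include <stdint.h>

    /* Hypothetical helper: read a 64-bit page-table entry at a physical address.
       On a real CPU the page-walker issues this load itself; it can hit in the
       L1/L2/L3 caches or go all the way to DRAM. */
    extern uint64_t read_phys(uint64_t paddr);

    #define PTE_PRESENT 0x1ULL
    #define ADDR_MASK   0x000FFFFFFFFFF000ULL   /* physical address bits 51:12 */

    /* Walk PML4 -> PDPT -> PD -> PT for a 4 KiB page.
       Returns the physical address, or 0 to stand in for raising #PF. */
    uint64_t walk(uint64_t cr3, uint64_t vaddr)
    {
        unsigned idx[4] = {
            (unsigned)(vaddr >> 39) & 0x1FF,   /* PML4 index (bits 47:39) */
            (unsigned)(vaddr >> 30) & 0x1FF,   /* PDPT index (bits 38:30) */
            (unsigned)(vaddr >> 21) & 0x1FF,   /* PD index   (bits 29:21) */
            (unsigned)(vaddr >> 12) & 0x1FF,   /* PT index   (bits 20:12) */
        };
        uint64_t table = cr3 & ADDR_MASK;      /* physical base of the PML4 */

        for (int level = 0; level < 4; level++) {
            uint64_t entry = read_phys(table + idx[level] * 8);
            if (!(entry & PTE_PRESENT))
                return 0;                      /* not present: page fault */
            table = entry & ADDR_MASK;         /* next table, or the final frame */
        }
        return table | (vaddr & 0xFFF);        /* frame base + page offset */
    }

Caching a PDE inside the page-walk hardware amounts to remembering the intermediate "table" value from one iteration of this loop, so a later miss to a nearby virtual address can skip the upper levels.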

That paper says that page-walk loads on AMD CPUs ignore L1, but do go through L2. (Perhaps to avoid polluting L1, or to reduce contention for read ports). Anyway, this makes caching a few high-level PDEs (that each cover many different translation entries) inside the page-walk hardware even more valuable, because a chain of pointer-chasing is more costly with higher latency.

But note that Intel guarantees no negative caching of TLB entries. Changing a page from Invalid to Valid doesn't require invlpg. (So if a real implementation does want to do that kind of negative caching, it has to snoop or somehow still implement the semantics guaranteed by Intel manuals.)
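
As a hedged sketch of what that guarantee buys an OS (not real kernel code; pte_of is an assumed helper that locates the in-memory PTE for a virtual address):

    #include <stdint.h>

    #define PTE_PRESENT 0x1ULL

    extern uint64_t *pte_of(void *vaddr);   /* assumed helper: find the PTE in memory */

    static inline void invlpg(void *vaddr)
    {
        __asm__ volatile("invlpg (%0)" : : "r"(vaddr) : "memory");
    }

    /* Invalid -> Valid: Intel guarantees no negative TLB caching, so the next
       access to this page just misses the TLB and the hardware walk finds the
       new PTE.  No invlpg is architecturally required. */
    void map_page(void *vaddr, uint64_t frame_phys)
    {
        *pte_of(vaddr) = frame_phys | PTE_PRESENT;
    }

    /* Valid -> Invalid (or changing an existing valid translation): a stale
       positive entry may still be in the TLB, so it must be flushed. */
    void unmap_page(void *vaddr)
    {
        *pte_of(vaddr) &= ~PTE_PRESENT;
        invlpg(vaddr);
    }

On a multi-core system the unmap case also needs a TLB shootdown of the other cores, which is part of why OS people are so keen to skip invalidations wherever the architecture lets them (see Andy Glew's comments below).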

There are old Cyrix CPUs that do perform negative caching, though. The common subset of x86 guarantees across vendors isn't always as strong as Intel's. 64-bit kernels should safely be able to change a PTE from not-present to present without invlpg, because those Cyrix chips were 32-bit-only. (If Intel, AMD, and Via manuals all agree that it's safe; IDK of any other x86-64 vendors.)

(Historical note: Andy Glew's answer to a duplicate of this question over on electronics.SE says that in P5 and earlier, hardware page-walk loads bypassed the internal L1 cache (it was usually write-through, so this made the pagewalk coherent with stores). IIRC, my Pentium MMX motherboard had L2 cache on the mobo, perhaps as a memory-side cache. Andy also confirms that P6 and later do load from the normal L1d cache.)

That other answer has some interesting links at the end, too, including the paper I linked at the end of last paragraph. It also seems to think the OS might update the TLB itself, rather than just the page table, on a page fault (HW pagewalk doesn't find an entry), and wonders if HW page walking can be disabled on x86. (But actually the OS just modifies the page table in memory, and returning from #PF re-runs the faulting instruction so HW pagewalk will succeed this time.) Perhaps the paper is thinking of ISAs like MIPS where software TLB management / miss-handling is possible.
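
A hedged sketch of that flow on x86 (with hypothetical helper names; real handlers decode the error code and handle copy-on-write, swap-in, and so on): the handler only edits the in-memory page table, because x86 has no instruction for software to insert a TLB entry; the refill happens when the hardware walker re-runs on the retried instruction.

    #include <stdint.h>

    /* Assumed OS helpers for the sketch. */
    extern uint64_t  read_cr2(void);          /* faulting virtual address (CR2) */
    extern uint64_t *pte_of(uint64_t vaddr);  /* locate the in-memory PTE */
    extern uint64_t  alloc_frame(void);       /* physical address of a free page */

    #define PTE_PRESENT  0x1ULL
    #define PTE_WRITABLE 0x2ULL

    void page_fault_handler(uint64_t error_code)
    {
        uint64_t vaddr = read_cr2();
        (void)error_code;                     /* not-present vs. protection, etc. */

        /* Fix the page table in memory; nothing is written into the TLB here. */
        *pte_of(vaddr) = alloc_frame() | PTE_PRESENT | PTE_WRITABLE;

        /* Returning (iret) re-executes the faulting instruction; it misses the
           TLB again and this time the hardware page walk succeeds. */
    }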

I don't think it's actually possible to disable HW pagewalk on P5 (or any other x86). That would require a way for software to update TLB entries with a dedicated instruction (there isn't one), or with wrmsr or an MMIO store. Confusingly, Andy says (in a thread I quoted below) that software TLB handling was faster on P5. I think he meant it would have been faster if it had been possible. He was working at Imation (on MIPS) at the time, where SW page walk is an option (sometimes the only option), unlike x86.

Or perhaps he meant using MSRs to set up TLB entries ahead of time in cases where you expect there not to already be one, avoiding some page walks. Apparently 386 / 486 had TLB-entry query / set access via special registers: https://retrocomputing.stackexchange.com/questions/21963/how-did-the-test-registers-work-on-the-i386-and-the-i486 But there's probably no P5 MSR equivalent for that 386/486 functionality.
AFAIK, there wasn't a way to have a TLB miss trap to a software function (with paging disabled?) even on 386/486, so you couldn't fully avoid the HW page walker, just prime the TLB to avoid some TLB misses, at least on 386/486.

As Paul Clayton points out (on another question about TLB misses), the big advantage of hardware page-walks is that TLB misses don't necessarily stall the CPU. (Out-of-order execution proceeds normally, until the re-order buffer fills because the load/store can't retire. Retirement happens in-order, because the CPU can't officially commit anything that shouldn't have happened if a previous instruction faulted.)

BTW, it would probably be possible to build an x86 CPU that handles TLB misses by trapping to microcode instead of having a dedicated hardware state-machine. This would be (much?) less performant, and maybe not worth triggering speculatively (since issuing uops from microcode means you can't be issuing instructions from the code that's running.)

Microcoded TLB handling could in theory be non-terrible if you run those uops in a separate hardware thread (interesting idea), SMT-style. You'd need it to have much less start/stop overhead than normal Hyperthreading has for switching from single-thread to both logical cores active (which has to wait for things to drain until it can partition the ROB, store queue, and so on), because it will start/stop extremely often compared to a usual logical core. But that may be possible if it's not really a fully separate thread but just some separate retirement state, so cache misses in it don't block retirement of the main code, and have it use a couple of hidden internal registers for temporaries. The code it has to run is chosen by the CPU designers, so the extra HW thread doesn't need anywhere near the full architectural state of an x86 core. It rarely has to do any stores (maybe just for the accessed flags in PTEs?), so it wouldn't be bad to let those stores use the same store queue as the main thread. You'd just partition the front-end to mix in the TLB-management uops and let them execute out of order with the main thread. If you could keep the number of uops per pagewalk small, it might not suck.

No CPUs that I'm aware of actually do "HW" page-walks with microcode in a separate HW thread, but it is a theoretical possibility.

In some RISC architectures (like MIPS), the OS kernel is responsible for handling TLB misses. TLB misses result in execution of the kernel's TLB miss interrupt handler. This means the OS is free to define its own page table format on such architectures. I guess marking a page as dirty after a write also requires a trap to an OS-provided routine, if the CPU doesn't know about page table format.
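
For contrast, here is a hedged C-level sketch of a MIPS-style software TLB refill (real refill handlers are a handful of hand-written assembly instructions; the CP0 accessor names and the flat page_table layout are assumptions purely for illustration):

    #include <stdint.h>

    /* Assumed wrappers around MIPS CP0 registers and the TLBWR instruction. */
    extern uint64_t cp0_read_badvaddr(void);      /* faulting virtual address */
    extern void     cp0_write_entryhi(uint64_t);
    extern void     cp0_write_entrylo0(uint64_t);
    extern void     cp0_write_entrylo1(uint64_t);
    extern void     tlbwr(void);                  /* write a random TLB slot */

    /* The OS picks its own page-table format; a flat array of per-page
       entries is assumed here just to keep the sketch short. */
    extern uint64_t page_table[];

    void tlb_refill(void)
    {
        uint64_t vaddr = cp0_read_badvaddr();
        uint64_t vpn2  = vaddr >> 13;   /* each MIPS TLB entry maps an even/odd page pair */

        cp0_write_entryhi(vpn2 << 13);
        cp0_write_entrylo0(page_table[vpn2 * 2]);      /* even page of the pair */
        cp0_write_entrylo1(page_table[vpn2 * 2 + 1]);  /* odd page of the pair */
        tlbwr();                        /* drop the pair into a random TLB entry */
        /* eret returns and re-executes the instruction that missed */
    }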

This chapter from an operating systems textbook explains virtual memory, page tables, and TLBs. It describes the difference between software-managed TLBs (MIPS, SPARCv9) and hardware-managed TLBs (x86). A paper, A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations, shows some example code from what it says is the TLB miss handler in Ultrix, if you want a real example.

  • How does CPU make data request via TLBs and caches? A duplicate of this.
  • VIPT Cache: Connection between TLB & Cache? - the internals of a load port / load execution unit that accesses the dTLB in parallel with fetching tags/data from the indexed set.
  • What is PDE cache?
  • Measuring TLB miss handling cost in x86-64 - describes Westmere's perf counter for Page Walk Cycles. (Apparently new with 2nd-gen Nehalem = Westmere.)
  • https://lwn.net/Articles/379748/ (Linux hugepage support/performance, talks some about PowerPC and x86, and using oprofile to count page-walk cycles)
  • What Every Programmer Should Know About Memory?
  • Understanding TLB from CPUID results on Intel - my answer includes some background on TLBs, including why it wouldn't make sense to have a shared L3TLB across cores. (Summary: because unlike data, page translations are thread-private. Also, more / better page-walk hardware and TLB prefetch does more to help reduce the average cost of an L1i/dTLB miss in more cases.)

The main reason Intel started running the page table walks through the cache, rather than bypassing the cache, was performance. Prior to P6, page table walks were slow, not benefitting from cache, and were non-speculative. Slow enough that software TLB miss handling was a performance win¹. P6 sped TLB misses up by doing them speculatively, using the cache, and also by caching intermediate nodes like page directory entries.

By the way, AMD was reluctant to do TLB miss handling speculatively. I think because they were influenced by DEC VAX Alpha architects. One of the DEC Alpha architects told me rather emphatically that speculative handling of TLB misses, such as P6 was doing, was incorrect and would never work. When I arrived at AMD circa 2002 they still had something called a "TLB Fence" - not a fence instruction, but a point in the rop or microcode sequence where TLB misses either could or could not be allowed to happen - I am afraid that I do not remember exactly how it worked.

so I think that it is not so much that Bulldozer abandoned TLB and page table walking coherency, whatever that means, as that Bulldozer may have been the first AMD machine to do moderately aggressive TLB miss handling.

recall that when P6 was started P5 was not shipping: the existing x86es all did cache bypass page table walking in-order, non-speculatively, no asynchronous prefetches, but on write through caches. I.e. They WERE cache coherent, and the OS could rely on deterministic replacement of TLB entries. IIRC I wrote those architectural rules about speculative and non-deterministic cacheability, both for TLB entries and for data and instruction caches. You can't blame OSes like Windows and UNIX and Netware for not following page table and TLB management rules that did not exist at the time.

Footnote 1: This is the surprising claim I mentioned earlier, possibly referring to using MSRs to prime the TLB to hopefully avoid some page walks.

(2) one of my biggest regrets wrt P6 is that we did not provide Intra-instruction TLB consistency support. Some instructions access the same page more than once. It was possible for different uops in the same instruction to get different translations for the same address. If we had given microcode the ability to save a physical address translation, and then use that, things would have been better IMHO.

(2a) I was a RISC proponent when I joined P6, and my attitude was "let SW (microcode) do it".

(2a') one of the most embarrassing bugs was related to add-with-carry to memory. In early microcode. The load would go, the carry flag would be updated, and the store could fault - but the carry flag had already been updated, so the instruction could not be restarted. // it was a simple microcode fix, doing the store before the carry flag was written - but one extra uop was enough to make that instruction not fit in the "medium speed" ucode system.

(3) Anyway - the main "support" P6 and its descendants gave to handling TLB coherency issues was to rewalk the page tables at retirement before reporting a fault. This avoided confusing the OS by reporting a fault when the page tables said there should not be one.

(4) meta comment: I don't think that any architecture has properly defined rules for caching of invalid TLB entries. // AFAIK most processors do not cache invalid TLB entries - except possibly Itanium with its NAT (Not A Thing) pages. But there's a real need: speculative memory accesses may be to wild addresses, miss the TLB, do an expensive page table walk, slowing down other instructions and threads - and then doing it over and over again because the fact that "this is a bad address, no need to walk the page tables" is not remembered. // I suspect that DOS attacks could use this.

(4') worse, OSes may make implicit assumptions that invalid translations are never cached, and therefore not do a TLB invalidation or MP TLB shoot down when transitioning from invalid to valid. // Worse^2: imagine that you are caching interior nodes of the page table cache. Imagine that a PD contains all invalid PDEs; worse^3, that the PD contains valid PDEs that point to PTs that are all invalid. Are you still allowed to cache those PDEs? Exactly when does the OS need to invalidate an entry?

(4'') because MP TLB shoot downs using interprocessor interrupts were expensive, OS performance guys (like I used to be) are always making arguments like "we don't need to invalidate the TLB after changing a PTE from invalid to valid" or "from valid read-only to valid writable with a different address". Or "we don't need to invalidate the TLB after changing a PDE to point to a different PT whose PTEs are exactly the same as the original PT...". // Lots of great ingenious arguments. Unfortunately not always correct.

Some of my computer architect friends now espouse coherent TLBs: TLBs that snoop writes just like data caches. Mainly to allow us to build even more aggressive TLBs and page table caches, if both valid and invalid entries of leaf and interior nodes. And not to have to worry about OS guys' assumptions. // I am not there yet: too expensive for low end hardware. But might be worth doing at high end.

me: Holy crap, so that's where that extra ALU uop comes from in memory-destination ADC, even on Core2 and SnB-family? Never would have guessed, but had been puzzled by it.

Andy: often when you "do the RISC thing" extra instructions or micro instructions are required, in a careful order. Whereas if you have "CISCy" support, like special hardware support so that a single instruction is a transaction, either all done or all not done, shorter code sequences can be used.

Something similar applies to self modifying code: it was not so much that we wanted to make self modifying code run fast, as that trying to make the legacy mechanisms for self modifying code - draining the pipe for serializing instructions like CPUID - were slower than just snooping the Icache and pipeline. But, again, this applies to a high end machine: on a low end machine, the legacy mechanisms are fast enough and cheap.

Ditto memory ordering. High end snooping faster; low end draining cheaper.

It is hard to maintain that dichotomy.

It is pretty common that a particular implementation has to implement rules compatible with but stronger than the architectural statement. But not all implementations have to do it the same way.

This comment thread was on Andy's answer to a question about self-modifying code and seeing stale instructions; another case where real CPUs go above and beyond the requirements on paper, because it's actually easier to always snoop for stores near EIP/RIP than to re-sync only on branch instructions if you didn't keep track of what happened between branches.
