What happens after a L2 TLB miss?


Problem description


I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer result in misses?

I am unsure whether "page walking" occurs in special hardware circuitry, or whether the page tables are stored in the L2/L3 cache, or whether they only reside in main memory.

Solution

(Some of this is x86 and Intel-specific. Most of the key points apply to any CPU that does hardware page walks. I also discuss ISAs like MIPS that handle TLB misses with software.)

Modern x86 microarchitectures have dedicated page-walk hardware. They can even speculatively do page-walks to load TLB entries before a TLB miss actually happens. And to support hardware virtualization, the page-walkers can handle guest page tables inside a host VM. (Guest physical memory = host virtual memory, more or less. VMWare published a paper with a summary of EPT, and benchmarks on Nehalem).

Skylake can even have two page walks in flight at once; see Section 2.1.3 of Intel's optimization manual. (Intel also lowered the page-split load penalty from ~100 to ~5 or 10 extra cycles of latency, about the same as a cache-line split but worse throughput. This may be related, or maybe adding a 2nd page-walk unit was a separate response to discovering that page-split accesses (and TLB misses?) were more important than they had previously estimated in real workloads.)

Some microarchitectures protect you from speculative page-walks by treating it as mis-speculation when an un-cached PTE is speculatively loaded but then modified with a store to the page table before the first real use of the entry. i.e. snoop for stores to the page table entries for speculative-only TLB entries that haven't been architecturally referenced by any earlier instructions.

(Win9x depended on this, and not breaking important existing code is something CPU vendors care about. When Win9x was written, the current TLB-invalidation rules didn't exist yet so it wasn't even a bug; see Andy Glew's comments quoted below). AMD Bulldozer-family violates this assumption, giving you only what the x86 manuals say on paper.


The page-table loads generated by the page-walk hardware can hit in L1, L2, or L3 caches. Broadwell perf counters, for example, can count page-walk hits in your choice of L1, L2, L3, or memory (i.e. cache miss). The event name is PAGE_WALKER_LOADS.DTLB_L1 for Number of DTLB page walker hits in the L1+FB, and others for ITLB and other levels of cache.

Since modern page tables use a radix-tree format with page directory entries pointing to the tables of page table entries, higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. This means you need to flush the TLB in cases where you might think you didn't need to. Intel and AMD actually do this, according to this paper (section 3).
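To make that radix-tree structure concrete, here is a minimal C sketch of what a 4-level x86-64-style walk has to do (simplified: no large pages, no permission or accessed/dirty handling; phys_read64 and the flat fake_phys_mem array are made-up stand-ins for the walker's cacheable loads from physical memory):

    #include <stdint.h>
    #include <stdbool.h>

    /* Stand-in for physical memory: in real hardware each phys_read64() below
     * is a load that can hit in L1/L2/L3 or go all the way to DRAM. */
    static uint64_t fake_phys_mem[1 << 16];
    static uint64_t phys_read64(uint64_t paddr) { return fake_phys_mem[paddr / 8]; }

    #define PTE_PRESENT 0x1ULL
    #define ADDR_MASK   0x000FFFFFFFFFF000ULL   /* bits 51:12 of each entry */

    /* Simplified 4-level walk: CR3 -> PML4 -> PDPT -> PD -> PT -> page frame.
     * Each level consumes 9 bits of the virtual address; the low 12 bits are
     * the offset within the 4 KiB page. Returns false on a not-present entry,
     * which is where the CPU would raise #PF instead of filling the TLB. */
    static bool page_walk(uint64_t cr3, uint64_t vaddr, uint64_t *paddr_out)
    {
        uint64_t table = cr3 & ADDR_MASK;
        for (int level = 3; level >= 0; level--) {
            unsigned idx   = (vaddr >> (12 + 9 * level)) & 0x1FF;
            uint64_t entry = phys_read64(table + idx * 8);  /* dependent load */
            if (!(entry & PTE_PRESENT))
                return false;
            table = entry & ADDR_MASK;   /* next-level table, or final frame */
        }
        *paddr_out = table | (vaddr & 0xFFF);
        return true;
    }

Each iteration is a dependent load, which is why the latency adds up, and why caching an upper-level entry inside the walker (one PML4 entry covers 512 GiB of virtual space, one PDPT entry covers 1 GiB) lets it skip the first one or two loads of this chain on most misses.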

That paper says that page-walk loads on AMD CPUs ignore L1, but do go through L2. (Perhaps to avoid polluting L1, or to reduce contention for read ports). Anyway, this makes caching a few high-level PDEs (that each cover many different translation entries) inside the page-walk hardware even more valuable, because a chain of pointer-chasing is more costly with higher latency.

But note that x86 guarantees no negative caching of TLB entries. Changing a page from Invalid to Valid doesn't require invlpg. (So if a real implementation does want to do that kind of negative caching, it has to snoop or somehow still implement the semantics guaranteed by x86 manuals.)
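To illustrate why that guarantee is convenient for an OS, a demand-paging #PF handler can simply install the missing translation and return, with no invlpg, because the architecture promises the old not-present entry was never cached in the TLB. A hedged sketch (pte_for_vaddr and alloc_phys_page are hypothetical helpers, not any real kernel's API):

    #include <stdint.h>

    typedef uint64_t pte_t;
    #define PTE_PRESENT  0x1ULL
    #define PTE_WRITABLE 0x2ULL

    /* Hypothetical helpers standing in for a real kernel's page-table lookup
     * and physical-page allocator; the names are made up for this sketch. */
    pte_t   *pte_for_vaddr(uint64_t vaddr);
    uint64_t alloc_phys_page(void);

    /* #PF handler for a demand-paged mapping: the hardware page walk found a
     * not-present PTE and faulted, so we fix the page table in memory. */
    void handle_demand_fault(uint64_t fault_vaddr)
    {
        pte_t *pte = pte_for_vaddr(fault_vaddr);
        *pte = alloc_phys_page() | PTE_PRESENT | PTE_WRITABLE;

        /* No invlpg here: x86 guarantees not-present entries are never cached,
         * so re-running the faulting instruction walks the tables again and
         * picks up the new translation. Changing or removing a *valid*
         * translation is the case that does require invlpg / a TLB shootdown. */
    }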

(Historical note: Andy Glew's answer to a duplicate of this question over on electronics.SE says that in P5 and earlier, hardware page-walk loads bypassed the internal L1 cache (it was usually write-through so this made pagewalk coherent with stores). IIRC, my Pentium MMX motherboard had L2 cache on the mobo, perhaps as a memory-side cache. Andy also confirms that P6 and later do load from the normal L1d cache.)

That other answer has some interesting links at the end, too, including the paper I linked at the end of last paragraph. It also seems to think the OS might update the TLB itself, rather than just the page table, on a page fault (HW pagewalk doesn't find an entry), and wonders if HW page walking can be disabled on x86. (But actually the OS just modifies the page table in memory, and returning from #PF re-runs the faulting instruction so HW pagewalk will succeed this time.) Perhaps the paper is thinking of ISAs like MIPS where software TLB management / miss-handling is possible.

I don't think it's actually possible to disable HW pagewalk on P5 (or any other x86). That would require a way for software to update TLB entries with a dedicated instruction (there isn't one), or with wrmsr or an MMIO store. Confusingly, Andy says (in a thread I quoted below) that software TLB handling was faster on P5. I think he meant it would have been faster if it had been possible. He was working at Imation (on MIPS) at the time, where SW page walk is an option (sometimes the only option), unlike x86 AFAIK.


As Paul Clayton points out (on another question about TLB misses), the big advantage of hardware page-walks is that TLB misses don't necessarily stall the CPU. (Out-of-order execution proceeds normally, until the re-order buffer fills because the load/store can't retire. Retirement happens in-order, because the CPU can't officially commit anything that shouldn't have happened if a previous instruction faulted.)

BTW, it would probably be possible to build an x86 CPU that handles TLB misses by trapping to microcode instead of having a dedicated hardware state machine. This would be (much?) less performant, and maybe not worth triggering speculatively (since issuing uops from microcode means you can't be issuing instructions from the code that's running).

Microcoded TLB handling could in theory be non-terrible if you run those uops in a separate hardware thread (interesting idea), SMT-style. You'd need it to have much less start/stop overhead than normal Hyperthreading for switching from single-thread to both logical cores active (has to wait for things to drain until it can partition the ROB, store queue, and so on) because it will start/stop extremely often compared to a usual logical core. But that may be possible if it's not really a fully separate thread but just some separate retirement state, so cache misses in it don't block retirement of the main code, and have it use a couple hidden internal registers for temporaries. The code it has to run is chosen by the CPU designers, so the extra HW thread doesn't have to keep anywhere near the full architectural state of an x86 core. It rarely has to do any stores (maybe just for the accessed flags in PTEs?), so it wouldn't be bad to let those stores use the same store queue as the main thread. You'd just partition the front-end to mix in the TLB-management uops and let them execute out of order with the main thread. If you could keep the number of uops per pagewalk small, it might not suck.

No CPUs actually do "HW" page-walks with microcode in a separate HW thread that I'm aware of, but it is a theoretical possibility.


Software TLB handling: some RISCs are like this, not x86

In some RISC architectures (like MIPS), the OS kernel is responsible for handling TLB misses. TLB misses result in execution of the kernel's TLB miss interrupt handler. This means the OS is free to define its own page table format on such architectures. I guess marking a page as dirty after a write also requires a trap to an OS-provided routine, if the CPU doesn't know about the page-table format.
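As a rough C-level picture of what such a handler does (on real MIPS this is a short assembly routine reached through a dedicated exception vector, and the BadVAddr register and the tlbwr instruction are accessed with coprocessor-0 instructions; the helper names below are hypothetical):

    #include <stdint.h>

    /* Hypothetical wrappers for what would be coprocessor-0 register reads and
     * the tlbwr instruction on MIPS; they are not real functions. */
    uint64_t read_faulting_vaddr(void);                 /* e.g. BadVAddr */
    void     tlb_write_random(uint64_t vaddr, uint64_t pte);

    /* Because the hardware never walks the page tables itself, the OS can keep
     * them in any format it likes; this lookup could be a linear array, a
     * hashed page table, a multi-level tree, ... (hypothetical helper). */
    uint64_t os_lookup_pte(uint64_t vaddr);             /* 0 if unmapped */

    void tlb_refill_handler(void)
    {
        uint64_t vaddr = read_faulting_vaddr();
        uint64_t pte   = os_lookup_pte(vaddr);

        if (pte == 0) {
            /* Genuinely unmapped: fall through to the normal page-fault path
             * (not shown) instead of refilling the TLB. */
            return;
        }

        /* Install the translation in a TLB entry and return; the faulting
         * instruction re-executes and now hits in the TLB. */
        tlb_write_random(vaddr, pte);
    }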

This chapter from an operating systems textbook explains virtual memory, page tables, and TLBs. They describe the difference between software-managed TLBs (MIPS, SPARCv9) and hardware-managed TLBs (x86). A paper, A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations, shows some example code from what it says is the TLB miss handler in Ultrix, if you want a real example.


Other links


Comments about TLB coherency from Andy Glew, one of the architects on Intel P6 (Pentium Pro / II / III), who later worked at AMD.

The main reason Intel started running the page table walks through the cache, rather than bypassing the cache, was performance. Prior to P6 page table walks were slow, not benefitting from cache, and were non-speculative. Slow enough that software TLB miss handling was a performance win[1]. P6 sped TLB misses up by doing them speculatively, using the cache, and also by caching intermediate nodes like page directory entries.

By the way, AMD was reluctant to do TLB miss handling speculatively. I think because they were influenced by DEC VAX Alpha architects. One of the DEC Alpha architects told me rather emphatically that speculative handling of TLB misses, such as P6 was doing, was incorrect and would never work. When I arrived at AMD circa 2002 they still had something called a "TLB Fence" - not a fence instruction, but a point in the rop or microcode sequence where TLB misses either could or could not be allowed to happen - I am afraid that I do not remember exactly how it worked.

so I think that it is not so much that Bulldozer abandoned TLB and page table walking coherency, whatever that means, as that Bulldozer may have been the first AMD machine to do moderately aggressive TLB miss handling.

recall that when P6 was started P5 was not shipping: the existing x86es all did cache bypass page table walking in-order, non-speculatively, no asynchronous prefetches, but on write through caches. I.e. They WERE cache coherent, and the OS could rely on deterministic replacement of TLB entries. IIRC I wrote those architectural rules about speculative and non-deterministic cacheability, both for TLB entries and for data and instruction caches. You can't blame OSes like Windows and UNIX and Netware for not following page table and TLB management rules that did not exist at the time.


Footnote 1: to the best of my knowledge, no x86 CPU has supported software TLB management. I think Andy meant to say "would have been faster" on P5, because it couldn't be speculative or out-of-order anyway, and running x86 instructions with physical addresses (paging disabled to avoid a catch-22) would have allowed caching of page-table loads. Andy was maybe thinking of MIPS, which was his day job at the time.


More from Andy Glew from the same thread, because these comments deserve to be in a full answer somewhere.

(2) one of my biggest regrets wrt P6 is that we did not provide Intra-instruction TLB consistency support. Some instructions access the same page more than once. It was possible for different uops in the same instruction to get different translations for the same address. If we had given microcode the ability to save a physical address translation, and then use that, things would have been better IMHO.

(2a) I was a RISC proponent when I joined P6, and my attitude was "let SW (microcode) do it".

(2a') one of the most embarrassing bugs was related to add-with-carry to memory. In early microcode. The load would go, the carry flag would be updated, and the store could fault -but the carry flag had already been updated, so the instruction could not be restarted. // it was a simple microcode fix, doing the store before the carry flag was written - but one extra uop was enough to make that instruction not fit in the "medium speed" ucode system.

(3) Anyway - the main "support" P6 and its descendants gave to handling TLB coherency issues was to rewalk the page tables at retirement before reporting a fault. This avoided confusing the OS by reporting a fault when the page tables said there should not be one.

(4) meta comment: I don't think that any architecture has properly defined rules for caching of invalid TLB entries. // AFAIK most processors do not cache invalid TLB entries - except possibly Itanium with its NAT (Not A Thing) pages. But there's a real need: speculative memory accesses may be to wild addresses, miss the TLB, do an expensive page table walk, slowing down other instructions and threads - and then doing it over and over again because the fact that "this is a bad address, no need to walk the page tables" is not remembered. // I suspect that DOS attacks could use this.

(4') worse, OSes may make implicit assumptions that invalid translations are never cached, and therefore not do a TLB invalidation or MP TLB shoot down when transitioning from invalid to valid. // Worse^2: imagine that you are caching interior nodes of the page table cache. Imagine that a PD contains all invalid PDEs; worse^3, that the PD contains valid PDEs that point to PTs that are all invalid. Are you still allowed to cache those PDEs? Exactly when does the OS need to invalidate an entry?

(4'') because MP TLB shoot downs using interprocessor interrupts were expensive, OS performance guys (like I used to be) are always making arguments like "we don't need to invalidate the TLB after changing a PTE from invalid to valid" or "from valid read-only to valid writable with a different address". Or "we don't need to invalidate the TLB after changing a PDE to point to a different PT whose PTEs are exactly the same as the original PT...". // Lots of great ingenious arguments. Unfortunately not always correct.

Some of my computer architect friends now espouse coherent TLBs: TLBs that snoop writes just like data caches. Mainly to allow us to build even more aggressive TLBs and page table caches, if both valid and invalid entries of leaf and interior nodes. And not to have to worry about OS guys' assumptions. // I am not there yet: too expensive for low end hardware. But might be worth doing at high end.

me: Holy crap, so that's where that extra ALU uop comes from in memory-destination ADC, even on Core2 and SnB-family? Never would have guessed, but had been puzzled by it.

Andy: often when you "do the RISC thing" extra instructions or micro instructions are required, in a careful order. Whereas if you have "CISCy" support, like special hardware support so that a single instruction is a transaction, either all done or all not done, shorter code sequences can be used.

Something similar applies to self modifying code: it was not so much that we wanted to make self modifying code run fast, as that trying to make the legacy mechanisms for self modifying code - draining the pipe for serializing instructions like CPUID - were slower than just snooping the Icache and pipeline. But, again, this applies to a high end machine: on a low end machine, the legacy mechanisms are fast enough and cheap.

Ditto memory ordering. High end snooping faster; low end draining cheaper.

It is hard to maintain this dichotomy.

It is pretty common that a particular implementation has to implement rules compatible with but stronger than the architectural statement. But not all implementations have to do it the same way.

This comment thread was on Andy's answer to a question about self-modifying code and seeing stale instructions; another case where real CPUs go above and beyond the requirements on paper, because it's actually easier to always snoop for stores near EIP/RIP than to re-sync only on branch instructions if you didn't keep track of what happened between branches.
