Does page walk take advantage of shared tables?


Question

Suppose two address spaces share a largish lump of non-contiguous memory. The system might want to share physical page table(s) between them. These tables wouldn't use Global bits (even if supported), and would tie them to asids if supported.
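
As a concrete picture of what that sharing means, here is a minimal sketch, assuming a simplified x86-64-style layout (512-entry tables; the helper and flag names are hypothetical): both address spaces' top-level tables hold the same physical pointer, so every walk below that entry traverses the same physical PTE pages.

```c
#include <stdint.h>

#define PTE_PRESENT 0x1ULL
#define PTE_WRITE   0x2ULL
/* Deliberately no Global bit: the shared subtree stays per-ASID,
 * as the question proposes. */

typedef uint64_t pte_t;
typedef struct { pte_t entry[512]; } page_table;  /* one 4 KiB table */

/* Stand-in for a real kernel's virtual-to-physical translation. */
static inline uint64_t phys_addr_of(page_table *t)
{
    return (uint64_t)(uintptr_t)t;
}

/* Point the same slot of two top-level tables at one shared subtree. */
static void share_subtree(page_table *top_a, page_table *top_b,
                          int slot, page_table *shared_child)
{
    pte_t e = phys_addr_of(shared_child) | PTE_PRESENT | PTE_WRITE;
    top_a->entry[slot] = e;   /* address space A */
    top_b->entry[slot] = e;   /* address space B: same physical subtree */
}
```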

There are immediate benefits, since the data cache will be less polluted than by a copy, less pinned RAM, etc.

Does the page walk take explicit advantage of this in any known architecture? If so, does that imply the mmu is explicitly caching & sharing interior page tree nodes based on physical tag?

Sorry for the multiple questions; it really is one broken down. I am trying to determine if it is worth devising a measurement test for this.

Answer

On modern x86 CPUs (like Sandybridge-family), page walks fetch through the cache hierarchy (L1d / L2 / L3), so yes, there's an obvious benefit there to having two different page directories point to the same subtree for a shared region of virtual address space. (Or on some AMD, the walker fetches through L2, skipping L1d.)

The question "What happens after a L2 TLB miss?" has more details about the fact that page walks definitely fetch through cache, e.g. Broadwell perf counters exist to measure hits.
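
For the measurement test the question mentions, a rough Linux-only sketch along these lines strides one touch per page across a region much larger than TLB coverage, so most touches cost a TLB miss plus a page walk. Comparing shared vs. duplicated page-table subtrees would need two cooperating processes mapping the same region; this only shows the timing core, and the perf event named in the comment is one plausible counter (event names vary by microarchitecture).

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define PAGES ((size_t)1 << 16)   /* 256 MiB of 4 KiB pages: far beyond TLB reach */
#define PAGE  4096

int main(void)
{
    /* MAP_POPULATE prefaults everything, so the loop below measures
     * TLB misses / page walks, not page faults. */
    unsigned char *buf = mmap(NULL, PAGES * PAGE, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    volatile unsigned char sink = 0;
    for (size_t i = 0; i < PAGES; i++)
        sink += buf[i * PAGE];    /* one touch per page => mostly TLB misses */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per page touch\n", ns / PAGES);
    /* Run under e.g. `perf stat -e dtlb_load_misses.miss_causes_a_walk ./a.out`
     * to count the walks directly (Intel event name; varies by uarch). */
    return 0;
}
```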

(" MMU是CPU内核的一部分; L1dTLB与加载/存储执行单元紧密耦合.页面漫游器是一个非常独立的东西,与指令执行并行运行,但仍然是其中一部分核心,并且可以通过推测方式触发等等.因此它紧密耦合,可以通过L1d缓存访问内存.)

("The MMU" is part of a CPU core; the L1dTLB is tightly coupled to load/store execution units. The page walker is a fairly separate thing, though, and runs in parallel with instruction execution, but is still part of the core and can be triggered speculatively, etc. So it's tightly coupled enough to access memory through L1d cache.)

Higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. Section 3 of this paper confirms that Intel and AMD actually do this in practice, so you need to flush the TLB in cases where you might think you didn't need to.

But I don't think you'd find that PDE caching happening across a change in the top-level page table.

On x86, you install a new page table with a mov to CR3; that implicitly flushes all cached translations and internal page-walker PDE caching, like invlpg does for one virtual address. (Or with ASIDs, it makes TLB entries from different ASIDs unavailable for hits.)
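
For concreteness, these are the two flush primitives as GCC inline asm, the way a hypothetical x86-64 kernel might wrap them (ring 0 only; the instruction sequences are the standard ones, the wrapper names are made up):

```c
#include <stdint.h>

/* Switch address spaces: writing CR3 implicitly flushes non-global TLB
 * entries and the page walker's internal PDE / paging-structure caches.
 * new_root is the physical address of the new top-level page table. */
static inline void load_cr3(uint64_t new_root)
{
    __asm__ volatile("mov %0, %%cr3" : : "r"(new_root) : "memory");
}

/* Flush the cached translation (and relevant paging-structure-cache
 * entries) for a single virtual address. */
static inline void flush_one(void *vaddr)
{
    __asm__ volatile("invlpg (%0)" : : "r"(vaddr) : "memory");
}
```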

The main issue is that the TLB and the page walker's internal caches are not coherent with main memory / data caches. I think all ISAs that do HW page walks at all require manual flushing of TLBs, with x86-like semantics for installing a new page table. (Some ISAs like MIPS only do software TLB management, invoking a special kernel TLB-miss handler; your question won't apply there.)

So yes, they could detect the same physical address, but for sanity you'd also have to avoid using stale cached data after a store to that physical address.

Without hardware-managed coherence between page-table stores and the TLB / page-walker caches, there's no way this caching could happen safely.

That said, some x86 CPUs do go beyond what's on paper and provide limited coherency with stores, but only to protect you from speculative page walks, for backwards compatibility with OSes that assumed a valid but not-yet-used PTE could be modified without invlpg. http://blog.stuffedcow.net/2015/08/pagewalk-coherence/

So it's not unheard of for microarchitectures to snoop stores to certain address ranges; you could plausibly have stores snoop the address ranges near locations the page walker has internally cached, effectively providing coherence for the internal page-walker caches.

Modern x86 does in practice detect self-modifying code, by snooping for stores near any in-flight instructions (see Observing stale instruction fetching on x86 with self-modifying code). In that case, snoop hits are handled by nuking the whole back-end state back to retirement state.

So it's plausible that you could in theory design a CPU with an efficient mechanism to be able to take advantage of this transparently, but it has significant cost (snooping every store against a CAM to check for matches on page-walker-cached addresses) for very low benefit. Unless I'm missing something, I don't think there's an easier way to do this, so I'd bet money that no real designs actually do this.

Hard to imagine outside of x86; almost everything else takes a "weaker" / "fewer guarantees" approach and would only snoop the store buffer (for store-forwarding). CAMs (content-addressable-memory = hardware hash table) are power-hungry, and handling the special case of a hit would complicate the pipeline. Especially an OoO exec pipeline where the store to a PTE might not have its store-address ready until after a load wanted to use that TLB entry. Introducing more pipeline nukes is a bad thing.

After the first page walk fetches data from L1d cache (or farther away if it wasn't hot in L1d either), the usual cache-within-page-walker mechanisms can act normally.

So further page walks for nearby pages before the next context switch can benefit from page-walker internal caches. This has benefits, and is what some real HW does (at least some x86; IDK about others).

All the argument above about why this would require snooping for coherent page tables is about having the page-walker internal caches stay hot across a context switch.

L1d can easily do that; VIPT caches that behave like PIPT (no aliasing) simply cache based on physical address and don't need flushing on context switch.

If you're context-switching very frequently, ASIDs let the TLB entries proper stay cached. If you're still getting a lot of TLB misses, the worst case is that they have to fetch through cache all the way from the top. That's really not bad, and very much not worth spending a lot of transistors and power budget on.
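
On x86 the ASID mechanism is called PCIDs; as a sketch of how that retention works (following the documented CR3 layout when CR4.PCIDE is set; the wrapper name is hypothetical): the low 12 bits of the value written to CR3 select the context ID, and setting bit 63 tells the CPU not to flush that PCID's cached translations.

```c
#include <stdint.h>

#define CR3_NOFLUSH (1ULL << 63)  /* keep the new PCID's TLB entries */
#define PCID_MASK   0xFFFULL      /* CR3 bits 11:0 = PCID when CR4.PCIDE=1 */

static inline void switch_address_space(uint64_t root_phys, uint64_t pcid)
{
    /* TLB entries tagged with `pcid` survive this switch, so a process
     * that runs again soon still finds its translations cached. */
    uint64_t v = (root_phys & ~PCID_MASK) | (pcid & PCID_MASK) | CR3_NOFLUSH;
    __asm__ volatile("mov %0, %%cr3" : : "r"(v) : "memory");
}
```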

I'm only considering OS on bare metal, not HW virtualization with nested page tables. (Hypervisor virtualizing the guest OS's page tables). I think all the same arguments basically apply, though. Page walk still definitely fetches through cache.
