When to use a write-through cache policy for pages

Problem description

I was reading the MDS attack paper RIDL: Rogue In-Flight Data Load. They set pages as write-back, write-through, write-combining or uncacheable, and with different experiments determine that the Line Fill Buffer is the cause of the micro-architectural leaks.

On a tangent: I was aware that memory can be uncacheable, but I assumed that cacheable data was always cached in a write-back cache, i.e. I assumed that the L1, L2 and LLC were always write-back caches.

I read about the difference between write-back and write-through caches in my Computer Architecture book. It says:

Write-through caches are simpler to implement and can use a write buffer that works independently of the cache to update memory. Furthermore, read misses are less expensive because they do not trigger a memory write. On the other hand, write-back caches result in fewer transfers, which allows more bandwidth to memory for I/O devices that perform DMA. Further, reducing the number of transfers becomes increasingly important as we move down the hierarchy and the transfer times increase. In general, caches further down the hierarchy are more likely to use write-back than write-through.

So a write-through cache is simpler to implement. I can see how that can be an advantage. But if the caching policy is settable by the page table attributes then there can't be an implementation advantage - every cache needs to be able to work in either write-back or write-through.

  1. Can every cache (L1, L2, LLC) work in either write-back or write-through mode? So if the page attribute is set to write-through, then they all will be write-through?
  2. Write combining is useful for GPU memory; uncacheable is good when accessing hardware registers. When should a page be set to write-through? What are the advantages of that?
  3. Are there any write-through caches (if it really is a property of the hardware and not just something that is controlled by the pagetable attributes) or is the trend that all caches are created as write-back to reduce traffic?

Recommended answer

Can every cache (L1, L2, LLC) work in either write-back or write-through mode?

In most x86 microarchitectures, yes, all the data / unified caches are (capable of) write-back and used in that mode for all normal DRAM. Which cache mapping technique is used in intel core i7 processor? has some details and links. Unless otherwise specified, the default assumption by anyone talking about x86 is that DRAM pages will be WB.
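
To see this for yourself on Linux, one quick sanity check is to dump the MTRR ranges the firmware configured; system RAM normally shows up as write-back there. A minimal sketch (my addition, not from the original answer; it assumes an x86 Linux system that exposes /proc/mtrr, and note that MTRRs describe physical ranges, while the per-page type also depends on the PAT bits in the page tables):

    /* Dump /proc/mtrr: system-RAM ranges are normally "write-back",
     * which is why WB is the default assumption for DRAM pages.
     * Illustrative only; this is not a way to query a page's PAT type. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/mtrr", "r");
        if (!f) {
            perror("fopen /proc/mtrr");   /* may need extra privileges on some systems */
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);   /* e.g. "reg00: base=0x... size=... write-back" */
        fclose(f);
        return 0;
    }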

AMD Bulldozer made the unconventional choice to use a write-through L1d with a small 4k write-combining buffer between it and L2 (https://www.realworldtech.com/bulldozer/8/). This has many disadvantages and is, I think, widely regarded (in hindsight) as one of several weaknesses or even design mistakes of the Bulldozer family (which AMD fixed for Zen). Note also that Bulldozer was an experiment in CMT instead of SMT (two weak integer cores sharing an FPU/SIMD unit, each with separate L1d caches sharing an L2 cache); https://www.realworldtech.com/bulldozer/3/ shows the system architecture.

But of course Bulldozer L2 and L3 caches were still WB, the architects weren't insane. WB caching is essential to reduce bandwidth demands for shared LLC and memory. And even the write-through L1d needed a write-combining buffer to allow L2 cache to be larger and slower, thus serving its purpose of sometimes hitting when L1d misses. See also Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

Write-through caching can simplify a design (especially of a single-core system), but generally CPUs moved beyond that decades ago. (Write-back vs Write-Through caching?). IIRC, some non-CPU workloads sometimes benefit from write-through caching, especially without write-allocate so writes don't pollute cache. x86 has NT stores to avoid that problem.
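
To make that concrete, here is a minimal sketch (my illustration, not code from the paper or the original answer) of x86 NT stores via SSE2 intrinsics; they go through write-combining buffers instead of allocating lines in the cache, which is the cache-pollution avoidance being described:

    /* Fill a buffer with non-temporal (streaming) stores.
     * Assumes SSE2 and a 16-byte-aligned, 16-byte-multiple buffer. */
    #include <emmintrin.h>   /* _mm_set1_epi8, _mm_stream_si128; also pulls in _mm_sfence */
    #include <stdint.h>
    #include <stdlib.h>

    static void fill_nt(void *dst, uint8_t value, size_t bytes) {
        __m128i v = _mm_set1_epi8((char)value);
        __m128i *p = (__m128i *)dst;
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(&p[i], v);   /* NT store: no cache-line allocation */
        _mm_sfence();                     /* order NT stores before later memory ops */
    }

    int main(void) {
        size_t size = 1 << 20;
        void *buf = aligned_alloc(16, size);
        if (!buf) return 1;
        fill_nt(buf, 0xAB, size);
        free(buf);
        return 0;
    }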

So if the page attribute is set to write-through, then they all will be write-through?

Yes, every store has to go all the way to DRAM in a page that's marked WT.

The caches are optimized for WB because that's what everyone uses, but apparently do support passing on the line to outer caches without evicting from L1d. (So WT doesn't turn stores into something like movntps cache-bypassing / evicting stores.)

When should a page be set to write-through? What are the advantages to that?

Basically never; (nearly?) all CPU workloads do best with WB memory.

OSes don't even bother to make it easy (or possible?) for user-space to allocate WC or WT DRAM pages. (Although that certainly doesn't prove they're never useful.) e.g. on CPU cache inhibition, I found a link about a Linux patch that never made it into the mainline kernel that added the possibility of mapping a page WT.
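
For completeness: inside the kernel, a physical range can be mapped write-through with ioremap_wt(), which is intended for device-style memory rather than for handing WT DRAM pages to user space. A hypothetical module sketch (my illustration, assuming an x86 kernel build where ioremap_wt() is available; the physical address is a made-up placeholder):

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/io.h>

    static void __iomem *wt_mapping;

    static int __init wt_demo_init(void)
    {
        /* 0xfd000000 is a placeholder physical address, purely illustrative. */
        wt_mapping = ioremap_wt(0xfd000000, 4096);   /* one 4 KiB page, mapped WT */
        if (!wt_mapping)
            return -ENOMEM;
        pr_info("wt_demo: mapped one page write-through\n");
        return 0;
    }

    static void __exit wt_demo_exit(void)
    {
        iounmap(wt_mapping);
    }

    module_init(wt_demo_init);
    module_exit(wt_demo_exit);
    MODULE_LICENSE("GPL");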

WB, WC, and UC are common for normal DRAM, device memory (especially GPU), and MMIO respectively.

I have seen at least one paper that benchmarked WT vs. WB vs. UC vs. WC for some workload (googled but didn't find it, sorry). And people testing obscure x86 stuff will sometimes include it for completeness. e.g. The Microarchitecture Behind Meltdown is a good article in general (and related to what you're reading up on).

One of the few advantages of WT is that stores end up in L3 promptly where loads from other cores can hit. This may possibly be worth the extra cost for every store to that page, especially if you're careful to manually combine your writes into one large 32-byte AVX store. (Or 64-byte AVX512 full-line write.) And of course only use that page for shared data.
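
A rough sketch of what "manually combine your writes" could look like (my illustration; it assumes AVX and a 64-byte-aligned destination; with AVX-512 a single 64-byte store would cover the whole line in one instruction):

    /* Write a full 64-byte cache line with two aligned 32-byte AVX stores,
     * so a WT page pays for one line's worth of write-through traffic
     * instead of many small partial writes.  Compile with AVX enabled. */
    #include <immintrin.h>
    #include <stdlib.h>

    static void write_line(void *line, const void *src64) {
        __m256i lo = _mm256_loadu_si256((const __m256i *)src64);
        __m256i hi = _mm256_loadu_si256((const __m256i *)((const char *)src64 + 32));
        _mm256_store_si256((__m256i *)line, lo);                 /* bytes 0..31  */
        _mm256_store_si256((__m256i *)((char *)line + 32), hi);  /* bytes 32..63 */
    }

    int main(void) {
        char src[64] = "hello, shared line";
        void *line = aligned_alloc(64, 64);   /* one 64-byte-aligned cache line */
        if (!line) return 1;
        write_line(line, src);
        free(line);
        return 0;
    }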

I haven't seen anyone ever recommend doing this, though, and it's not something I've tried. Probably because the extra DRAM bandwidth for writing through L3 as well isn't worth the benefit for most use-cases. But probably also because you might have to write a kernel module to get a page mapped that way.
