What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?

Problem description

This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs.

If I write to memory, the MESI protocol requires that the cache line is first read into cache, then modified in the cache (the value is written to the cache line, which is then marked dirty). In older write-through micro-architectures, this would then trigger the cache line being flushed; under write-back, the flush of the cache line can be delayed for some time, and some write combining can occur under both mechanisms (more likely with write-back). And I know how this interacts with other cores accessing the same cache line of data - cache snooping etc.

My question is, if the store matches precisely the value already in the cache, if not a single bit is flipped, does any Intel micro-architecture notice this and NOT mark the line as dirty, and thereby possibly save the line from being marked as exclusive, and the writeback memory overhead that would at some point follow?

As I vectorise more of my loops, my vectorised-operations compositional primitives don't explicitly check for values changing, and to do so in the CPU/ALU seems wasteful, but I was wondering if the underlying cache circuitry could do it without explicit coding (eg the store micro-op or the cache logic itself). As shared memory bandwidth across multiple cores becomes more of a resource bottleneck, this would seem like an increasingly useful optimisation (eg repeated zero-ing of the same memory buffer - we don't re-read the values from RAM if they're already in cache, but to force a writeback of the same values seems wasteful). Writeback caching is itself an acknowledgement of this sort of issue.
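Pending such hardware support, the check can only be done in software. As a rough sketch (the function name and structure are hypothetical, not from any library), this is what skipping redundant stores during a buffer zeroing looks like: the read is essentially free because the line has to be loaded anyway, but the compare-and-branch is exactly the ALU work the question calls wasteful, and it usually only pays off when most of the data is already zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Zero a buffer, but skip words that are already zero.
 * Returns the number of words actually stored, so the caller can
 * see how many stores (and hence dirtied cache lines) were avoided.
 * Hypothetical illustration only: the branch per word costs ALU
 * work that an unconditional memset does not pay. */
size_t zero_skipping_clean_words(uint64_t *buf, size_t n)
{
    size_t stores = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] != 0) {   /* read first: the line is loaded either way */
            buf[i] = 0;      /* only this store can dirty the cache line */
            stores++;
        }
    }
    return stores;
}
```

On an already-zero buffer this performs no stores at all, so no line is marked dirty and nothing needs writing back; a hardware version of the same comparison is exactly what the question asks about.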

Can I politely request holding back on "in theory" or "it really doesn't matter" answers - I know how the memory model works, what I'm looking for is hard facts about how writing the same value (as opposed to avoiding a store) will affect the contention for the memory bus on what you may safely assume is a machine running multiple workloads that are nearly always bound by memory bandwidth. On the other hand an explanation of precise reasons why chips don't do this (I'm pessimistically assuming they don't) would be enlightening...

Update: Some answers along the expected lines here https://softwareengineering.stackexchange.com/questions/302705/are-there-cpus-that-perform-this-possible-l1-cache-write-optimization but still an awful lot of speculation "it must be hard because it isn't done" and saying how doing this in the main CPU core would be expensive (but I still wonder why it can't be a part of the actual cache logic itself).

Update (2020): Travis Downs has found evidence of Hardware Store Elimination but only, it seems, for zeros and only where the data misses L1 and L2, and even then, not in all cases. His article is highly recommended as it goes into much more detail.... https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html
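The effect Downs measured can be probed with an ordinary fill benchmark. The sketch below is a simplified, hypothetical version of that idea (the function name, buffer sizes, repetition counts and the use of `clock()` are assumptions, not his harness): on a core where the zero optimisation is active, an all-zeros fill of a buffer that spills past L2 should show higher effective bandwidth than an all-ones fill of the same buffer; where the optimisation is absent or disabled, the two times should be close.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time `reps` fills of an n-byte buffer with `value`; returns seconds,
 * or a negative value on allocation failure.  Coarse by design:
 * clock() resolution and frequency scaling make this indicative only. */
double time_fill_seconds(size_t n, int value, int reps)
{
    uint8_t *buf = malloc(n);
    if (!buf)
        return -1.0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        memset(buf, value, n);    /* all-zeros vs all-ones is the contrast */
    clock_t t1 = clock();
    free(buf);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

A usage pattern would be comparing `time_fill_seconds(8u << 20, 0, 100)` against `time_fill_seconds(8u << 20, 1, 100)` (8 MiB, past a typical L2), and repeating at an L1-resident size where Downs saw no elimination.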

Update (2021): Travis Downs has now found evidence that this zero store optimisation has recently been disabled in microcode... more detail as ever from the source himself https://travisdowns.github.io/blog/2021/06/17/rip-zero-opt.html

Recommended answer

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores.

There has been academic research on this and there is even a patent on "eliminating silent store invalidation propagation in shared memory cache coherency protocols". (Google '"silent store" cache' if you are interested in more.)

For x86, this would interfere with MONITOR/MWAIT; some users might want the monitoring thread to wake on a silent store (one could avoid invalidation and add a "touched" coherence message). (Currently MONITOR/MWAIT is privileged, but that might change in the future.)

Similarly, such an optimization could interfere with some clever uses of transactional memory, e.g. if a memory location is used as a guard to avoid explicitly loading other memory locations or, in an architecture that supports it (as in AMD's Advanced Synchronization Facility), dropping the guarded memory locations from the read set.

(Hardware Lock Elision is a very constrained implementation of silent ABA store elimination. It has the implementation advantage that the check for value consistency is explicitly requested.)

There are also implementation issues in terms of performance impact/design complexity. Such would prohibit avoiding read-for-ownership (unless the silent store elimination was only active when the cache line was already present in shared state), though read-for-ownership avoidance is also currently not implemented.

Special handling for silent stores would also complicate implementation of a memory consistency model (probably especially x86's relatively strong model). Such might also increase the frequency of rollbacks on speculation that failed consistency. If silent stores were only supported for L1-present lines, the time window would be very small and rollbacks extremely rare; stores to cache lines in L3 or memory might increase the frequency to very rare, which might make it a noticeable issue.

Silence at cache line granularity is also less common than silence at the access level, so the number of invalidations avoided would be smaller.

The additional cache bandwidth would also be an issue. Currently Intel uses parity only on L1 caches to avoid the need for read-modify-write on small writes. Requiring every write to be preceded by a read in order to detect silent stores would have obvious performance and power implications. (Such reads could be limited to shared cache lines and be performed opportunistically, exploiting cycles without full cache-access utilization, but that would still have a power cost.) This also means that the cost would fall away if read-modify-write support was already present to support ECC on L1 (a feature that would please some users).
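As a toy illustration of that read-compare-write (all names and types here are invented for illustration; real cache control logic is hardware, not C), eliminating a silent store amounts to comparing the incoming data against the line's current contents before deciding whether to set the dirty bit:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a 64-byte x86 cache line with a dirty bit. */
typedef struct {
    uint8_t data[64];
    bool dirty;
} cache_line_t;

/* Store `len` bytes at `offset`, but first read and compare the old
 * contents - that extra read per write is exactly the bandwidth and
 * power cost discussed above.  Returns true if the line actually
 * changed (and was marked dirty); a silent store leaves the dirty
 * bit, and hence any writeback obligation, untouched. */
bool store_with_silent_elimination(cache_line_t *line, size_t offset,
                                   const uint8_t *src, size_t len)
{
    if (memcmp(line->data + offset, src, len) == 0)
        return false;                 /* silent store: no state change */
    memcpy(line->data + offset, src, len);
    line->dirty = true;
    return true;
}
```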

I am not well-read on silent store elimination, so there are probably other issues (and workarounds).

With much of the low-hanging fruit for performance improvement having been taken, more difficult, less beneficial, and less general optimizations become more attractive. Since silent store optimization becomes more important with higher inter-core communication and inter-core communication will increase as more cores are utilized to work on a single task, the value of such seems likely to increase.
