What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?


Question

This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs.

If I write to memory, the MESI protocol requires that the cache line is first read into cache, then modified in the cache (the value is written to the cache line, which is then marked dirty). In older write-through micro-architectures, this would then trigger the cache line being flushed; under write-back, the flush of the cache line can be delayed for some time, and some write combining can occur under both mechanisms (more likely with write-back). And I know how this interacts with other cores accessing the same cache line of data - cache snooping etc.

My question is, if the store matches precisely the value already in the cache, if not a single bit is flipped, does any Intel micro-architecture notice this and NOT mark the line as dirty, and thereby possibly save the line from being marked as exclusive, and the writeback memory overhead that would at some point follow?

As I vectorise more of my loops, my vectorised-operations compositional primitives don't explicitly check for values changing, and to do so in the CPU/ALU seems wasteful, but I was wondering if the underlying cache circuitry could do it without explicit coding (eg the store micro-op or the cache logic itself). As shared memory bandwidth across multiple cores becomes more of a resource bottleneck, this would seem like an increasingly useful optimisation (eg repeated zero-ing of the same memory buffer - we don't re-read the values from RAM if they're already in cache, but to force a writeback of the same values seems wasteful). Writeback caching is itself an acknowledgement of this sort of issue.

Can I politely request holding back on "in theory" or "it really doesn't matter" answers - I know how the memory model works, what I'm looking for is hard facts about how writing the same value (as opposed to avoiding a store) will affect the contention for the memory bus on what you may safely assume is a machine running multiple workloads that are nearly always bound by memory bandwidth. On the other hand an explanation of precise reasons why chips don't do this (I'm pessimistically assuming they don't) would be enlightening...

Update: Some answers along the expected lines here https://softwareengineering.stackexchange.com/questions/302705/are-there-cpus-that-perform-this-possible-l1-cache-write-optimization but still an awful lot of speculation "it must be hard because it isn't done" and saying how doing this in the main CPU core would be expensive (but I still wonder why it can't be a part of the actual cache logic itself).

Update (2020): Travis Downs has found evidence of Hardware Store Elimination but only, it seems, for zeros and only where the data misses L1 and L2, and even then, not in all cases. His article is highly recommended as it goes into much more detail.... https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html

Answer

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores.

There has been academic research on this and there is even a patent on "eliminating silent store invalidation propagation in shared memory cache coherency protocols". (Googling '"silent store" cache' if you are interested in more.)

For x86, this would interfere with MONITOR/MWAIT; some users might want the monitoring thread to wake on a silent store (one could avoid invalidation and add a "touched" coherence message). (Currently MONITOR/MWAIT is privileged, but that might change in the future.)

Similarly, such could interfere with some clever uses of transactional memory: if a memory location is used as a guard to avoid explicit loading of other memory locations, or, in an architecture that supports it (as in AMD's Advanced Synchronization Facility), to drop the guarded memory locations from the read set.

(Hardware Lock Elision is a very constrained implementation of silent ABA store elimination. It has the implementation advantage that the check for value consistency is explicitly requested.)

There are also implementation issues in terms of performance impact/design complexity. Such would prohibit avoiding read-for-ownership (unless the silent store elimination was only active when the cache line was already present in shared state), though read-for-ownership avoidance is also currently not implemented.

Special handling for silent stores would also complicate implementation of a memory consistency model (probably especially x86's relatively strong model). Such might also increase the frequency of rollbacks on speculation that failed consistency. If silent stores were only supported for L1-present lines, the time window would be very small and rollbacks extremely rare; stores to cache lines in L3 or memory might increase the frequency to very rare, which might make it a noticeable issue.

Silence at cache line granularity is also less common than silence at the access level, so the number of invalidations avoided would be smaller.

The additional cache bandwidth would also be an issue. Currently Intel uses parity only on L1 caches to avoid the need for read-modify-write on small writes. Requiring every write to be preceded by a read in order to detect silent stores would have obvious performance and power implications. (Such reads could be limited to shared cache lines and performed opportunistically, exploiting cycles without full cache access utilization, but that would still have a power cost.) This also means that the cost would largely disappear if read-modify-write support were already present for L1 ECC (a feature that would please some users).

I am not well-read on silent store elimination, so there are probably other issues (and workarounds).

With much of the low-hanging fruit for performance improvement having been taken, more difficult, less beneficial, and less general optimizations become more attractive. Since silent store optimization becomes more important with higher inter-core communication and inter-core communication will increase as more cores are utilized to work on a single task, the value of such seems likely to increase.

