Ordering of Intel non-temporal stores to the same cache line

Question

Do non-temporal stores (such as movnti) to the same cache line, issued by the same thread, reach memory in program order? In other words, on a system with NVRAM (such as an Intel Cascade Lake processor with Intel 3D XPoint NVRAM), does the lack of reordering guarantee that, after a crash, a prefix of the writes to the same cache line is what persists?

Answer

Assuming that the resolved memory type of the non-temporal stores is WC (or WC+), which is what I think you're asking about, the answer is mostly no, on both Intel and AMD processors.

For Intel processors, certain statements in Section 11.3.1 of the Intel SDM Volume 3 specify the behavior of write-combining writes on microarchitectures with at least one WC buffer.

"The protocol for evicting the WC buffers is implementation dependent and should not be relied on by software for system memory coherency."

This is a general statement: both the causes of WC evictions and the transactions performed to evict a WC buffer are implementation-dependent. However, the manual makes more specific statements in other places.

"Likewise [as on P6], for more recent processors starting with those based on Intel NetBurst microarchitectures, a full WC buffer will always be propagated as a single burst transaction, using any chunk order within a transaction."

If all the bytes in the same WC buffer are valid, meaning that each byte has been written at least once since the buffer was allocated, then whenever the buffer is evicted, for any reason, the entire cache line in the buffer is evicted using a single transaction. If the target of the buffer is a memory controller, which is the first unit in the persistence domain on CLX, either all the bytes of the transaction are persisted or none of them are. This implies that the program order of the write instructions that have written into that line is maintained. The ordering between these particular writes and other writes is discussed later.

The "using any chunk order within a transaction" part in this context is not important from the perspective of software when the target of the transaction is a memory controller, but is important for other targets.

Intel specifies the chunk size as 8 bytes, 8-byte aligned, on all microarchitectures. This chunk size applies only on the core and uncore interconnects, not beyond them, where other protocols are implemented. For writes targeting an IMC, however, persist atomicity is guaranteed at the granularity of a transaction, which may contain anywhere from 1 to 64 bytes (the size of a WC buffer on all modern Intel and AMD processors is 64 bytes), depending on the distribution of valid bytes within the WC buffer at the time it is evicted and on the exact eviction protocol. On Intel processors, the transaction is guaranteed to contain all 64 valid bytes in the case of a full WC buffer eviction.

The AMD manual only says that a full WC buffer eviction can be performed as a single transaction.
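
As a minimal sketch of the full-buffer case described above, the following C function overwrites every byte of one 64-byte line with eight 8-byte non-temporal stores and then issues SFENCE. The name nt_write_full_line is hypothetical and does not come from the manuals; _mm_stream_si64 is the SSE2 intrinsic that compiles to MOVNTI. The non-x86 branch is only a portability fallback and is not a non-temporal store.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_stream_si64 (MOVNTI), _mm_sfence */
#endif

/* Hypothetical helper: write all 64 bytes of one cache line with
 * eight 8-byte non-temporal stores, then fence. */
void nt_write_full_line(long long *line, const long long src[8])
{
    /* The destination must start on a 64-byte boundary so that the
     * eight stores cover exactly one line (one WC buffer). */
    assert(((uintptr_t)line & 63) == 0);

    for (int i = 0; i < 8; i++) {
#if defined(__x86_64__)
        _mm_stream_si64(&line[i], src[i]);   /* MOVNTI */
#else
        line[i] = src[i];   /* fallback: ordinary store, NOT non-temporal */
#endif
    }
#if defined(__x86_64__)
    _mm_sfence();   /* store fence after the non-temporal stores */
#endif
}
```

Writing all 64 bytes only makes a full-buffer eviction possible. It does not give software control over when the buffer is actually evicted, so this sketch on its own should not be read as a persist-atomicity guarantee.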

The following quote specifies the ordering guarantees in the case of partial WC buffer evictions (where not all bytes in the buffer are marked as valid) and the ordering between writes in different WC buffers. It applies to both Intel and AMD processors.

"Once the eviction of a WC buffer has started, the data is subject to the weak ordering semantics of its definition."

The rest of the paragraph proceeds to elaborate. A partial WC buffer can be evicted using one or more transactions, and there are no ordering guarantees between these transactions. Once a write instruction is committed to a WC buffer, its position in program order is completely lost. If the target of these transactions is an IMC, persist atomicity is only provided at the granularity of a single transaction. That's how a write with an effective memory type of WC can persist without an earlier WC write having persisted. If different write instructions partially overlap within the same WC buffer, a write instruction can become partially persistent, out of order with respect to other writes in the same WC buffer. A write operation in a WC buffer that crosses a chunk boundary is not architecturally guaranteed to be atomic, unless the buffer is entirely full after combining the write (on Intel processors).

WC buffers can be evicted in an order that is different from the buffer allocation order. Fence instructions cannot be used to selectively flush WC buffers. However, a write of any type other than WC that overlaps an allocated WC buffer causes that particular buffer to be evicted before the write is performed. A load that hits in a WC buffer does not necessarily cause the buffer to be evicted.

The transactions that occur to flush a single WC buffer are not necessarily ordered with respect to the transactions that occur to flush another WC buffer in the same physical core. Even if the WC eviction logic is implemented such that WC buffers are evicted serially, which is likely, there is no guarantee that transactions from different WC buffers won't end up being interleaved outside the physical core domain.

This all means that persist ordering is not guaranteed between different chunks of the same WC buffer, or between different WC buffers, even in the same physical core.
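
To make the consequence concrete, here is a toy model in C, not a description of any real eviction protocol. It assumes, purely for illustration, that a partial eviction issues one transaction per aligned 8-byte chunk and that a crash can hit between any two transactions. The function names persisted_after and is_program_order_prefix are hypothetical. Bit i of a mask stands for chunk i of one 64-byte WC buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: which chunks are persistent if a crash hits after `done`
 * of the `n` eviction transactions, issued in the order given by
 * evict_order (one chunk per transaction, an assumption of this model). */
uint8_t persisted_after(const int *evict_order, int n, int done)
{
    uint8_t mask = 0;
    for (int i = 0; i < n && i < done; i++)
        mask |= (uint8_t)(1u << evict_order[i]);
    return mask;
}

/* True if `mask` covers exactly the first k stores of program order
 * for some k in 0..n, i.e. the persisted state is a prefix. */
bool is_program_order_prefix(uint8_t mask, const int *program_order, int n)
{
    uint8_t prefix = 0;
    if (mask == 0)
        return true;    /* nothing persisted: the empty prefix */
    for (int k = 0; k < n; k++) {
        prefix |= (uint8_t)(1u << program_order[k]);
        if (prefix == mask)
            return true;
    }
    return false;
}
```

For example, if a program stores to chunks 0, 1 and 2 in that order, but the eviction transactions happen to be issued for chunks 2, 0 and 1, a crash after the first transaction leaves only chunk 2 persistent, which is not a prefix of program order.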

The events that cause a WC buffer to be evicted may differ between vendors and between processors from the same vendor. Some events are architectural (documented in the developer manuals) while others are implementation-specific (documented in the datasheets). Store serializing instructions are an example of a synchronous event that does guarantee flushing all WC buffers on the same logical core. A hardware interrupt delivered to a logical core is an example of an asynchronous event that also causes all of its WC buffers to be evicted. Moreover, the number of WC buffers per physical or logical core is implementation-dependent and could be zero. The size of a WC buffer is also implementation-dependent and could be, architecturally speaking, larger or smaller than the size of an L1D cache line. WC buffers could also be used for purposes other than combining WC writes, depending on the microarchitecture.

Therefore, even if you only ever write full WC buffers, you cannot ensure that a WC buffer is evicted only once it has become full, so persist atomicity cannot be relied upon this way, even on Intel processors where a full WC buffer eviction is performed using a single transaction.

Instead of performing multiple WC write instructions, you can use MOVDIR64B, which guarantees atomicity. MOVDIR64B doesn't allocate a WC buffer and goes directly to the destination, but it may be combined with an already allocated WC buffer, in which case the buffer is evicted immediately after the existing contents of the buffer are combined with the MOVDIR64B data. In any case, the write operation of MOVDIR64B is always performed as a single transaction. Note that the destination memory operand of MOVDIR64B is required to be aligned on a 64-byte boundary. Similar to a traditional WC store, MOVDIR64B is weakly ordered with respect to all other stores, except UC stores. MOVDIR64B is supported on Tremont (TNT), Tiger Lake (TGL), and Sapphire Rapids (SPR).
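
The following is a hedged sketch of how MOVDIR64B might be used from C with GCC or Clang. The helper names cpu_has_movdir64b, movdir64b_store and store64 are hypothetical. The sketch assumes a toolchain that provides the _movdir64b intrinsic in immintrin.h and accepts target("movdir64b"). Support is detected through CPUID.(EAX=07H,ECX=0):ECX bit 28. The memcpy fallback is there only so the code runs on other hardware and is not atomic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <cpuid.h>       /* __get_cpuid_count (GCC/Clang) */
#include <immintrin.h>   /* _movdir64b, _mm_sfence */
#endif

/* Returns true if the CPU reports MOVDIR64B support. */
bool cpu_has_movdir64b(void)
{
#if defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ecx >> 28) & 1u) != 0;
#else
    return false;
#endif
}

#if defined(__x86_64__)
/* Compiled with the movdir64b feature enabled for this function only. */
__attribute__((target("movdir64b")))
static void movdir64b_store(void *dst, const void *src)
{
    _movdir64b(dst, src);   /* one 64-byte write, single transaction */
    _mm_sfence();           /* order it before later stores */
}
#endif

/* Copies 64 bytes to a 64-byte-aligned destination. Returns true if
 * MOVDIR64B was used, false if the non-atomic fallback was taken. */
bool store64(void *dst, const void *src)
{
    assert(((uintptr_t)dst & 63) == 0);   /* required by MOVDIR64B */
#if defined(__x86_64__)
    if (cpu_has_movdir64b()) {
        movdir64b_store(dst, src);
        return true;
    }
#endif
    memcpy(dst, src, 64);   /* fallback: NOT atomic, illustration only */
    return false;
}
```

A caller that depends on persist atomicity would need to treat a false return value as "no guarantee" rather than silently accepting the fallback.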

A WC/WC+ write is not ordered with respect to other writes of any memory type except UC, on both Intel and AMD processors. In addition, a single write instruction (or an instruction that writes to the physical memory address space) of any memory type that crosses an aligned 8-byte boundary is itself not guaranteed to be atomic at a granularity beyond aligned 8 bytes. This includes persist atomicity. The only exceptions are MOVDIR64B, ENQCMD, and ENQCMDS. The last two are relevant when doing MMIO writes. Aligned 64-byte AVX-512 stores are likely to be persistently atomic, but this is not architecturally guaranteed and should not be relied upon.
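
One way to stay within the aligned 8-byte granularity is to pack everything that must persist together into a single 64-bit word and write it with one store. The sketch below illustrates that idea and is not something the answer above prescribes. The 56-bit payload plus 8-bit tag layout and the names pack_entry, entry_payload, entry_tag and publish_entry are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_stream_si64 (MOVNTI), _mm_sfence */
#endif

#define PAYLOAD_BITS 56
#define PAYLOAD_MASK ((1ULL << PAYLOAD_BITS) - 1)

/* Pack a payload of up to 56 bits and an 8-bit tag into one word. */
unsigned long long pack_entry(unsigned long long payload, uint8_t tag)
{
    assert(payload <= PAYLOAD_MASK);
    return ((unsigned long long)tag << PAYLOAD_BITS) | payload;
}

unsigned long long entry_payload(unsigned long long e)
{
    return e & PAYLOAD_MASK;
}

uint8_t entry_tag(unsigned long long e)
{
    return (uint8_t)(e >> PAYLOAD_BITS);
}

/* Write the packed entry with a single 8-byte non-temporal store to
 * an 8-byte-aligned slot, so payload and tag cannot be torn apart. */
void publish_entry(unsigned long long *slot, unsigned long long entry)
{
    assert(((uintptr_t)slot & 7) == 0);
#if defined(__x86_64__)
    _mm_stream_si64((long long *)slot, (long long)entry);   /* MOVNTI */
    _mm_sfence();
#else
    *slot = entry;   /* fallback: ordinary store, NOT non-temporal */
#endif
}
```

With this layout a reader after a crash sees either the old word or the new one. Ordering relative to writes in other words or lines still has to be enforced separately.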
