(Persistence) ordering of Intel non-temporal stores to the same cache line

Views: 26 | Published: 2021/9/29 19:36:57 | Tags: x86 cpu-architecture memory-barriers persistent-memory


Question

Do non-temporal stores (such as movnti) to the same cache line, issued by the same thread, reach memory in program order?

In other words, on a system with NVRAM (such as an Intel Cascade Lake processor with Intel 3D XPoint NVRAM), does the lack of reordering guarantee that, after a crash, a prefix of the writes to the same cache line survives?

Recommended answer

Assuming that the resolved memory type of the non-temporal stores is WC (or WC+), which is what I think you're asking about, the answer is mostly no on both Intel and AMD processors.

For Intel processors, certain statements from Section 11.3.1 of the Intel SDM Volume 3 specify the behavior of write-combining writes on microarchitectures with at least one WC buffer.

"The protocol for evicting the WC buffers is implementation dependent and should not be relied on by software for system memory coherency."

This is a general statement saying that the causes of WC evictions, and the transactions performed to evict a WC buffer, are implementation-dependent. But there are more specific statements in different places in the manual.

"Likewise [like on P6], for more recent processors starting with those based on Intel NetBurst microarchitectures, a full WC buffer will always be propagated as a single burst transaction, using any chunk order within a transaction."

If all the bytes in the same WC buffer are valid, meaning that each byte has been written at least once since the buffer was allocated, then when the buffer is evicted for any reason, the entire cache line in the buffer is evicted using a single transaction. If the target of the buffer is a memory controller, which is the first unit in the persistence domain on Cascade Lake (CLX), either all the bytes of the transaction are persisted or none of them are. This implies that the program order of the write instructions that have written into that line is maintained. The ordering between these particular writes and other writes is discussed later.

In this context, the "using any chunk order within a transaction" part is not important from the perspective of software when the target of the transaction is a memory controller, but it is important for other targets.

Intel has specified the chunk size to be 8 bytes, aligned, on all microarchitectures. This chunk size applies only on the core and uncore interconnects, not beyond them, where other protocols are implemented. With respect to writes targeting an IMC (integrated memory controller), however, persist atomicity is guaranteed at the granularity of a transaction, which may contain anywhere from 1 to 64 bytes (a WC buffer is 64 bytes on all modern Intel and AMD processors), depending on the distribution of valid bytes within the WC buffer at the time it is evicted and on the exact eviction protocol. On Intel processors, the transaction is guaranteed to contain all 64 valid bytes in the case of a full WC buffer eviction.

The AMD manual only says that a full WC buffer eviction can be performed as a single transaction.

The following quote specifies the ordering guarantees in the case of partial WC buffer evictions (where not all bytes in the buffer are marked as valid) and the ordering between writes in different WC buffers. It applies to both Intel and AMD processors.

"Once the eviction of a WC buffer has started, the data is subject to the weak ordering semantics of its definition."

The rest of the paragraph proceeds to elaborate. A partial WC buffer can be evicted using one or more transactions, and there are no ordering guarantees between these transactions. Once a write instruction is committed to a WC buffer, its location in program order is completely lost. If the target of these transactions is an IMC, persist atomicity is only provided at the granularity of a single transaction. That's how a write with an effective memory type of WC can persist without an earlier WC write persisting. If different write instructions partially overlap within the same WC buffer, a write instruction can become partially persistent, out of order with respect to other writes in the same WC buffer. A write operation in a WC buffer that crosses a chunk boundary is not architecturally guaranteed to be atomic, unless the buffer is entirely full after combining the write (on Intel processors).
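To make the full versus partial eviction distinction concrete, here is a toy software model of a single WC buffer in C. It is only a sketch: the names (wc_buffer, wc_write, wc_evict_transactions, wc_crash_image) are made up for illustration, and the rule "one transaction per chunk that holds valid bytes" for a partial eviction is a modelling assumption, since real hardware may split a partial buffer differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WC_LINE  64   /* WC buffer size on modern Intel/AMD parts */
#define WC_CHUNK 8    /* aligned chunk size mentioned in the answer */

/* Toy model of one WC buffer: data plus one valid bit per byte. */
struct wc_buffer {
    uint8_t  data[WC_LINE];
    uint64_t valid;   /* bit i set => byte i written since allocation */
};

/* Combine a write of len bytes at offset off into the buffer. */
static void wc_write(struct wc_buffer *b, unsigned off,
                     const void *src, unsigned len)
{
    assert(len > 0 && off < WC_LINE && len <= WC_LINE - off);
    memcpy(b->data + off, src, len);
    uint64_t mask = (len == WC_LINE) ? ~0ULL : (((1ULL << len) - 1) << off);
    b->valid |= mask;
}

/* Transactions used to evict the buffer in this model: 1 when every byte
 * is valid (full eviction), otherwise one per chunk holding at least one
 * valid byte (MODEL ASSUMPTION, not a documented hardware rule). */
static unsigned wc_evict_transactions(const struct wc_buffer *b)
{
    if (b->valid == ~0ULL)
        return 1;
    unsigned n = 0;
    for (unsigned c = 0; c < WC_LINE / WC_CHUNK; c++)
        if ((b->valid >> (c * WC_CHUNK)) & 0xFF)
            n++;
    return n;
}

/* Memory image after a crash in which only the chunk transactions selected
 * by done_mask (bit c => chunk c) completed. A full buffer is a single
 * transaction, so any nonzero done_mask persists all 64 bytes. */
static void wc_crash_image(const struct wc_buffer *b, uint8_t done_mask,
                           uint8_t mem[WC_LINE])
{
    if (b->valid == ~0ULL) {
        if (done_mask)
            memcpy(mem, b->data, WC_LINE);
        return;
    }
    for (unsigned i = 0; i < WC_LINE; i++) {
        int byte_valid = (int)((b->valid >> i) & 1);
        int chunk_done = (done_mask >> (i / WC_CHUNK)) & 1;
        if (byte_valid && chunk_done)
            mem[i] = b->data[i];
    }
}
```

In this model, two 8-byte writes to chunks 0 and 1 of an otherwise empty buffer evict as two independent transactions, so wc_crash_image with done_mask = 0x02 yields a line where the later write is present and the earlier one is not. That is exactly the "no prefix guarantee" case described above.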

WC buffers can be evicted in an order that is different from the buffer allocation order. Fence instructions cannot be used to selectively flush WC buffers. However, a write of any type other than WC that overlaps an allocated WC buffer causes that particular buffer to be evicted before the write is performed. A load that hits in a WC buffer (WCB) may not cause the buffer to be evicted.

The transactions that occur to flush a single WC buffer are not necessarily ordered with respect to the transactions that occur to flush another WC buffer in the same physical core. Even if the WC eviction logic is implemented such that WC buffers are evicted serially, which is likely, there is no guarantee that transactions from different WC buffers won't end up being interleaved outside the physical core domain.

This all means that persist ordering is not guaranteed between different chunks of the same WC buffer, nor between chunks of different WC buffers, even in the same physical core.

The events that cause a WC buffer to be evicted may differ between vendors and between processors from the same vendor. Some events are architectural (documented in the developer manuals) while others are implementation-specific (documented in the datasheets). Store serializing instructions are an example of a synchronous event that does guarantee flushing all WC buffers on the same logical core. A hardware interrupt delivered to a logical core is an example of an asynchronous event that also causes all of its WC buffers to be evicted. Moreover, the number of WC buffers per physical or logical core is implementation-dependent and could be zero. The size of a WC buffer is also implementation-dependent and could be, architecturally speaking, larger or smaller than the size of an L1D cache line. WC buffers could also be used for purposes other than combining WC writes, depending on the microarchitecture.

Therefore, even if you're only writing full WC buffers, it's impossible to ensure, for the purpose of persist atomicity, that a WC buffer is only evicted once it becomes full. This holds even on Intel processors, where a full WC eviction is performed using a single transaction.
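As an illustration of the pattern being discussed, the following C sketch fills one 64-byte line with eight movnti stores and then fences. The helper name nt_fill_line is hypothetical; _mm_stream_si64 and _mm_sfence are the standard compiler intrinsics for movnti and sfence. A plain-store fallback is included only so the sketch builds on non-x86-64 targets, where it obviously no longer demonstrates non-temporal behavior.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Write eight 64-bit words to one 64-byte-aligned line with movnti, then
 * fence. Hypothetical helper for illustration only. */
static void nt_fill_line(uint64_t *line, const uint64_t src[8])
{
    assert(((uintptr_t)line & 63) == 0);   /* must cover exactly one line */
    for (int i = 0; i < 8; i++) {
#if defined(__x86_64__)
        /* movnti: non-temporal store that combines in a WC buffer. */
        _mm_stream_si64((long long *)&line[i], (long long)src[i]);
#else
        line[i] = src[i];                  /* portable fallback, not NT */
#endif
    }
#if defined(__x86_64__)
    /* sfence evicts this logical core's WC buffers. The buffer may already
     * have been evicted while only partially filled (for example by an
     * interrupt between two of the stores above), so these 64 bytes are
     * NOT guaranteed to persist atomically or as a prefix. */
    _mm_sfence();
#endif
}
```

The fence makes the data leave the core, but it cannot retroactively turn an earlier partial eviction into a single transaction, which is the point of the paragraph above.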

Instead of performing multiple WC write instructions, you can use MOVDIR64B, which guarantees atomicity. MOVDIR64B doesn't allocate a WC buffer and goes directly to the destination, but it may be combined with an already allocated WC buffer, in which case the buffer is evicted immediately after the existing contents of the buffer are combined with the MOVDIR64B data. In any case, the write operation of MOVDIR64B is always performed as a single transaction. Note that the destination memory operand of MOVDIR64B is required to be aligned on a 64-byte boundary. Similar to a traditional WC store, MOVDIR64B is weakly ordered with respect to any other store, except UC. MOVDIR64B is supported on Tremont (TNT), Tiger Lake (TGL), and Sapphire Rapids (SPR).
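A minimal sketch of how MOVDIR64B might be used from C follows, assuming GCC or Clang on x86-64 with a version recent enough to know the movdir64b target. The helper names (cpu_has_movdir64b, store64_direct, store_line) are hypothetical. _movdir64b is the compiler intrinsic for the instruction, and support is enumerated by CPUID.(EAX=07H,ECX=0):ECX bit 28. The memcpy fallback is there only so the code runs on CPUs without the instruction; it gives no atomicity guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#include <immintrin.h>
#define HAVE_X86_INTRINSICS 1
#endif

/* CPUID.(EAX=07H,ECX=0):ECX[28] enumerates MOVDIR64B. */
static int cpu_has_movdir64b(void)
{
#ifdef HAVE_X86_INTRINSICS
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return (int)((ecx >> 28) & 1);
#endif
    return 0;
}

#ifdef HAVE_X86_INTRINSICS
/* The target attribute enables the instruction for this function only. */
static __attribute__((target("movdir64b")))
void store64_direct(void *dst, const void *src)
{
    _movdir64b(dst, src);   /* 64 bytes written as one transaction */
    _mm_sfence();           /* MOVDIR64B is weakly ordered; fence after it */
}
#endif

/* Store 64 bytes to a 64-byte-aligned destination. Returns 1 if
 * MOVDIR64B was used, 0 if it fell back to memcpy (no atomicity). */
static int store_line(void *dst, const void *src)
{
    assert(((uintptr_t)dst & 63) == 0);   /* required by MOVDIR64B */
#ifdef HAVE_X86_INTRINSICS
    if (cpu_has_movdir64b()) {
        store64_direct(dst, src);
        return 1;
    }
#endif
    memcpy(dst, src, 64);
    return 0;
}
```

A caller that depends on 64-byte persist atomicity would need to treat a return value of 0 as "guarantee not available" rather than silently accepting the fallback.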

A WC/WC+ write is not ordered with respect to other writes of any memory type except UC, on both Intel and AMD processors. In addition, a single write instruction (or any instruction that writes to the physical memory address space) of any memory type that crosses an aligned 8-byte boundary is itself not guaranteed to be atomic at a granularity beyond aligned 8 bytes. This includes persist atomicity. The only exceptions are MOVDIR64B, ENQCMD, and ENQCMDS; the last two are relevant when doing MMIO writes. Aligned 64-byte AVX-512 stores are likely to be persistently atomic, but this is not architecturally guaranteed and should not be relied upon.
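One way to work within the aligned 8-byte granularity, reading the paragraph above as implying that a store confined to one aligned 8-byte chunk is not torn, is to pack whatever must change together into a single 64-bit word and publish it with one store. The sketch below is an assumption-laden illustration, not something from the original answer: the header layout (32-bit length plus 32-bit tag) and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Hypothetical record header: length in the low 32 bits, tag in the
 * high 32 bits, so both fit in one aligned 8-byte chunk. */
static uint64_t pack_header(uint32_t len, uint32_t tag)
{
    return ((uint64_t)tag << 32) | len;
}

static uint32_t header_len(uint64_t h) { return (uint32_t)h; }
static uint32_t header_tag(uint64_t h) { return (uint32_t)(h >> 32); }

/* Publish the header with a single 8-byte non-temporal store. Because the
 * store does not cross an aligned 8-byte boundary, length and tag cannot
 * be torn apart from each other under the assumption stated above. */
static void publish_header(uint64_t *slot, uint64_t h)
{
    assert(((uintptr_t)slot & 7) == 0);    /* stay inside one chunk */
#if defined(__x86_64__)
    _mm_stream_si64((long long *)slot, (long long)h);
    _mm_sfence();                          /* evict the WC buffer */
#else
    *slot = h;                             /* portable fallback, not NT */
#endif
}
```

This says nothing about ordering relative to other stores: the payload the header describes still has to be made persistent, and fenced, before the header is published.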
