Why doesn't Ice Lake have MOVDIRx like Tremont? Do they already have better ones?


Problem description



I notice that Intel Tremont has 64-byte store instructions, MOVDIRI and MOVDIR64B.
These guarantee an atomic write to memory, whereas load atomicity is not guaranteed. Moreover, the writes are weakly ordered, so an immediately-following fence may be needed.
I find no MOVDIRx in Ice Lake.

Why doesn't Ice Lake need instructions such as MOVDIRx?

(At the bottom of page 15)
Intel® Architecture Instruction Set Extensions and Future Features Programming Reference
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf#page=15

Solution

Ice Lake has AVX512, which gives us 64-byte loads + stores, but no guarantee of 64-byte store atomicity.

We do get 64-byte NT stores with movntps [mem], zmm / movntdq [mem], zmm. Interestingly, NT stores don't support merge-masking to leave some bytes unwritten. That would basically defeat the purpose of NT stores by creating partial-line writes, though.

Probably Ice Lake Pentium / Celeron CPUs still won't have AVX1/2, let alone AVX512 (probably so they can sell chips with defects in the upper 128 bits of the FMA units and/or register file on at least one core), so only rep movsb will be able to internally use 64-byte loads/stores on those CPUs. (IceLake will have the "fast short rep" feature, which may make it useful even for small 64-byte copies, useful in kernel code that can't use vector regs.)


Possibly Intel can't (or doesn't want to) provide that atomicity guarantee on their mainstream CPUs, only on low-power chips that don't support multiple sockets, but I haven't heard any reports of tearing actually existing within a cache line on Intel CPUs. In practice, I think cached loads/stores that don't cross a cache-line boundary on current Intel CPUs are always atomic.

(Unlike on AMD K10 where HyperTransport did create tearing on 8B boundaries between sockets, while no tearing could be seen between cores on a single socket. SSE instructions: which CPUs can do atomic 16B memory operations?)

In any case, there's no way to detect this with CPUID, and it's not documented, so it's basically impossible to safely take advantage. It would be nice if there was a CPUID leaf that told you the atomicity width for the system and for within a single socket, so implementations that split 512-bit AVX512 ops into 256-bit halves would still be allowed.

Anyway, rather than introducing a special instruction with guaranteed store atomicity, I think it would be more likely for CPU vendors to start documenting and providing CPUID detection of wider store atomicity for either all power-of-2-size stores, or for only NT stores, or something.

Making some part of AVX512 require 64-byte atomicity would make it much harder for AMD to support, if they follow their current strategy of half-width vector implementation. (Zen2 will have 256-bit vector ALUs, making AVX1/AVX2 instructions mostly single-uop, but reportedly it won't have AVX512 support, unfortunately. AVX512 is a very nice ISA even if you only use it at 256-bit width, filling more gaps in what can be done conveniently / efficiently, e.g. unsigned int<->FP and [u]int64<->double.)

So IDK if maybe Intel agreed not to do that, or chose not to for their own reasons.


Use case for 64B write atomicity:

I suspect the main use-case is reliably creating 64-byte PCIe transactions, not actually "atomicity" per se, and not for observation by another core.

If you cared about reading from other cores, normally you'd want L3 cache to backstop the data, not bypass it to DRAM. A seqlock is probably a faster way to emulate 64-byte atomicity between CPU cores, even if movdir64B is available.

Skylake already has 12 write-combining buffers (up from 10 in Haswell), so it's (maybe?) not too hard to use regular NT stores to create a full-size PCIe transaction, avoiding early flushes. But maybe low-power CPUs have fewer buffers and maybe it's a challenge to reliably create 64B transactions to a NIC buffer or something.
