Atomicity on x86


Question


8.1.2 Bus Locking

Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link. While this output signal is asserted, requests from other processors or bus agents for control of the bus are blocked. Software can specify other occasions when the LOCK semantics are to be followed by prepending the LOCK prefix to an instruction.

It comes from Intel Manual, Volume 3

It sounds like the atomic operations on memory will be executed directly on memory (RAM). I am confused because I see "nothing special" when I analyze assembly output. Basically, the assembly output generated for std::atomic<int> X; X.load() puts only "extra" mfence. But, it is responsible for proper memory ordering, not for an atomicity. If I understand properly the X.store(2) is just mov [somewhere], $2. And that's all. It seems that it doesn't "skip" the cache. I know that moving aligned ( for example ints) to memory is atomic. However, I am confused.


So, I have presented my doubts but the main question is:

How does the CPU implement atomic operations internally?

Answer

It sounds like the atomic operations on memory will be executed directly on memory (RAM).

Nope, as long as every possible observer in the system sees the operation as atomic, the operation can involve cache only.

Satisfying this requirement is much more difficult for atomic read-modify-write operations (like lock add [mem], eax, especially with an unaligned address), which is when a CPU might assert the LOCK# signal. You still wouldn't see any more than that in the asm: the hardware implements the ISA-required semantics for locked instructions.

I doubt, though, that there is a physical external LOCK# pin on modern CPUs, where the memory controller is built into the CPU instead of sitting in a separate northbridge chip.


std::atomic<int> X; X.load() puts only "extra" mfence.

Compilers don't emit MFENCE for seq_cst loads.

I think I read that old MSVC at one point did emit MFENCE for this (maybe to prevent reordering with unfenced NT stores? Or instead of on stores?). But it doesn't anymore: I tested MSVC 19.00.23026.0. Look for foo and bar in the asm output from this program that dumps its own asm in an online compile&run site.

The reason we don't need a fence here is that the x86 memory model disallows both LoadStore and LoadLoad reordering. Earlier (non seq_cst) stores can still be delayed until after a seq_cst load, so it's different from using a stand-alone std::atomic_thread_fence(mo_seq_cst); before an X.load(mo_acquire);

If I understand properly the X.store(2) is just mov [somewhere], 2

That's consistent with your idea that loads needed mfence; one or the other of seq_cst loads or stores needs a full barrier to prevent StoreLoad reordering, which could otherwise happen.

In practice compiler devs picked cheap loads (mov) / expensive stores (mov+mfence) because loads are more common. (See the C++11 mappings to processors.)

(The x86 memory-ordering model is program order plus a store buffer with store-forwarding. This makes mo_acquire and mo_release free in asm, since they only need to block compile-time reordering, and lets us choose whether to put the MFENCE full barrier on loads or stores.)

So seq_cst stores are either mov+mfence or xchg. The Q&A "Why does a std::atomic store with sequential consistency use XCHG?" discusses the performance advantages of xchg on some CPUs. On AMD, MFENCE is (IIRC) documented to have extra serialize-the-pipeline semantics (for instruction execution, not just memory ordering) that block out-of-order exec, and on some Intel CPUs (Skylake) that's also the case in practice.

MSVC's asm for stores is the same as clang's, using xchg to do the store + memory barrier with the same instruction.

Atomic release or relaxed stores can be just mov, with the difference between them being only how much compile-time reordering is allowed.
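To make the mapping above concrete, here is a minimal sketch of the four cases. The function names are hypothetical, and the instructions named in the comments are what compilers typically emit on x86 according to the discussion above, not something the C++ standard promises; check your own compiler's asm output.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> X{0};

// seq_cst load: on x86 this is typically a plain mov, with no fence.
int load_seq_cst() { return X.load(std::memory_order_seq_cst); }

// seq_cst store: typically xchg, or mov followed by mfence.
void store_seq_cst(int v) { X.store(v, std::memory_order_seq_cst); }

// Release store: typically a plain mov; only compile-time reordering is restricted.
void store_release(int v) { X.store(v, std::memory_order_release); }

// Relaxed store: also a plain mov, with the fewest ordering constraints.
void store_relaxed(int v) { X.store(v, std::memory_order_relaxed); }

// Small self-check: whichever ordering is used, the stored value is read back whole.
void self_check() {
    store_seq_cst(2);
    assert(load_seq_cst() == 2);
}
```

All four functions are atomic in the same way; the memory order only changes how much reordering around them is allowed.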


This question looks like part 2 of your earlier question, "Memory Model in C++ : sequential consistency and atomicity", where you asked:

How does the CPU implement atomic operations internally?

As you pointed out in the question, atomicity is unrelated to ordering with respect to any other operations. (i.e. memory_order_relaxed). It just means that the operation happens as a single indivisible operation, hence the name, not as multiple parts which can happen partially before and partially after something else.

You get atomicity "for free" with no extra hardware for aligned loads or stores up to the size of the data paths between cores, memory, and I/O busses like PCIe. i.e. between the various levels of cache, and between the caches of separate cores. The memory controllers are part of the CPU in modern designs, so even a PCIe device accessing memory has to go through the CPU's system agent. (This even lets Skylake's eDRAM L4 (not available in any desktop CPUs :( ) work as a memory-side cache (unlike Broadwell, which used it as a victim cache for L3 IIRC), sitting between memory and everything else in the system so it can even cache DMA).

This means the CPU hardware can do whatever is necessary to make sure a store or load is atomic with respect to anything else in the system which can observe it. This is probably not much, if anything. DDR memory uses a wide enough data bus that a 64bit aligned store really does electrically go over the memory bus to the DRAM all in the same cycle. (fun fact, but not important. A serial bus protocol like PCIe wouldn't stop it from being atomic, as long as a single message is big enough. And since the memory controller is the only thing that can talk to the DRAM directly, it doesn't matter what it does internally, just the size of transfers between it and the rest of the CPU). But anyway, this is the "for free" part: no temporary blocking of other requests is needed to keep an atomic transfer atomic.

x86 guarantees that aligned loads and stores up to 64 bits are atomic, but not wider accesses. Low-power implementations are free to break up vector loads/stores into 64-bit chunks like P6 did from PIII until Pentium M.


Atomic ops happen in cache

Remember that atomic just means all observers see it as having happened or not happened, never partially-happened. There's no requirement that it actually reaches main memory right away (or at all, if overwritten soon). Atomically modifying or reading L1 cache is sufficient to ensure that any other core or DMA access will see an aligned store or load happen as a single atomic operation. It's fine if this modification happens long after the store executes (e.g. delayed by out-of-order execution until the store retires).
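A rough way to see "happened or not happened, never partially-happened" from software is to have one thread alternate between two bit patterns while another thread reads. This is only a sketch: `count_torn_reads` and the two patterns are names I made up for illustration, and relaxed ordering is used on the shared value to emphasize that atomicity doesn't depend on ordering.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Counts how many reads see a value that is neither of the two patterns the
// writer stores. With std::atomic this should always be zero.
int count_torn_reads(int iterations) {
    constexpr std::uint64_t kA = 0x0000000000000000ULL;
    constexpr std::uint64_t kB = 0xFFFFFFFFFFFFFFFFULL;
    std::atomic<std::uint64_t> shared{kA};
    std::atomic<bool> done{false};

    std::thread writer([&] {
        for (int i = 0; i < iterations; ++i) {
            // Aligned 8-byte store: a plain mov on x86-64.
            shared.store((i & 1) ? kA : kB, std::memory_order_relaxed);
        }
        done.store(true, std::memory_order_release);
    });

    int torn = 0;
    while (!done.load(std::memory_order_acquire)) {
        std::uint64_t v = shared.load(std::memory_order_relaxed);
        if (v != kA && v != kB) ++torn;  // a half-old, half-new value
    }
    writer.join();
    return torn;
}
```

Calling `count_torn_reads(100000)` should return 0. A test like this can only fail to find tearing; it can't prove atomicity, which comes from the ISA and language guarantees.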

Modern CPUs like Core2 with 128-bit paths everywhere typically have atomic SSE 128b loads/stores, going beyond what the x86 ISA guarantees. But note the interesting exception on a multi-socket Opteron, probably due to HyperTransport. That's proof that atomically modifying L1 cache isn't sufficient to provide atomicity for stores wider than the narrowest data path (which in this case isn't the path between L1 cache and the execution units).

Alignment is important: A load or store that crosses a cache-line boundary has to be done in two separate accesses. This makes it non-atomic.
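The alignment rule is easy to check with a little address arithmetic. The helpers below are hypothetical, and the 64-byte line size is an assumption that holds for current x86 CPUs but isn't guaranteed by the ISA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical of current x86 CPUs.
constexpr std::size_t kCacheLine = 64;

// True if the byte range [addr, addr + size) spans more than one block of
// `boundary` bytes, i.e. the access would need two separate accesses.
constexpr bool crosses_boundary(std::uintptr_t addr, std::size_t size,
                                std::size_t boundary = kCacheLine) {
    return size != 0 && (addr / boundary) != ((addr + size - 1) / boundary);
}

// True if the access is naturally aligned (address is a multiple of its size).
constexpr bool naturally_aligned(std::uintptr_t addr, std::size_t size) {
    return size != 0 && addr % size == 0;
}
```

For example, an 8-byte access at offset 60 covers bytes 60 to 67 and crosses the line boundary at 64, while the same access at offset 56 stays inside one line.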

x86 guarantees that cached accesses up to 8 bytes are atomic as long as they don't cross an 8B boundary on AMD/Intel. (Or for Intel only on P6 and later, don't cross a cache-line boundary). This implies that whole cache lines (64B on modern CPUs) are transferred around atomically on Intel, even though that's wider than the data paths (32B between L2 and L3 on Haswell/Skylake). This atomicity isn't totally "free" in hardware, and maybe requires some extra logic to prevent a load from reading a cache-line that's only partially transferred. Although cache-line transfers only happen after the old version was invalidated, so a core shouldn't be reading from the old copy while there's a transfer happening. AMD can tear in practice on smaller boundaries, maybe because of using a different extension to MESI that can transfer dirty data between caches.

For wider operands, like atomically writing new data into multiple entries of a struct, you need to protect it with a lock which all accesses to it respect. (You may be able to use x86 lock cmpxchg16b with a retry loop to do an atomic 16b store. Note that there's no way to emulate it without a mutex.)
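Both options can be sketched in portable C++. The `Range` and `SharedRange` types are hypothetical, and the retry loop is shown on a 64-bit value so it stays lock-free with any compiler; a cmpxchg16b-based 16-byte store would have the same shape, but whether a compiler generates it for a 16-byte std::atomic depends on the toolchain and flags.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical two-field record that must always be seen as a consistent pair.
struct Range {
    std::uint64_t begin;
    std::uint64_t end;
};

// Every access goes through the same mutex, so no reader sees a half-updated pair.
class SharedRange {
public:
    void set(Range r) {
        std::lock_guard<std::mutex> guard(m_);
        value_ = r;
    }
    Range get() const {
        std::lock_guard<std::mutex> guard(m_);
        return value_;
    }
private:
    mutable std::mutex m_;
    Range value_{0, 0};
};

// A store built from a compare-exchange retry loop.
void store_via_cas(std::atomic<std::uint64_t>& target, std::uint64_t desired) {
    std::uint64_t expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, desired)) {
        // On failure, expected now holds the current value; try again.
    }
}
```

The mutex version only works if all readers and writers use `get` and `set`; one access that bypasses the lock can still observe a torn pair.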


Atomic read-modify-write is where it gets harder

Related: my answer on "Can num++ be atomic for 'int num'?" goes into more detail about this.

Each core has a private L1 cache which is coherent with all other cores (using MESI or a variant of it, such as MESIF on Intel or MOESI on AMD). Cache lines are transferred between levels of cache and main memory in chunks ranging in size from 64 bits to 256 bits. (These transfers may actually be atomic on a whole-cache-line granularity?)

To do an atomic RMW, a core can keep a line of L1 cache in Modified state without accepting any external modifications to the affected cache line between the load and the store; the rest of the system will then see the operation as atomic. (And thus it is atomic, because the usual out-of-order execution rules require that the local thread sees its own code as having run in program order.)

It can do this by not processing any cache-coherency messages while the atomic RMW is in-flight (or some more complicated version of this which allows more parallelism for other ops).
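From the C++ side, this is what an atomic RMW looks like. The sketch below uses hypothetical helper names; `fetch_add` typically compiles to a single lock-prefixed instruction on x86 (such as lock add or lock xadd), and the second version builds the same increment from a compare-exchange retry loop (lock cmpxchg).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Atomic increment as one RMW operation.
void increment(std::atomic<int>& counter) {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// The same increment built from a compare-exchange retry loop.
void increment_cas(std::atomic<int>& counter) {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_relaxed)) {
        // On failure, expected holds the value another thread wrote; retry.
    }
}

// Runs `threads` workers, each incrementing a shared counter `iters` times,
// and returns the final count.
int run_counter(int threads, int iters, bool use_cas) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iters, use_cas] {
            for (int i = 0; i < iters; ++i) {
                if (use_cas) increment_cas(counter);
                else increment(counter);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

`run_counter(4, 10000, false)` and `run_counter(4, 10000, true)` should both return 40000. With a plain `int` and `++` instead, the load and the store could interleave with other threads and increments would be lost.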

Unaligned locked ops are a problem: we need other cores to see modifications to two cache lines happen as a single atomic operation. This may require actually storing to DRAM, and taking a bus lock. (AMD's optimization manual says this is what happens on their CPUs when a cache-lock isn't sufficient.)
