Why is integer assignment on a naturally aligned variable atomic on x86?


Problem Description


I've been reading this article about atomic operations, and it mentions 32-bit integer assignment being atomic on x86, as long as the variable is naturally aligned.

Why does natural alignment assure atomicity?

Solution

"Natural" alignment means aligned to its own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. page, cache-line, or an even narrower chunk size used for data transfers between different caches).

CPUs often do things like cache-access, or cache-line transfers between cores, in power-of-2 sized chunks, so alignment boundaries smaller than a cache line do matter. (See @BeeOnRope's comments below). See also Atomicity on x86 for more details on how CPUs implement atomic loads or stores internally, and Can num++ be atomic for 'int num'? for more about how atomic RMW operations like atomic<int>::fetch_add() / lock xadd are implemented internally.


First, this assumes that the int is updated with a single store instruction, rather than writing different bytes separately. This is part of what std::atomic guarantees, but plain C or C++ doesn't. It will normally be the case, though. The x86-64 System V ABI doesn't forbid compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B. For example, x = a<<16 | b could compile to two separate 16-bit stores if the compiler wanted.

Data races are Undefined Behaviour in both C and C++, so compilers can and do assume that memory is not asynchronously modified. For code that is guaranteed not to break, use C11 stdatomic or C++11 std::atomic. Otherwise the compiler will just keep a value in a register instead of reloading every time you read it, like volatile but with actual guarantees and official support from the language standard.

Before C++11, atomic ops were usually done with volatile or other things, and a healthy dose of "works on compilers we care about", so C++11 was a huge step forward. Now you no longer have to care about what a compiler does for plain int; just use atomic<int>. If you find old guides talking about atomicity of int, they probably predate C++11. When to use volatile with multi threading? explains why that works in practice, and that atomic<T> with memory_order_relaxed is the modern way to get the same functionality.

std::atomic<int> shared;  // shared variable (compiler ensures alignment)

int x;           // local variable (compiler can keep it in a register)
x = shared.load(std::memory_order_relaxed);
shared.store(x, std::memory_order_relaxed);
// shared = x;  // don't do that unless you actually need seq_cst, because MFENCE or XCHG is much slower than a simple store

Side-note: for atomic<T> larger than the CPU can do atomically (so .is_lock_free() is false), see Where is the lock for a std::atomic?. int and int64_t / uint64_t are lock-free on all the major x86 compilers, though.
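As a quick sanity check, you can query lock-freedom directly. A minimal sketch; the printed values assume a mainstream x86-64 compiler:

#include <atomic>
#include <cstdint>
#include <iostream>

int main() {
    std::atomic<int32_t> a{0};
    std::atomic<int64_t> b{0};
    // Both print 1 (lock-free) on the major x86 compilers; a type too wide
    // for the hardware's atomic support would print 0 and be implemented
    // with a hidden lock instead.
    std::cout << a.is_lock_free() << ' ' << b.is_lock_free() << '\n';
}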


Thus, we just need to talk about the behaviour of an instruction like mov [shared], eax.


TL;DR: The x86 ISA guarantees that naturally-aligned stores and loads are atomic, up to 64 bits wide. So compilers can use ordinary stores/loads as long as they ensure that std::atomic<T> has natural alignment.

(But note that i386 gcc -m32 fails to do that for C11 _Atomic 64-bit types inside structs, only aligning them to 4B, so atomic_llong can be non-atomic in some cases (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4). g++ -m32 with std::atomic is fine, at least in g++5, because https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147 was fixed in 2015 by a change to the <atomic> header. That didn't change the C11 behaviour, though.)
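You can see the alignment guarantee the compiler gives you with a couple of static_asserts. A sketch; these hold on x86 targets with a post-2015 <atomic> header, per the bug above:

#include <atomic>
#include <cstdint>

static_assert(alignof(std::atomic<int64_t>) == 8,
              "atomic<int64_t> is naturally aligned");

struct S {
    char c;                  // followed by 7 bytes of padding
    std::atomic<int64_t> v;  // still lands on an 8B boundary inside the struct
};
static_assert(alignof(S) == 8, "the struct inherits the member's alignment");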


IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".

From the "Intel® 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (see also the tag wiki for links: current versions of all volumes, or direct link to page 256 of the vol3 pdf from Dec 2015)

In x86 terminology, a "word" is two 8-bit bytes. 32 bits are a double-word, or DWORD.

Section 8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary
  • Reading or writing a doubleword aligned on a 32-bit boundary (This is another way of saying "natural alignment")

That last point is the answer to your question: this behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).


The rest of the section provides further guarantees for newer Intel CPUs: Pentium widens this guarantee to 64 bits.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

  • Reading or writing a quadword aligned on a 64-bit boundary (e.g. x87 load/store of a double, or cmpxchg8b (which was new in Pentium P5))
  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:

"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses."


AMD's manual agrees with Intel's about aligned 64-bit and narrower loads/stores being atomic

So integer, x87, and MMX/SSE loads/stores up to 64b, even in 32-bit or 16-bit mode (e.g. movq, movsd, movhps, pinsrq, extractps, etc.) are atomic if the data is aligned. gcc -m32 uses movq xmm, [mem] to implement atomic 64-bit loads for things like std::atomic<int64_t>. Clang 4.0 -m32 unfortunately uses lock cmpxchg8b (bug 33109).

On some CPUs with 128b or 256b internal data paths (between execution units and L1, and between different caches), 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard or easily queryable at run-time, unfortunately for compilers implementing std::atomic<__int128> or 16B structs.

If you want atomic 128b across all x86 systems, you must use lock cmpxchg16b (available only in 64-bit mode). (And it wasn't available in the first-gen x86-64 CPUs. You need to use -mcx16 with GCC/Clang for them to emit it.)
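For illustration, a sketch of a 16B atomic in C++ (assumes x86-64 and build flags along the lines of g++ -mcx16, possibly -latomic; depending on compiler version the load below becomes an inline lock cmpxchg16b or a libatomic call):

#include <atomic>
#include <cstdint>

struct alignas(16) Pair { uint64_t lo, hi; };

std::atomic<Pair> p{Pair{0, 0}};

Pair read_pair() {
    // Note this "load" is really a full RMW on x86: cmpxchg16b always writes,
    // so it dirties the cache line and can't target read-only memory.
    return p.load();
}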

Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 (K10) with threads running on separate sockets, connected with HyperTransport.
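The kind of test that exposes this looks roughly like the following sketch (not the exact Opteron experiment: it assumes a 64-bit build and uses plain volatile accesses, which are technically a data race in C++ but are how such hardware probing is done in practice):

#include <cstdint>
#include <cstdio>
#include <thread>

volatile uint64_t shared_val = 0;  // assumed naturally aligned

int main() {
    std::thread writer([] {
        for (;;) {                               // alternate two patterns forever
            shared_val = 0x0000000000000000ULL;
            shared_val = 0xFFFFFFFFFFFFFFFFULL;
        }
    });
    writer.detach();
    for (long i = 0; i < 100000000L; ++i) {
        uint64_t v = shared_val;
        if (v != 0 && v != 0xFFFFFFFFFFFFFFFFULL)  // a mix of halves = tearing
            std::printf("tearing observed: %016llx\n", (unsigned long long)v);
    }
}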


Intel's and AMD's manuals diverge for unaligned access to cacheable memory. The common subset for all x86 CPUs is the AMD rule. Cacheable means write-back or write-through memory regions, not uncacheable or write-combining, as set with PAT or MTRR regions. They don't mean that the cache-line has to already be hot in L1 cache.

  • Intel P6 and later guarantee atomicity for cacheable loads/stores up to 64 bits as long as they're within a single cache-line (64B, or 32B on very old CPUs like Pentium III).
  • AMD guarantees atomicity for cacheable loads/stores that fit within a single 8B-aligned chunk. That makes sense, because we know from the 16B-store test on multi-socket Opteron that HyperTransport only transfers in 8B chunks, and doesn't lock while transferring to prevent tearing. (See above). I guess lock cmpxchg16b must be handled specially.

Possibly related: AMD uses MOESI to share dirty cache-lines directly between caches in different cores, so one core can be reading from its valid copy of a cache line while updates to it are coming in from another cache.

Intel uses MESIF, which requires dirty data to propagate out to the large shared inclusive L3 cache which acts as a backstop for coherency traffic. L3 is tag-inclusive of per-core L2/L1 caches, even for lines that have to be in the Invalid state in L3 because of being M or E in a per-core L1 cache. The data path between L3 and per-core caches is only 32B wide in Haswell/Skylake, so it must buffer or something to avoid a write to L3 from one core happening between reads of two halves of a cache line, which could cause tearing at the 32B boundary.

The relevant sections of the manuals:

The P6 family processors (and newer Intel processors since) guarantee that the following additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

AMD64 Manual 7.3.2 Access Atomicity
Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor model, as are misaligned loads or stores of less than a quadword that are contained entirely within a naturally-aligned quadword

Notice that AMD guarantees atomicity for any load smaller than a qword, but Intel only for power-of-2 sizes. 32-bit protected mode and 64-bit long mode can load a 48-bit m16:32 as a memory operand into cs:eip with far-call or far-jmp. (And far-call pushes stuff on the stack.) IDK if this counts as a single 48-bit access or separate 16 and 32-bit.

There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the tag wiki). It's not usefully skimmable since they define some symbols to express things in their own notation, and I haven't tried to really read it. IDK if it describes the atomicity rules, or if it's only concerned with memory ordering.


Atomic Read-Modify-Write

I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).

To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic. Also note that even if cmpxchg8b without lock does a single atomic load (and optionally a store), it's not safe in general to use it as a 64b load with expected=desired. If the value in memory happens to match your expected, you'll get a non-atomic read-modify-write of that location.
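In C++ terms, the equivalent trap is emulating a load with compare-exchange. A sketch (load_via_cas is a hypothetical name):

#include <atomic>
#include <cstdint>

std::atomic<int64_t> x{0};

int64_t load_via_cas() {
    int64_t expected = 0;
    // If x happens to equal `expected`, this does a locked *write* of the
    // same value back: a full RMW that dirties the cache line and faults on
    // read-only memory, unlike a genuine atomic load.
    x.compare_exchange_strong(expected, expected);
    return expected;  // holds the observed value either way
}

int64_t load_proper() {
    return x.load();  // a plain aligned mov on x86-64: genuinely read-only
}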

The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.

(lock is implicit in xchg reg, [mem], so don't use xchg with mem to save code-size or instruction count unless performance is irrelevant. Only use it when you want the memory barrier and/or the atomic exchange, or when code-size is the only thing that matters, e.g. in a boot sector.)

See also: Can num++ be atomic for 'int num'?


Why lock mov [mem], reg doesn't exist for atomic unaligned stores

From the instruction reference manual (Intel x86 manual vol2), cmpxchg:

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.

The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.

Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. i.e. x86 can't do relaxed atomic RMW (without flushing the store buffer). Other ISAs can, so using .fetch_add(1, memory_order_relaxed) can be faster on non-x86.
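For example (a sketch; on x86 both lines below compile to the same lock add or lock xadd, while on a weakly-ordered ISA like AArch64 the relaxed form can use cheaper instructions):

#include <atomic>

std::atomic<int> counter{0};

void hit() {
    counter.fetch_add(1, std::memory_order_relaxed);  // still a full barrier on x86
    counter.fetch_add(1, std::memory_order_seq_cst);  // same code on x86
}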

Fun fact: Before mfence existed, a common idiom was lock add dword [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE as a stand-alone memory barrier, especially on AMD CPUs.
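A sketch of that idiom with GCC extended inline asm (assumes 32-bit code where the stack pointer is %esp, matching the text; a 64-bit version would typically use an offset from %rsp to avoid touching the return address):

static inline void legacy_full_barrier() {
    // no-op locked RMW on the top of the stack, acting as a full barrier
    __asm__ __volatile__("lock addl $0, (%%esp)" ::: "memory", "cc");
}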

xchg [mem], reg is probably the most efficient way to implement a sequential-consistency store, vs. mov+mfence, on both Intel and AMD. mfence on Skylake at least blocks out-of-order execution of non-memory instructions, but xchg and other locked ops don't. Compilers other than gcc do use xchg for stores, even when they don't care about reading the old value.
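The codegen difference, as a sketch (the exact instruction choice is the compiler's, not the language's; "gcc used mov + mfence" reflects the historical behaviour described above):

#include <atomic>

std::atomic<int> shared{0};

void publish(int v) {
    shared.store(v, std::memory_order_seq_cst);  // xchg on clang/MSVC; gcc used mov + mfence
    shared.store(v, std::memory_order_release);  // a plain mov: x86 stores are already release
}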


Motivation for this design decision:

Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee atomicity for aligned accesses of bus-width or smaller.

For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.
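For instance, the canonical lock built from an atomic exchange (a minimal spinlock sketch; a production version would add a pause instruction and backoff):

#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // xchg is the atomic access that makes locking possible at all
        while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};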


Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.

Since you linked one in the question, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. asm for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.
