Why is integer assignment on a naturally aligned variable atomic on x86?


Question

I've been reading this article about atomic operations, and it mentions 32bit integer assignment being atomic on x86, as long as the variable is naturally aligned.

Why does natural alignment assure atomicity?

Answer

自然"对齐方式是指与其自身的字体宽度对齐.因此,加载/存储将永远不会跨越比其本身更宽的任何边界(例如,页面,缓存行或用于不同缓存之间的数据传输的甚至更窄的块大小).

"Natural" alignment means aligned to it's own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. page, cache-line, or an even narrower chunk size used for data transfers between different caches).

CPUs often do things like cache-access, or cache-line transfers between cores, in power-of-2 sized chunks, so alignment boundaries smaller than a cache line do matter. (See @BeeOnRope's comments below). See also Atomicity on x86 for more details on how CPUs implement atomic loads or stores internally, and Can num++ be atomic for 'int num'? for more about how atomic RMW operations like atomic<int>::fetch_add() / lock xadd are implemented internally.

First, this assumes that the int is updated with a single store instruction, rather than writing different bytes separately. This is part of what std::atomic guarantees, but that plain C or C++ doesn't. It will normally be the case, though. The x86-64 System V ABI doesn't forbid compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B. For example, x = a<<16 | b could compile to two separate 16-bit stores if the compiler wanted.
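
A minimal sketch of that point (the lowering described in the comment is hypothetical, shown only to illustrate what the ABI leaves the compiler free to do):

#include <cstdint>

int x;                                    // plain int: naturally aligned, but NOT std::atomic
void combine(std::uint16_t a, std::uint16_t b) {
    x = (a << 16) | b;    // a compiler *could* legally emit two separate 16-bit stores here,
}                         // so another thread might observe a torn value (and the data race is UB anyway)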

Data races are Undefined Behaviour in both C and C++, so compilers can and do assume that memory is not asynchronously modified. For code that is guaranteed not to break, use C11 stdatomic or C++11 std::atomic. Otherwise the compiler will just keep the value in a register instead of reloading it every time you read it. std::atomic is like volatile, but with actual guarantees and official support from the language standard.

Before C++11, atomic ops were usually done with volatile or other things, and a healthy dose of "works on compilers we care about", so C++11 was a huge step forward. Now you no longer have to care about what a compiler does for plain int; just use atomic<int>. If you find old guides talking about atomicity of int, they probably predate C++11.

std::atomic<int> shared;  // shared variable (compiler ensures alignment)

int x;           // local variable (compiler can keep it in a register)
x = shared.load(std::memory_order_relaxed);
shared.store(x, std::memory_order_relaxed);
// shared = x;  // don't do that unless you actually need seq_cst, because MFENCE or XCHG is much slower than a simple store

Side-note: for atomic<T> larger than the CPU can do atomically (so .is_lock_free() is false), see Where is the lock for a std::atomic?. int and int64_t / uint64_t are lock-free on all the major x86 compilers, though.
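
If you want to check this on your own target, the standard is_lock_free() query is enough; a minimal sketch:

#include <atomic>
#include <cstdio>

int main() {
    std::atomic<int>       i;
    std::atomic<long long> q;
    std::printf("atomic<int>: %d  atomic<long long>: %d\n",
                (int)i.is_lock_free(), (int)q.is_lock_free());
    // On mainstream x86 / x86-64 compilers both print 1; a 16-byte type often won't.
}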

Thus, we just need to talk about the behaviour of an insn like mov [shared], eax.

TL;DR: The x86 ISA guarantees that naturally-aligned stores and loads are atomic, up to 64bits wide. So compilers can use ordinary stores/loads as long as they ensure that std::atomic<T> has natural alignment.

(But note that i386 gcc -m32 fails to do that for C11 _Atomic 64-bit types, only aligning them to 4B, so atomic_llong is not actually atomic (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4). g++ -m32 with std::atomic is fine, at least in g++5, because https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147 was fixed in 2015 by a change to the <atomic> header. That didn't change the C11 behaviour, though.)
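
A quick way to see the alignment side of this from C++ (the exact numbers are target/ABI dependent; the comment describes i386 System V with a fixed <atomic> header, as discussed above):

#include <atomic>
#include <cstdio>

int main() {
    std::printf("alignof(long long)              = %zu\n", alignof(long long));
    std::printf("alignof(std::atomic<long long>) = %zu\n", alignof(std::atomic<long long>));
    // e.g. g++ -m32 (i386 SysV): 4 vs 8 -- the atomic wrapper enforces natural alignment.
}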

IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".

From the "Intel® 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (see also the x86 tag wiki for links: current versions of all volumes, or direct link to page 256 of the vol3 pdf from Dec 2015)

In x86 terminology, a "word" is two 8-bit bytes. 32 bits are a double-word, or DWORD.

Section 8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary
  • Reading or writing a doubleword aligned on a 32-bit boundary (This is another way of saying "natural alignment")

That last point that I bolded is the answer to your question: This behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).

The rest of the section provides further guarantees for newer Intel CPUs: Pentium widens this guarantee to 64 bits.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

  • Reading or writing a quadword aligned on a 64-bit boundary (e.g. x87 load/store of a double, or cmpxchg8b (which was new in Pentium P5))
  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:

"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses."


AMD's manual agrees with Intel's about aligned 64-bit and narrower loads/stores being atomic

So integer, x87, and MMX/SSE loads/stores up to 64b, even in 32-bit or 16-bit mode (e.g. movq, movsd, movhps, pinsrq, extractps, etc) are atomic if the data is aligned. gcc -m32 uses movq xmm, [mem] to implement atomic 64-bit loads for things like std::atomic<int64_t>. Clang 4.0 -m32 unfortunately uses lock cmpxchg8b (bug 33109).
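
In C++ terms, that means a relaxed 64-bit load/store like the sketch below can compile to a single instruction: a plain 8-byte mov in 64-bit mode, or (with gcc -m32) an SSE movq as described above. The exact codegen depends on compiler and flags, so treat the comments as illustrative:

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> shared64;

std::uint64_t load64() {
    return shared64.load(std::memory_order_relaxed);   // x86-64: typically a single  mov rax, [shared64]
}
void store64(std::uint64_t v) {
    shared64.store(v, std::memory_order_relaxed);      // x86-64: typically a single  mov [shared64], rdi
}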

On some CPUs with 128b or 256b internal data paths (between execution units and L1, and between different caches), 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard or easily queryable at run-time, unfortunately for compilers implementing std::atomic<__int128> or 16B structs.

If you want atomic 128b across all x86 systems, you must use lock cmpxchg16b (available only in 64bit mode). (And it wasn't available in the first-gen x86-64 CPUs. You need to use -mcx16 with gcc/clang for them to emit it.)
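
A minimal sketch of a 16-byte CAS using the GCC/Clang generic __atomic builtin (this assumes gcc or clang; whether it inlines to lock cmpxchg16b or falls back to a libatomic call depends on the compiler version and on -mcx16 / -latomic, as noted above):

#include <cstdint>

struct alignas(16) Pair { std::uint64_t lo, hi; };

// 16-byte compare-and-swap; returns true if *p equalled *expected and was replaced by desired.
bool cas16(Pair* p, Pair* expected, Pair desired) {
    return __atomic_compare_exchange(p, expected, &desired,
                                     /*weak=*/false,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}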

Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 (K10) with threads running on separate sockets, connected with HyperTransport.

Intel's and AMD's manuals diverge for unaligned access to cacheable memory. The common subset for all x86 CPUs is the AMD rule. Cacheable means write-back or write-through memory regions, not uncacheable or write-combining, as set with PAT or MTRR regions. They don't mean that the cache-line has to already be hot in L1 cache.

  • Intel P6 and later guarantee atomicity for cacheable loads/stores as long as they fit within a single cache line (64B, or 32B on very old CPUs like Pentium III).
  • AMD guarantees atomicity for cacheable loads/stores that fit within a single 8B-aligned chunk. That makes sense, because we know from the 16B-store test on multi-socket Opteron that HyperTransport only transfers in 8B chunks, and doesn't lock while transferring to prevent tearing. (See above.) I assume lock cmpxchg16b has to be handled specially.

Possibly related: AMD uses MOESI to share dirty cache-lines directly between caches in different cores, so one core can be reading from its valid copy of a cache line while updates to it are coming in from another cache.

Intel uses MESIF, which requires dirty data to propagate out to the large shared inclusive L3 cache which acts as a backstop for coherency traffic. L3 is tag-inclusive of per-core L2/L1 caches, even for lines that have to be in the Invalid state in L3 because of being M or E in a per-core L1 cache. The data path between L3 and per-core caches is only 32B wide in Haswell/Skylake, so it must buffer or something to avoid a write to L3 from one core happening between reads of two halves of a cache line, which could cause tearing at the 32B boundary.

Relevant sections of the manuals:

The P6 family processors (and newer Intel processors since) guarantee that the following additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

AMD64 Manual 7.3.2 Access Atomicity
Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor model, as are misaligned loads or stores of less than a quadword that are contained entirely within a naturally-aligned quadword

Notice that AMD guarantees atomicity for any load smaller than a qword, but Intel only for power-of-2 sizes. 32-bit protected mode and 64-bit long mode can load a 48 bit m16:32 as a memory operand into cs:eip with far-call or far-jmp. (And far-call pushes stuff on the stack.) IDK if this counts as a single 48-bit access or separate 16 and 32-bit.

There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the x86 tag wiki). It's not usefully skimmable since they define some symbols to express things in their own notation, and I haven't tried to really read it. IDK if it describes the atomicity rules, or if it's only concerned with memory ordering.

I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).

To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic. Also note that even if cmpxchg8b without lock does a single atomic load (and optionally a store), it's not safe in general to use it as a 64b load with expected=desired. If the value in memory happens to match your expected, you'll get a non-atomic read-modify-write of that location.
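
At the C++ level, the standard pattern for "retry the whole read-modify-write until it succeeds" is a compare_exchange loop. A minimal sketch, using a saturating add as an example of an RMW that x86 has no single instruction for (so a plain fetch_add can't express it):

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> counter;

void add_saturating(std::uint64_t n) {
    std::uint64_t old = counter.load(std::memory_order_relaxed);
    std::uint64_t desired;
    do {
        desired = (old > UINT64_MAX - n) ? UINT64_MAX : old + n;   // the "modify" step
    } while (!counter.compare_exchange_weak(old, desired, std::memory_order_relaxed));
    // compare_exchange_weak reloads `old` on failure, so each retry works with fresh data.
}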

The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.

(lock is implicit in xchg reg, [mem], so don't use xchg with mem to save code-size or instruction count unless performance is irrelevant. Only use it when you want the memory barrier and/or the atomic exchange, or when code-size is the only thing that matters, e.g. in a boot sector.)

See also: Can num++ be atomic for 'int num'?

From the insn ref manual (Intel x86 manual vol2), cmpxchg:

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.

The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.

Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. i.e. x86 can't do relaxed atomic RMW (without flushing the store buffer). Other ISAs can, so using .fetch_add(1, memory_order_relaxed) can be faster on non-x86.
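
For example, a relaxed increment like the one below still becomes a locked RMW (e.g. lock xadd) on x86, i.e. a full barrier anyway, whereas on an ISA like AArch64 (with LSE) it can compile to a plain ldadd with no extra ordering. Treat the comments as typical codegen, not a guarantee for any particular compiler version:

#include <atomic>

std::atomic<int> hits;

void count() {
    hits.fetch_add(1, std::memory_order_relaxed);   // x86: lock xadd (full barrier regardless of the relaxed order)
                                                    // AArch64 + LSE: ldadd (no extra ordering)
}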

Fun fact: Before mfence existed, a common idiom was lock add dword [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE as a stand-alone memory barrier, especially on AMD CPUs.

xchg [mem], reg is probably the most efficient way to implement a sequential-consistency store, vs. mov+mfence, on both Intel and AMD. mfence on Skylake at least blocks out-of-order execution of non-memory instructions, but xchg and other locked ops don't. Compilers other than gcc do use xchg for stores, even when they don't care about reading the old value.
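
That trade-off shows up in a plain seq_cst store; a sketch (which instruction you get, xchg or mov+mfence, depends on the compiler and version, so check the actual output):

#include <atomic>

std::atomic<int> ready;

void publish() {
    ready.store(1, std::memory_order_seq_cst);   // typically either  xchg [ready], reg   or   mov [ready], 1 ; mfence
}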

Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee atomicity for aligned accesses of bus-width or smaller.

For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.

Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.

Since you linked one in the question, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. asm for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.
