Why is integer assignment on a naturally aligned variable atomic?


I've been reading this article about atomic operations, and it mentions 32bit integer assignment being atomic on x86, as long as the variable is naturally aligned.

Why does natural alignment assure atomicity?

Solution

"Natural" alignment means aligned to its own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. cache-line or page).

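A minimal sketch of why that matters (the function name is hypothetical): a 4-byte object whose address is a multiple of 4 can never straddle any boundary whose width is a multiple of 4, such as a 64-byte cache line or a 4096-byte page.

#include <cstddef>
#include <cstdint>
#include <cstdio>

// True if an object of `size` bytes at `addr` straddles a boundary of
// width `boundary` (e.g. a 64-byte cache line or a 4096-byte page).
bool crosses_boundary(std::uintptr_t addr, std::size_t size, std::size_t boundary) {
    return (addr / boundary) != ((addr + size - 1) / boundary);
}

int main() {
    std::printf("%d\n", crosses_boundary(0x3E, 4, 64)); // 1: bytes 0x3E..0x41 split across the 0x40 line
    std::printf("%d\n", crosses_boundary(0x3C, 4, 64)); // 0: naturally aligned, stays within one line
}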

The only sane way for an x86 C or C++ compiler to compile an assignment to a 32bit variable is with a single store instruction. It could store the bytes one at a time, but we can be sure it doesn't. The SystemV AMD64 ABI doesn't specifically forbid perverse compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B.
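As a sketch of what that looks like in practice (variable and function names hypothetical), gcc and clang targeting x86-64 compile a plain assignment to an aligned int into a single mov:

int shared_counter;        // the ABI gives this 4B alignment

void publish(int v) {
    shared_counter = v;    // typically compiles to: mov DWORD PTR [rip+shared_counter], edi
}

That single-instruction store is what you get in practice; as the next paragraph explains, only std::atomic gives you a guarantee from the language itself.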

Data races are Undefined Behaviour in both C and C++, so compilers can and do assume that memory is not asynchronously modified. For code that is guaranteed not to break, and actually stores or loads instead of keeping values cached in registers, use C11 stdatomic or C++11 std::atomic.

std::atomic<int> shared;  // shared variable (in aligned memory)

int x;  // local variable (compiler can keep it in a register)
x = shared.load(std::memory_order_relaxed);
shared.store(x, std::memory_order_relaxed);
// shared = x;  // don't do that, the default is seq_cst so stores need MFENCE

Thus, we just need to talk about the behaviour of an insn like mov [shared], eax.


TL;DR: The x86 ISA guarantees that such stores and loads are atomic, up to 64bits wide. So we're fine as long as we ensure the compiler generates those.


IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".

From the "Intel® 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (See also the x86 tag wiki for links: current versions of all volumes, or a direct link to page 256 of the vol3 pdf from Dec 2015.)

In x86 terminology, a "word" is two bytes. 32bits are a double-word, or DWORD.

Section 8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary
  • Reading or writing a doubleword aligned on a 32-bit boundary (This is another way of saying "natural alignment")

That last point is the answer to your question: this behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).


The rest of the section provides further guarantees for newer CPUs. I haven't read the AMD manuals, but as I understand it, modern AMD CPUs provide guarantees at least as strong as those documented by Intel for P6. There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the x86 tag wiki). It's not usefully skimmable since they define some symbols to express things in their own notation, and I haven't tried to really read it.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

  • Reading or writing a quadword aligned on a 64-bit boundary (e.g. x87 load/store of a double, or cmpxchg8b which was new in Pentium (P5))
  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line. (i.e. write-back or write-through memory regions, not uncacheable or write-combining. They don't mean that the cache-line has to already be hot in L1 cache)

The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:

"An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses."

So 64bit x87 and MMX/SSE loads/stores up to 64b (e.g. movsd, movq, movhps, pinsrq, extractps, etc) are atomic if all the data comes from or goes to the same cache line. On some CPUs with 128b or 256b data paths between execution units and L1 cache, 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard. If you want atomic 128b across all x86 systems, you must use cmpxchg16b (available only in 64bit mode).
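In C++ terms, a hedged sketch of the 128b case (the struct name is hypothetical): std::atomic over a 16-byte type is lock-free on x86-64 only when the compiler can use cmpxchg16b (e.g. gcc/clang with -mcx16; otherwise it may fall back to a lock inside libatomic).

#include <atomic>
#include <cstdint>

struct alignas(16) Pair {          // 16-byte payload, naturally aligned
    std::uint64_t lo, hi;
};

std::atomic<Pair> shared_pair;

Pair read_pair() {
    // The ISA guarantees no plain 16-byte atomic mov, so even a relaxed
    // load is typically implemented as lock cmpxchg16b (an RMW).
    return shared_pair.load(std::memory_order_relaxed);
}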

Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 with threads running on separate sockets, connected with HyperTransport.


Atomic Read-Modify-Write

I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).

To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic.
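In C++ you reach these locked RMW instructions through the std::atomic member functions; a brief sketch (variable names hypothetical) of what compilers typically emit on x86:

#include <atomic>

std::atomic<int> counter{0};

void increment() {
    // Typically compiles to: lock add DWORD PTR [counter], 1
    // (or lock xadd if the old value is needed).
    counter.fetch_add(1, std::memory_order_relaxed);
}

bool try_update(int expected, int desired) {
    // Typically compiles to: lock cmpxchg DWORD PTR [counter], esi
    return counter.compare_exchange_strong(expected, desired);
}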

The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.

(lock is implicit in xchg reg, [mem], so don't try to save code-size or instruction count with it if you care about performance. Only use it when you want the memory barrier and atomic effect, or when code-size is the only thing that matters, e.g. in a boot sector.)
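Relatedly, a seq_cst store needs a full barrier on x86, so compilers typically use xchg (with its implicit lock) rather than mov followed by mfence. A minimal sketch, assuming gcc/clang code generation (the variable name is hypothetical):

#include <atomic>

std::atomic<int> flag{0};

void set_flag() {
    // Typically compiles to: mov eax, 1 / xchg DWORD PTR [flag], eax
    // because the implicit lock of xchg doubles as the required barrier.
    flag.store(1, std::memory_order_seq_cst);
}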

See also: Can num++ be atomic for 'int num'?


Why lock mov [mem], reg doesn't exist for atomic unaligned stores

From the insn ref manual (Intel x86 manual vol2), cmpxchg:

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.

The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.

Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. (Fun fact: Before mfence existed, a common idiom was lock add [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE on AMD CPUs.)
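If you want that full barrier from portable C++ rather than hand-written asm, a hedged sketch: std::atomic_thread_fence with seq_cst is what compilers lower to mfence (or to a locked dummy RMW on the stack, the modern form of the lock add [esp], 0 idiom):

#include <atomic>

void full_barrier() {
    // gcc/clang typically emit mfence here; some compilers instead emit a
    // locked no-op RMW on the stack, echoing the old lock add [esp], 0 trick.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}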


Motivation for this design decision:

Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee for aligned accesses of bus-width or smaller.

For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.


Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.

BTW, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.
