Why does a std::atomic store with sequential consistency use XCHG?

Question
Why is std::atomic's store:

std::atomic<int> my_atomic;
my_atomic.store(1, std::memory_order_seq_cst);

doing an xchg when a store with sequential consistency is requested?
Shouldn't, technically, a normal store with a read/write memory barrier be enough? Equivalent to:
_ReadWriteBarrier(); // Or `asm volatile("" ::: "memory");` for gcc/clang
my_atomic.store(1, std::memory_order_acquire);
I'm explicitly talking about x86 & x86_64. Where a store has an implicit acquire fence.
mov-store + mfence and xchg are both valid ways to implement a sequential-consistency store on x86. The implicit lock prefix on an xchg with memory makes it a full memory barrier, like all atomic RMW operations on x86. (Unfortunately for other use-cases, x86 doesn't provide a way to do a relaxed or acq_rel atomic increment, only seq_cst.)
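Both choices implement the same C++-level operation; a minimal sketch (function names are mine, not from the answer) of the seq_cst store and its equivalent spelled-out form:

```cpp
#include <atomic>

std::atomic<int> g{0};

// On x86-64, GCC compiles this to mov + mfence; clang, ICC and MSVC
// compile it to xchg (whose implicit lock prefix is a full barrier).
void seq_cst_store(int v) {
    g.store(v, std::memory_order_seq_cst);
}

// The same strength spelled out: a release store (a plain mov on x86)
// followed by a full barrier.
void seq_cst_store_split(int v) {
    g.store(v, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```

Either function is a correct seq_cst store; the question in this Q&A is which one is cheaper.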
Plain mov is not sufficient; it only has release semantics, not sequential-release. (Unlike AArch64's stlr instruction, which does do a sequential-release store. This choice is obviously motivated by C++11 having seq_cst as the default memory ordering. But AArch64's normal store is much weaker; relaxed, not release.) See Jeff Preshing's article on acquire / release semantics, and note that regular release allows reordering with later operations. (If the release-store is releasing a lock, it's ok for later stuff to appear to happen inside the critical section.)
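That reordering is observable with the classic store-buffering litmus test. A sketch (harness and names are mine): with seq_cst stores the two loads can never both see 0; if the stores were only release-strength, x86's store buffer would allow that outcome.

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};

// Each thread stores to one variable, then loads the other.
// With seq_cst on both stores and loads, r1 == 0 && r2 == 0 is impossible.
std::pair<int, int> run_once() {
    x.store(0); y.store(0);
    int r1 = -1, r2 = -1;
    std::thread t1([&] { x.store(1, std::memory_order_seq_cst);
                         r1 = y.load(std::memory_order_seq_cst); });
    std::thread t2([&] { y.store(1, std::memory_order_seq_cst);
                         r2 = x.load(std::memory_order_seq_cst); });
    t1.join(); t2.join();
    return {r1, r2};
}
```

Weakening the stores to memory_order_release would let a run return {0, 0} on real x86 hardware (each store hiding in the store buffer past the other thread's load).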
There are performance differences between mfence and xchg on different CPUs, and maybe in the hot vs. cold cache and contended vs. uncontended cases. And/or for throughput of many operations back-to-back in the same thread vs. for one on its own, and for allowing surrounding code to overlap execution with the atomic operation.
See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for actual benchmarks of mfence vs. lock addl $0, -8(%rsp) vs. (%rsp) as a full barrier (when you don't already have a store to do).
On Intel Skylake hardware, mfence blocks out-of-order execution of independent ALU instructions, but xchg doesn't. (See my test asm + results at the bottom of this SO answer.) Intel's manuals don't require it to be that strong; only lfence is documented to do that. But as an implementation detail, it's very expensive for out-of-order execution of surrounding code on Skylake.
I haven't tested other CPUs, and this may be a result of a microcode fix for erratum SKL079, SKL079 MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. The existence of the erratum basically proves that SKL used to be able to execute instructions after MFENCE. I wouldn't be surprised if they fixed it by making MFENCE stronger in microcode, kind of a blunt instrument approach that significantly increases the impact on surrounding code.
I've only tested the single-threaded case where the cache line is hot in L1d cache. (Not when it's cold in memory, or when it's in Modified state on another core.) xchg has to load the previous value, creating a "false" dependency on the old value that was in memory. But mfence forces the CPU to wait until previous stores commit to L1d, which also requires the cache line to arrive (and be in M state). So they're probably about equal in that respect, but Intel's mfence forces everything to wait, not just loads.
AMD's optimization manual recommends xchg for atomic seq-cst stores. I thought Intel recommended mov + mfence, which gcc uses, but Intel's compiler also uses xchg here.
When I tested, I got better throughput on Skylake for xchg than for mov + mfence in a single-threaded loop on the same location repeatedly. See Agner Fog's microarch guide and instruction tables for some details, but he doesn't spend much time on locked operations.
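A sketch of the kind of single-threaded loop such a measurement uses (the timing harness is mine; absolute numbers depend heavily on the CPU and on which instruction sequence the compiler picks):

```cpp
#include <atomic>
#include <chrono>

std::atomic<int> slot{0};

// Time `iters` back-to-back seq_cst stores to the same hot cache line.
// On x86 this exercises xchg (clang/ICC/MSVC) or mov + mfence (gcc).
double ns_per_store(long iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        slot.store(1, std::memory_order_seq_cst);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

Comparing this against a release store plus std::atomic_thread_fence(std::memory_order_seq_cst) in the loop body reproduces the mov + mfence vs. xchg comparison without writing asm.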
See gcc/clang/ICC/MSVC output on the Godbolt compiler explorer for a C++11 seq-cst my_atomic = 4; gcc uses mov + mfence when SSE2 is available (use -m32 -mno-sse2 to get gcc to use xchg too). The other 3 compilers all prefer xchg with default tuning, or for znver1 (Ryzen) or skylake.
The Linux kernel uses xchg for __smp_store_mb().
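In portable C++ terms, the xchg-as-store idea is an atomic exchange whose result is discarded (a sketch; the kernel's actual __smp_store_mb() is inline asm, not this):

```cpp
#include <atomic>

// A store plus full memory barrier in one operation: on x86,
// exchange compiles to a single xchg, whose implicit lock prefix
// is a full fence.
template <class T>
void store_mb(std::atomic<T>& a, T v) {
    (void)a.exchange(v, std::memory_order_seq_cst);
}
```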
So it appears that gcc should be using xchg, unless they have some benchmark results that nobody else knows about.
Another interesting question is how to compile atomic_thread_fence(mo_seq_cst);. The obvious option is mfence, but lock or dword [rsp], 0 is another valid option (and used by gcc -m32 when MFENCE isn't available). The bottom of the stack is usually already hot in cache in M state. The downside is introducing latency if a local was stored there. (If it's just a return address, return-address prediction is usually very good so delaying ret's ability to read it is not much of a problem.) So lock or dword [rsp-4], 0 could be worth considering in some cases. (gcc did consider it, but reverted it because it makes valgrind unhappy. This was before it was known that it might be better than mfence even when mfence was available.)
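In C++ source, such a stand-alone barrier pairs relaxed operations with atomic_thread_fence, so the fence (not the store) is what compiles to mfence or a locked RMW. A sketch of the store-buffering test rewritten with fences (names are mine):

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> a{0}, b{0};

// Relaxed store, stand-alone seq_cst fence, relaxed load of the other
// variable. The seq_cst fences are totally ordered, so the outcome
// r1 == 0 && r2 == 0 is forbidden, just as with seq_cst stores.
std::pair<int, int> fence_litmus() {
    a.store(0); b.store(0);
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        a.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = b.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        b.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = a.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    return {r1, r2};
}
```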
All compilers currently use mfence for a stand-alone barrier when it's available. Those are rare in C++11 code, but more research is needed on what's actually most efficient for real multi-threaded code that has real work going on inside the threads that are communicating locklessly.
But multiple sources recommend using lock add to the stack as a barrier instead of mfence, so the Linux kernel recently switched to using it for the smp_mb() implementation on x86, even when SSE2 is available.
See https://groups.google.com/d/msg/fa.linux.kernel/hNOoIZc6I9E/pVO3hB5ABAAJ for some discussion, including a mention of some errata for HSW/BDW about movntdqa loads from WC memory passing earlier locked instructions. (Opposite of Skylake, where it was mfence instead of locked instructions that were a problem. But unlike SKL, there's no fix in microcode. This may be why Linux still uses mfence for its mb() for drivers, in case anything ever uses NT loads to copy back from video RAM or something but can't let the reads happen until after an earlier store is visible.)
- In Linux 4.14, smp_mb() uses mb(). That uses mfence if available, otherwise lock addl $0, 0(%esp). __smp_store_mb (store + memory barrier) uses xchg (and that doesn't change in later kernels).
- In Linux 4.15, smp_mb() uses lock; addl $0, -4(%esp) (or %rsp), instead of using mb(). (The kernel doesn't use a red-zone even in 64-bit, so the -4 may help avoid extra latency for local vars.) mb() is used by drivers to order access to MMIO regions, but smp_mb() turns into a no-op when compiled for a uniprocessor system. Changing mb() is riskier because it's harder to test (it affects drivers), and CPUs have errata related to lock vs. mfence. But anyway, mb() uses mfence if available, else lock addl $0, -4(%esp). The only change is the -4.
- In Linux 4.16, no change except removing the #if defined(CONFIG_X86_PPRO_FENCE), which defined stuff for a more weakly-ordered memory model than the x86-TSO model that modern hardware implements.
x86 & x86_64. Where a store has an implicit acquire fence
You mean release, I hope. my_atomic.store(1, std::memory_order_acquire); won't compile, because write-only atomic operations can't be acquire operations. See also Jeff Preshing's article on acquire/release semantics.
Or
asm volatile("" ::: "memory");
No, that's a compiler barrier only; it prevents all compile-time reordering across it, but doesn't prevent runtime StoreLoad reordering, i.e. the store being buffered until later, and not appearing in the global order until after a later load. (StoreLoad is the only kind of runtime reordering x86 allows.)
Anyway, another way to express what you want here is:
my_atomic.store(1, std::memory_order_release); // mov
// with no operations in between, there's nothing for the release-store to be delayed past
std::atomic_thread_fence(std::memory_order_seq_cst); // mfence
Using a release fence would not be strong enough (it and the release-store could both be delayed past a later load, which is the same thing as saying that release fences don't keep later loads from happening early). A release-acquire fence would do the trick, though, keeping later loads from happening early and not itself being able to reorder with the release store.
Related: Jeff Preshing's article on fences being different from release operations.
But note that seq-cst is special according to C++11 rules: only seq-cst operations are guaranteed to have a single global / total order which all threads agree on seeing. So emulating them with weaker order + fences might not be exactly equivalent in general on the C++ abstract machine, even if it is on x86. (On x86, all stores have a single total order which all cores agree on. See also Globally Invisible load instructions: loads can take their data from the store buffer, so we can't really say that there's a total order for loads + stores.)