Multithreading atomics a b printing 00 for memory_order_relaxed
Problem description
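In the first program below, the write to a in foo is held in the store buffer and is not visible to the load into ra in bar. Similarly, the write to b in bar is not visible to the load into rb in foo, so the program prints 00.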
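The second program below is almost the same as the first, except that variable b is no longer used: both foo and bar store to and load from the same variable a, and the loaded values are stored in ra1 and ra2. In this case I never get "00", at least not after running for 5 minutes.
- In the second case, why doesn't it print 00? How come the write to a is not held in the CPU cache for both threads, so that 00 is printed?
- Is this specific to x86_64, and would arm/arm64/power print 00?
- If arm/arm64/power does print 00, would an smp_mb() after the store in foo and bar fix it?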
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00"; done prints 00 within 1min
#include<atomic>
#include<thread>
#include<cstdio>
using namespace std;
atomic<long> a,b;
long ra,rb;
void foo(){
a.store(1,memory_order_relaxed);
rb=b.load(memory_order_relaxed);
}
void bar(){
b.store(1,memory_order_relaxed);
ra=a.load(memory_order_relaxed);
}
int main(){
thread t[2]{ thread(foo),thread(bar)};
t[0].join();t[1].join();
if((ra==0) && (rb==0)) printf("00\n"); // each cpu store buffer writes not visible to other threads.
}
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00"; done doesn't print 00 within 5 min
#include<atomic>
#include<thread>
#include<cstdio>
using namespace std;
atomic<long> a,b;
long ra1,ra2;
void foo(){
a.store(1,memory_order_relaxed);
ra1=a.load(memory_order_relaxed);
}
void bar(){
a.store(1,memory_order_relaxed);
ra2=a.load(memory_order_relaxed);
}
int main(){
thread t[2]{ thread(foo),thread(bar)};
t[0].join();t[1].join();
if((ra1==0) && (ra2==0)) printf("00\n"); // each cpu store buffer writes not visible to other threads.
}
Recommended answer
a.store(1, mo_relaxed) is sequenced before a.load in the same thread (in both foo and bar), so both loads must see that store's result (or the value of some later store). That makes it impossible for either load to see the initial 0.
A thread always sees its own operations in program order, even if they're on atomic objects using mo_relaxed. That's basically equivalent to making sure the stores and loads happen in asm in program order, but without any extra barriers to prevent runtime reordering as observed by other threads, much as if you'd used volatile. (But don't.) The cardinal rule of out-of-order execution is "don't break single-threaded code".
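And BTW, you're actually correct that the value can be forwarded directly from the store buffer before it hits L1d cache and becomes globally visible. (That's because you didn't use mo_seq_cst; seq_cst loads can't happen until previous seq_cst stores are globally visible. On x86, for example, the store would have to compile to an xchg, or to mov + mfence. Semi-related: see "Globally Invisible load instructions" about where load results come from on x86, although the general point about store forwarding applies to most mainstream CPUs, including most ARM.)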
So in practice the loads are very likely to see the 1 stored by their own thread, not the 1 from the other thread. The code compiles to asm that allows the store to forward to the load, and the load comes right after the store, so it's probably already executing and waiting for the store data to be forwarded before there's any window for the other thread's store to become visible between them, unless an interrupt arrives between the store and the load.
You could check by storing a 1 in one thread and a 2 in the other, for example, to see if you always get 12 or sometimes get 11 or 22, i.e. one thread's load seeing the other thread's store.
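Your analysis of why you can see 00 in your version using 2 variables is pretty sloppy.
"In the below code the write to a in foo is stored in store buffer and not visible to the ra in bar. Similarly the write to b in bar is not visible to rb in foo"
Yes, the store buffer is the normal cause of StoreLoad reordering, and if both foo and bar happen to execute at nearly the same time, then yes, both loads can happen and grab the old values before either store can get itself committed to L1d cache. So if that does happen, it's because of the store buffer.
But the store buffer is always trying to drain itself as fast as possible and commit pending stores to L1d, where they're globally visible. That's why it's rare to actually see 00. Usually one core will get exclusive ownership of the cache line and commit its store before the other core's load can run.
It's definitely not true that a write to a will be invisible to a load in another thread. It might or might not happen.
(Semi-related: StoreLoad reordering is the "most important" one for performance, and the most expensive one to block. For example, x86 asm always blocks the other kinds, its memory model being program order plus store forwarding, so things like reordering 2 stores with respect to each other can only happen at compile time on x86. See "How does memory reordering help processors and compilers?")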