Multithreading atomics a b printing 00 for memory_order_relaxed

Problem description

In the code below, the write to a in foo sits in the store buffer and is not visible to the load into ra in bar. Similarly, the write to b in bar is not visible to the load into rb in foo, so the program prints 00.

// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00"; done prints 00 within 1min
#include<atomic>
#include<thread>
#include<cstdio>
using namespace std;
atomic<long> a,b;
long ra,rb;
void foo(){
        a.store(1,memory_order_relaxed);
        rb=b.load(memory_order_relaxed);
}
void bar(){
        b.store(1,memory_order_relaxed);
        ra=a.load(memory_order_relaxed);
}
int main(){
  thread t[2]{ thread(foo),thread(bar)};
  t[0].join();t[1].join();
  if((ra==0) && (rb==0)) printf("00\n"); // each cpu store buffer writes not visible to other threads.
}

The code below is almost the same as above, except that the variable b is no longer used: both foo and bar operate on the same variable a, and the loaded values are stored in ra1 and ra2. In this case I never get a "00", at least not after running for 5 minutes.

  1. In the second case, why doesn't it print 00? How come the writes to a are not held in each thread's CPU cache, so that both loads read 0 and it prints 00?
  2. Is this specific to x86_64? Would it print 00 on arm/arm64/power?
  3. If arm/arm64/power prints 00, will an smp_mb() after the store in foo and bar fix it?

// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00"; done doesn't print 00 within 5 min
#include<atomic>
#include<thread>
#include<cstdio>
using namespace std;
atomic<long> a,b;
long ra1,ra2;
void foo(){
        a.store(1,memory_order_relaxed);
        ra1=a.load(memory_order_relaxed);
}
void bar(){
        a.store(1,memory_order_relaxed);
        ra2=a.load(memory_order_relaxed);
}
int main(){
  thread t[2]{ thread(foo),thread(bar)};
  t[0].join();t[1].join();
  if((ra1==0) && (ra2==0)) printf("00\n"); // each cpu store buffer writes not visible to other threads.
}

Solution

a.store(1, mo_relaxed) is sequenced before a.load in the same thread (in both foo and bar), so each load must see its own thread's store, or a value that comes later in the modification order of a. That makes it impossible for either load to see the initial 0. (mo_relaxed here is shorthand for memory_order_relaxed.)

A thread always sees its own operations in program order, even if they're on atomic objects using mo_relaxed. That's basically equivalent to making sure the stores and loads happen in the asm in program order, as if you'd used volatile (but don't), without any extra barriers to prevent runtime reordering as observed by other threads. The cardinal rule of out-of-order execution is "don't break single-threaded code".


And BTW, you're actually correct that the value can be forwarded directly from the store buffer before it hits L1d cache and becomes globally visible. (That's possible because you didn't use mo_seq_cst; seq_cst loads can't happen until previous seq_cst stores are globally visible. For example, on x86 a seq_cst store has to compile to xchg, or to mov + mfence. Semi-related: the Q&A "Globally Invisible load instructions" covers where load results come from on x86, although the general point about store-forwarding applies to most mainstream CPUs, including most ARM.)

So in practice the loads are very likely to see the 1 stored by their own thread, not the 1 from the other thread. The code compiles to asm that allows the store to forward to the load, and the load comes right after the store, so it's probably already executing and waiting for the store data to be forwarded. That leaves no window for the other thread's store to become visible between them, unless an interrupt arrives between the store and the load.

You could check by having one thread store 1 and the other store 2, for example, to see if you always get 12 or sometimes get 11 or 22 (one load seeing the other thread's store). Getting 21 is ruled out: both loads seeing the other thread's value would require the two stores to be ordered both ways in the modification order of a.


Your analysis of why you can see 00 in your version using 2 variables is pretty sloppy.

"In the below code the write to a in foo is stored in store buffer and not visible to the ra in bar. Similarly the write to b in bar is not visible to rb in foo"

Yes, the store buffer is the normal cause of StoreLoad reordering. If foo and bar happen to execute at nearly the same time, both loads can run and grab the old values before either store has committed to L1d cache. So if that does happen, then yes, it's because of the store buffer.

But the store buffer is always trying to drain itself as fast as possible and commit pending stores to L1d where they're globally visible. That's why it's rare to actually see 00. Usually one core will get exclusive ownership of the cache line and commit its store before the other core's load can run.

It's definitely not true that a write to a will always be invisible to a load in another thread. It might or might not be visible, depending on timing.

(Semi-related: StoreLoad reordering is the "most important" one for performance, and the most expensive one to block. For example, x86 asm always blocks the other kinds: its memory model is program order plus a store buffer with store-forwarding, so things like reordering 2 stores with respect to each other can only happen at compile time on x86. See "How does memory reordering help processors and compilers?")
