Can't relaxed atomic fetch_add reorder with later loads on x86, like store can?

Question

This program sometimes prints 00. But if I comment out a.store and b.store and uncomment a.fetch_add and b.fetch_add, which do exactly the same thing (both leave a=1 and b=1), I never get 00. (Tested on an x86-64 Intel i3 with g++ -O2.)

Am I missing something, or can "00" never occur?

This is the version with plain stores, which can print 00:

// g++ -O2 -pthread axbx.cpp  ; while [ true ]; do ./a.out  | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;

void foo(){
        //a.fetch_add(1,memory_order_relaxed);
        a.store(1,memory_order_relaxed);
        retb=b.load(memory_order_relaxed);
}

void bar(){
        //b.fetch_add(1,memory_order_relaxed);
        b.store(1,memory_order_relaxed);
        reta=a.load(memory_order_relaxed);
}

int main(){
        thread t[2]{ thread(foo),thread(bar) };
        t[0].join(); t[1].join();
        printf("%d%d\n",reta,retb);
        return 0;
}

The version below, with fetch_add instead of store, never prints 00:

// g++ -O2 -pthread axbx.cpp  ; while [ true ]; do ./a.out  | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;

void foo(){
        a.fetch_add(1,memory_order_relaxed);
        //a.store(1,memory_order_relaxed);
        retb=b.load(memory_order_relaxed);
}

void bar(){
        b.fetch_add(1,memory_order_relaxed);
        //b.store(1,memory_order_relaxed);
        reta=a.load(memory_order_relaxed);
}

int main(){
        thread t[2]{ thread(foo),thread(bar) };
        t[0].join(); t[1].join();
        printf("%d%d\n",reta,retb);
        return 0;
}

See also the related question about multithreaded atomics a and b printing 00 with memory_order_relaxed.

Recommended answer

The standard allows 00, but you'll never get it on x86 (without compile-time reordering). The only way to implement an atomic RMW on x86 involves a lock prefix, which is a "full barrier" and is therefore strong enough for seq_cst.
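One way to check this on your own toolchain is to compile a minimal sketch like the one below with g++ -O2 -S and compare the generated code for the different memory orders. The names counter, bump_relaxed, bump_seq_cst, bump_and_get_old and check_increments are hypothetical, and the instructions named in the comments are what GCC and Clang typically emit for x86-64, not something the standard promises.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> counter{0};  // hypothetical shared counter

// Relaxed RMW with the result unused. On x86-64 this typically becomes
// a single `lock add` on the memory location.
void bump_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }

// seq_cst RMW with the result unused. On x86-64 this typically becomes
// the very same `lock add`, because a locked instruction is already a
// full barrier.
void bump_seq_cst() { counter.fetch_add(1, std::memory_order_seq_cst); }

// Relaxed RMW whose old value is used. Compilers typically switch to
// `lock xadd` here so the previous value comes back in a register.
int bump_and_get_old() {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Sanity check: at the C++ level all three are the same atomic increment.
void check_increments() {
    counter.store(0);
    bump_relaxed();
    bump_seq_cst();
    assert(bump_and_get_old() == 2);  // two earlier increments are visible
    assert(counter.load() == 3);
}
```

If the relaxed and seq_cst functions produce identical asm on your compiler, that is the "promotion" described above: the hardware has no weaker atomic RMW to offer.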

In C++ terms, atomic RMWs are effectively promoted to seq_cst when compiling for x86. That holds only once any compile-time reordering has been nailed down: non-atomic loads and stores can reorder or combine across a relaxed fetch_add, as can other relaxed operations, and acquire or release operations allow one-way reordering. In practice, compilers are less likely to reorder atomic operations with each other, because they don't combine them, and combining is one of the main reasons for compile-time reordering.

In fact, most compilers implement a.store(1, memory_order_seq_cst) with xchg (which has an implicit lock prefix), because it's faster than mov + mfence on modern CPUs. Turning 0 into 1 with lock add, as the only write to each object, is exactly equivalent. Fun fact: with just store and load, your code will compile to the same asm as https://preshing.com/20120515/memory-reordering-caught-in-the-act/, so the discussion there applies.
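As a rough illustration of the seq_cst case, the sketch below reruns the question's two-thread test in a loop with seq_cst stores and loads and counts how often "00" shows up. The helper name count_00_with_seq_cst is hypothetical. With every operation seq_cst, the C++ memory model rules out 00 on any ISA, so the count should be zero. Keep in mind that thread start-up cost means the two threads rarely overlap, so this is a sanity check rather than a stress test.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store/load test `iterations` times with
// seq_cst operations and returns how many runs observed reta == retb == 0.
int count_00_with_seq_cst(int iterations) {
    assert(iterations >= 0);
    int zeros = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> a{0}, b{0};
        int reta = -1, retb = -1;  // each written by exactly one thread

        std::thread t1([&] {
            a.store(1, std::memory_order_seq_cst);  // typically xchg on x86
            retb = b.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            b.store(1, std::memory_order_seq_cst);
            reta = a.load(std::memory_order_seq_cst);
        });
        t1.join();  // join makes reta/retb safe to read here
        t2.join();

        if (reta == 0 && retb == 0) ++zeros;
    }
    return zeros;
}
```

Switching both stores to memory_order_relaxed gives back the question's first program, where x86's store buffer can let each load run ahead of the preceding store.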

ISO C++ allows the whole relaxed RMW to reorder with the relaxed load, but normal compilers won't do that at compile time for no reason. (A DeathStation 9000 C++ implementation could, and would.) So you've finally found a case where it would be useful to test on a different ISA: the ways in which an atomic RMW (or even parts of it) can reorder at run time depend a lot on the ISA.

An LL/SC machine that needs a retry loop to implement fetch_add (for example ARM, or AArch64 before ARMv8.1) may be able to truly implement a relaxed RMW that can reorder at run time, because anything stronger than relaxed requires barriers (or acquire/release versions of the instructions, such as AArch64 ldaxr / stlxr vs. ldxr / stxr). So if there's an asm difference between relaxed and acquire and/or release (and sometimes seq_cst differs too), that difference is probably necessary and is preventing some run-time reordering.
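To make the "retry loop" idea concrete, here is a minimal sketch of a relaxed fetch_add written by hand as a compare-exchange loop. This is only an analogy in portable C++, not real LL/SC: the names fetch_add_retry_loop and hammer are hypothetical, and the assumption is that on an LL/SC target without single-instruction atomics, the compiler lowers both this loop and a plain fetch_add to a load-exclusive / store-exclusive loop of roughly this shape.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: relaxed fetch_add as an explicit retry loop.
// Returns the value the object held just before the addition.
int fetch_add_retry_loop(std::atomic<int>& obj, int delta) {
    int old = obj.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously, much like a failed
    // store-exclusive. On failure it reloads `old`, so we simply retry.
    while (!obj.compare_exchange_weak(old, old + delta,
                                      std::memory_order_relaxed,
                                      std::memory_order_relaxed)) {
    }
    return old;
}

// Two threads each add 1 `per_thread` times. Atomicity of the RMW means
// no increment may be lost, even though every operation is relaxed.
int hammer(int per_thread) {
    assert(per_thread >= 0);
    std::atomic<int> x{0};
    auto work = [&] {
        for (int i = 0; i < per_thread; ++i) fetch_add_retry_loop(x, 1);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return x.load();
}
```

Relaxed ordering here only affects how the RMW may be ordered against other memory operations; the increment itself is still atomic, which is why the final count is exact.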

Even a single-instruction atomic operation might be truly relaxed on AArch64; I haven't investigated. Most weakly-ordered ISAs have traditionally used LL/SC atomics, so I might just be conflating the two.

In an LL/SC machine, the store side of an LL/SC RMW can even reorder with later loads separately from the load side, unless both are seq_cst. See the related question: For purposes of ordering, is atomic read-modify-write one operation or two?

To actually see 00, both loads would have to happen before the store part of the other thread's RMW became visible. And yes, I think the hardware reordering mechanism in an LL/SC machine would be pretty similar to the one that reorders a plain store.
