For purposes of ordering, is atomic read-modify-write one operation or two?


Consider an atomic read-modify-write operation such as x.exchange(..., std::memory_order_acq_rel). For purposes of ordering with respect to loads and stores to other objects, is this treated as:

  1. a single operation with acquire-release semantics?

  2. Or, as an acquire load followed by a release store, with the added guarantee that other loads and stores to x will observe both of them or neither?

If it's #2, then although no other operations in the same thread could be reordered before the load or after the store, it leaves open the possibility that they could be reordered in between the two.

As a concrete example, consider:

std::atomic<int> x, y;

void thread_A() {
    x.exchange(1, std::memory_order_acq_rel);
    y.store(1, std::memory_order_relaxed);
}

void thread_B() {
    // These two loads cannot be reordered
    int yy = y.load(std::memory_order_acquire);
    int xx = x.load(std::memory_order_acquire);
    std::cout << xx << ", " << yy << std::endl;
}

Is it possible for thread_B to output 0, 1?

If the x.exchange() were replaced by x.store(1, std::memory_order_release); then thread_B could certainly output 0, 1. Should the extra implicit load in exchange() rule that out?

cppreference makes it sound like #1 is the case and 0, 1 is forbidden:

A read-modify-write operation with this memory order is both an acquire operation and a release operation. No memory reads or writes in the current thread can be reordered before or after this store.

But I can't find anything explicit in the standard to support this. Actually the standard says very little about atomic read-modify-write operations at all, except 31.4 (10) in N4860 which is just the obvious property that the read has to read the last value written before the write. So although I hate to question cppreference, I'm wondering if this is actually correct.

I'm also looking at how it's implemented on ARM64. Both gcc and clang compile thread_A as essentially

ldaxr [x]
stlxr #1, [x]
str #1, [y]

(See on godbolt.) Based on my understanding of ARM64 semantics, and some tests (with a load of y instead of a store), I think that the str [y] can become visible before the stlxr [x] (though of course not before the ldaxr). This would make it possible for thread_B to observe 0, 1. So if #1 is true then it would seem that gcc and clang are both wrong, which I hesitate to believe.

Finally, as far as I can tell, replacing memory_order_acq_rel with seq_cst wouldn't change anything about this analysis, since it only adds semantics with respect to other seq_cst operations, and we don't have any here.


I found What exact rules in the C++ memory model prevent reordering before acquire operations? which, if I understand it correctly, seems to agree that #2 is correct, and that 0, 1 could be observed. I'd still appreciate confirmation, as well as a check on whether the cppreference quote is actually wrong or if I'm misunderstanding it.

Solution

Not an answer at the level of the language standard, but some evidence that in practice, the answer can be "two". And as I guessed in the question, this can happen even if the RMW is seq_cst.

I haven't been able to observe stores being reordered as in the original question, but here is an example that shows the store of an atomic seq_cst RMW being reordered with a following relaxed load.

The program below is an implementation of Peterson's algorithm adapted from LWimsey's example in What's are practical example where acquire release memory order differs from sequential consistency?. As explained there, the correct version of the algorithm involves

me.store(true, std::memory_order_seq_cst);
if (other.load(std::memory_order_seq_cst) == false) 
    // lock taken

where it is essential that the load become visible after the store.

If RMW were a single operation for the purposes of ordering semantics, we would expect that it would be safe to do

me.exchange(true, std::memory_order_seq_cst);
if (other.load(std::memory_order_relaxed) == false) {
    // Ensure critical section doesn't start until we know we have the lock
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // lock taken
}

on the theory that since the exchange operation has acquire semantics, the load must become visible after the exchange has completed, and in particular after the store of true to me has become visible.

But in fact on ARMv8-a, using either gcc or clang, such code frequently fails. It appears that in fact, exchange does consist of an acquire-load and a release-store, and that other.load may become visible before the release-store. (Though not before the acquire-load of the exchange, but that is irrelevant here.)

clang generates code like the following:

mov w11, #1
retry:
ldaxrb wzr, [me]
stlxrb w12, w11, [me]
cbnz w12, retry
ldrb w11, [other]

See https://godbolt.org/z/fhjjn7, lines 116-120 of the assembly output. (gcc is the same but buried inside a library function.) By ARM64 memory ordering semantics, the release-store stlxrb can be reordered with following loads and stores. The fact that it's exclusive doesn't change that.

To make the reordering happen more often, we arrange for the data being stored to depend on a previous load that missed cache, which we ensure by evicting that line with dc civac. We also need to put the two flags me and other on separate cache lines. Otherwise, as I understand it, even if thread A does its load before the store, then thread B has to wait to begin its RMW until after A's store completes, and in particular won't do its load until A's store is visible.

On a multi-core Cortex A72 (Raspberry Pi 4B), the assertion typically fails after a few thousand iterations, which is nearly instantaneous.

The code needs to be built with -O2. I suspect it will not work if built for ARMv8.2 or higher, where swpalb is available.

// Based on https://stackoverflow.com/a/41859912/634919 by LWimsey
#include <thread>
#include <atomic>
#include <cassert>

// size that's at least as big as a cache line
constexpr size_t cache_line_size = 256;

static void take_lock(std::atomic<bool> &me, std::atomic<bool> &other) {
    alignas(cache_line_size) bool uncached_true = true;
    for (;;) {
        // Evict uncached_true from cache.
        asm volatile("dc civac, %0" : : "r" (&uncached_true) : "memory");
        
        // So the release store to `me` may be delayed while
        // `uncached_true` is loaded.  This should give the machine
        // time to proceed with the load of `other`, which is not
        // forbidden by the release semantics of the store to `me`.
        
        me.exchange(uncached_true, std::memory_order_seq_cst);
        if (other.load(std::memory_order_relaxed) == false) {
            // taken!
            std::atomic_thread_fence(std::memory_order_seq_cst);
            return;
        }
        // start over
        me.store(false, std::memory_order_seq_cst);
    }
}

static void drop_lock(std::atomic<bool> &me) {
    me.store(false, std::memory_order_seq_cst);
}

alignas(cache_line_size) std::atomic<int> counter{0};

static void critical_section(void) {
    // We should be the only thread inside here.
    int tmp = counter.fetch_add(1, std::memory_order_seq_cst);
    assert(tmp == 0);
    
    // Delay to give the other thread a chance to try the lock
    for (int i = 0; i < 100; i++)
        asm volatile("");
    
    tmp = counter.fetch_sub(1, std::memory_order_seq_cst);
    assert(tmp == 1);
}    

static void busy(std::atomic<bool> *me, std::atomic<bool> *other)
{
    for (;;) {  
        take_lock(*me, *other);
        std::atomic_thread_fence(std::memory_order_seq_cst); // paranoia
        critical_section();
        std::atomic_thread_fence(std::memory_order_seq_cst); // paranoia
        drop_lock(*me);
    }
}


// The two flags need to be on separate cache lines.
alignas(cache_line_size) std::atomic<bool> flag1{false}, flag2{false};

int main()
{
    std::thread t1(busy, &flag1, &flag2);
    std::thread t2(busy, &flag2, &flag1);
    
    t1.join(); // will never happen
    t2.join();
    return 0;
}
