GCC reordering up across load with `memory_order_seq_cst`. Is this allowed?

Question

Using a simplified version of a basic seqlock, gcc reorders a nonatomic load up across an atomic load(memory_order_seq_cst) when compiling the code with -O3. This reordering isn't observed when compiling with other optimization levels or when compiling with clang (even at -O3). The reordering seems to violate a synchronizes-with relationship that should be established, and I'm curious to know why gcc reorders this particular load and whether this is even allowed by the standard.

Consider the following load function:

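auto load()
{
    std::size_t copy;
    std::size_t seq0 = 0, seq1 = 0;
    do
    {
        seq0 = seq_.load();   // first sequence read (memory_order_seq_cst by default)
        copy = value;         // nonatomic read of the protected data
        seq1 = seq_.load();   // second sequence read
    } while( seq0 & 1 || seq0 != seq1);   // retry if a write was in progress or intervened

    std::cout << "Observed: " << seq0 << '\n';
    return copy;
}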

Following seqlock procedure, this reader spins until it is able to load two instances of seq_, which is defined to be a std::atomic<std::size_t>, that are even (to indicate that a writer is not currently writing) and equal (to indicate that a writer has not written to value in between the two loads of seq_). Furthermore, because these loads are tagged with memory_order_seq_cst (as a default argument), I would imagine that the instruction copy = value; would be executed on each iteration, as it cannot be reordered up across the initial load, nor can it be reordered down below the latter.
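For illustration, here is a minimal sketch of both sides of the protocol described above. It is not the code from the question: the SeqLock struct, its store member, and seqlock_demo are hypothetical names, a single writer is assumed, and value is made a std::atomic (unlike in the question) so that the sketch itself contains no data race. With a single word of data the sequence check is not strictly needed; it is kept only to show the shape of the retry loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical single-writer seqlock; all names are illustrative.
struct SeqLock {
    std::atomic<std::size_t> seq_{0};
    std::atomic<std::size_t> value{0};  // atomic here (assumption) to keep the sketch race-free

    // Writer: the sequence number is odd while a write is in progress.
    void store(std::size_t v) {
        std::size_t s = seq_.load();
        seq_.store(s + 1);  // odd: write in progress
        value.store(v);
        seq_.store(s + 2);  // even: write complete
    }

    // Reader: retry until both sequence loads are even and equal.
    std::size_t load() const {
        std::size_t copy, seq0, seq1;
        do {
            seq0 = seq_.load();
            copy = value.load();
            seq1 = seq_.load();
        } while ((seq0 & 1) || seq0 != seq1);
        return copy;
    }
};

// Runs one writer against one reader and returns the last value read.
std::size_t seqlock_demo() {
    SeqLock lock;
    std::thread writer([&] {
        for (std::size_t i = 1; i <= 1000; ++i) lock.store(i);
    });
    std::size_t last = 0;
    while (last != 1000) {
        std::size_t v = lock.load();
        assert(v >= last);  // the writer stores increasing values
        last = v;
    }
    writer.join();
    return last;
}
```

Every operation here uses the default memory_order_seq_cst, which keeps the reasoning simple at some cost in performance; weaker orderings are possible but need more care.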

However, the generated assembly issues the load from value before the first load from seq_, and even performs it outside of the loop. This could lead to improper synchronization or torn reads of value that do not get resolved by the seqlock algorithm. Additionally, I've noticed that this only occurs when sizeof(value) is below 123 bytes. Changing value to some type of 123 bytes or more yields the correct assembly, in which value is loaded on each loop iteration in between the two loads of seq_. Is there any reason why this seemingly arbitrary threshold dictates which assembly is generated?

This test harness exposes the behavior on my Xeon E3-1505M, in which "Observed: 2" will be printed from the reader and the value 65535 will be returned. This combination of the observed value of seq_ and the returned load from value seems to violate the synchronizes-with relationship that should be established by the writer thread publishing seq_.store(2) with memory_order_release and the reader thread reading seq_ with memory_order_seq_cst.

Is it valid for gcc to reorder the load, and if so, why does it only do so when sizeof(value) is < 123? Clang does not reorder the load, regardless of the optimization level or sizeof(value). I believe Clang's codegen is the appropriate and correct approach.

Recommended answer

Congratulations, I think you've hit a bug in gcc!

Now I think you can make a reasonable argument, as the other answer does, that the original code you showed could perhaps have been correctly optimized that way by gcc by relying on a fairly obscure argument about the unconditional access to value: essentially you can't have been relying on a synchronizes-with relationship between the load seq0 = seq_.load(); and the subsequent read of value, so reading it "somewhere else" shouldn't change the semantics of a race-free program. I'm not actually sure of this argument, but here's a "simpler" case I got from reducing your code:

#include <atomic>
#include <iostream>

std::atomic<std::size_t> seq_;
std::size_t value;

auto load()
{
    std::size_t copy;
    std::size_t seq0;
    do
    {
        seq0 = seq_.load();
        if (!seq0) continue;
        copy = value;
        seq0 = seq_.load();
    } while (!seq0);

    return copy;
}

This isn't a seqlock or anything - it just waits for seq_ to change from zero to non-zero, and then reads value. The second read of seq_ is superfluous, as is the while condition, but without them the bug goes away.

This is now the read side of a well-known idiom which does work and is race-free: one thread writes to value, then sets seq_ non-zero with a release store. The threads calling load see the non-zero store, synchronize with it, and so can safely read value. Of course, you can't keep writing to value; it's a "one time" initialization, but this is a common pattern.
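As a sketch of the idiom just described, with both sides shown: the names ready_flag, payload, publish, wait_and_read, and publish_demo are hypothetical stand-ins (ready_flag plays the role of seq_, payload the role of value). The reader uses an acquire load here, which should be enough for this pattern; the code in the question and in the reduced example uses the stronger default, memory_order_seq_cst.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

std::atomic<std::size_t> ready_flag{0};  // hypothetical name; plays the role of seq_
std::size_t payload = 0;                 // plain, nonatomic data; plays the role of value

// Writer: initialize the data once, then publish it with a release store.
void publish(std::size_t v) {
    payload = v;
    ready_flag.store(1, std::memory_order_release);
}

// Reader: spin until the flag is non-zero, then read the data.
// The acquire load that observes 1 synchronizes with the release store,
// so the earlier write to payload is visible and there is no data race.
std::size_t wait_and_read() {
    while (ready_flag.load(std::memory_order_acquire) == 0) {
    }
    return payload;
}

// Runs one reader against one writer and returns what the reader saw.
std::size_t publish_demo() {
    std::size_t seen = 0;
    std::thread reader([&] { seen = wait_and_read(); });
    publish(42);
    reader.join();
    assert(ready_flag.load() == 1);  // the flag stays set after publication
    return seen;
}
```

The correctness of the reader depends on the read of payload staying after the load of ready_flag that observes the non-zero value, which is exactly the ordering that the hoisting discussed here breaks.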

With the above code, gcc is still hoisting the read of value:

load():
        mov     rax, QWORD PTR value[rip]
.L2:
        mov     rdx, QWORD PTR seq_[rip]
        test    rdx, rdx
        je      .L2
        mov     rdx, QWORD PTR seq_[rip]
        test    rdx, rdx
        je      .L2
        rep ret

Oops!

This behavior occurs up to gcc 7.3, but not in 8.1. Your code also compiles as you wanted in 8.1:

    mov     rbx, QWORD PTR seq_[rip]
    mov     rbp, QWORD PTR value[rip]
    mov     rax, QWORD PTR seq_[rip]
