Why don't compilers merge redundant std::atomic writes?


Question

I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:

#include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
  y.store(1, order);
  y.store(1, order);
}

Every compiler I've tried will issue the above write three times. What legitimate, race-free observer could tell the difference between the above code and an optimized version with a single write (i.e. doesn't the "as-if" rule apply)?

If the variable had been volatile, then obviously no optimization is applicable. What's preventing it in my case?

Here's the code in compiler explorer.

Solution


    The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:

      y.store(1, order);
      y.store(2, order);
      y.store(3, order); // inlining + constant-folding could produce this in real code
    

    The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data race bug, but only the garden-variety bug kind of race, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables). A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)
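    The kind of spinning observer this paragraph describes can be sketched as follows (a minimal illustration with made-up names; `writer` plays the role of the three-store sequence above):

```cpp
#include <atomic>
#include <vector>

std::atomic<int> y(0);

// The three relaxed stores: a compiler could legally collapse these
// into a single y.store(3, std::memory_order_relaxed).
void writer() {
    y.store(1, std::memory_order_relaxed);
    y.store(2, std::memory_order_relaxed);
    y.store(3, std::memory_order_relaxed);
}

// Spins until y reaches 3, recording every value seen on the way.
// The standard permits, but never guarantees, that 2 appears in `seen`,
// so a program that *relies* on observing 2 has a garden-variety race
// bug, not C++ Undefined Behaviour.
std::vector<int> observe() {
    std::vector<int> seen;
    for (int v; (v = y.load(std::memory_order_relaxed)) != 3; )
        seen.push_back(v);
    return seen;
}
```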

    Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.

    It doesn't depend on the target architecture or hardware; just like compile-time reordering of relaxed atomic operations are allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile into zero asm instructions.
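    For example (a minimal sketch, with illustrative names): on strongly-ordered x86, both fences below typically assemble to zero instructions, yet the C++ abstract machine still requires them to stop compile-time reordering.

```cpp
#include <atomic>

std::atomic<int> data(0);
std::atomic<int> ready(0);

// Producer: the release fence orders the data store before the flag store,
// even though relaxed stores alone carry no ordering.
void producer() {
    data.store(42, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release); // zero asm on x86
    ready.store(1, std::memory_order_relaxed);
}

// Consumer: once ready == 1, the acquire fence guarantees the data store
// is visible.
int consume() {
    while (ready.load(std::memory_order_relaxed) != 1) { }
    std::atomic_thread_fence(std::memory_order_acquire); // also free on x86
    return data.load(std::memory_order_relaxed);
}
```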


    So why don't compilers do this optimization?

    It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.

    The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
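    A hypothetical sketch of that pattern (names are illustrative): a worker publishes progress after each unit of work, and another thread polls it to draw the bar. If a compiler sank the store out of the loop, the poller would see 0 for the whole run and then 100 at the very end.

```cpp
#include <atomic>

std::atomic<int> progress(0);

// Worker: publishes percentage complete after each unit of work.
// Coalescing these stores into one final progress.store(100) would be
// legal under the as-if rule, but would break the UI's polling.
void do_work(int total_units) {
    for (int i = 1; i <= total_units; ++i) {
        // ... do one unit of work ...
        progress.store(i * 100 / total_units, std::memory_order_relaxed);
    }
}
```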

    There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)

    Compiler-writers have correctly noticed that programmers expect that an atomic store will actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) i.e. It violates the principle of least surprise.

    However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
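    A sketch of that situation (hypothetical names): passing a shared_ptr by value inside a loop performs an atomic ref-count increment and decrement on every iteration, which an atomics-optimizing compiler could hoist out.

```cpp
#include <memory>

// Takes the shared_ptr by value: each call copy-constructs (atomic
// ref-count increment) and destructs (atomic decrement) the argument.
int use(std::shared_ptr<int> p) { return *p; }

// n redundant inc/dec pairs today; a compiler allowed to optimize
// atomics could keep the count stable across the whole loop.
int sum(const std::shared_ptr<int>& p, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += use(p);
    return total;
}
```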

    Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num-- would still have to be a full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.


    Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:

  • http://wg21.link/n4455 : N4455 No Sane Compiler Would Optimize Atomics
  • http://wg21.link/p0062 : WG21/P0062R1: When should compilers optimize atomics?

    See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)


    Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.

    Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).

    Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.
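    Applied to the question's example, that would look like this (std::atomic's member functions have volatile-qualified overloads, so the stores compile as written):

```cpp
#include <atomic>

// volatile accesses are observable behaviour, so all three stores must
// actually be performed, even though they write the same value; a plain
// atomic<int> would permit merging them into one.
volatile std::atomic<int> y(0);

void f() {
    y.store(1, std::memory_order_relaxed);
    y.store(1, std::memory_order_relaxed);
    y.store(1, std::memory_order_relaxed);
}
```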

    I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].

    wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:

    if(x) {
        foo();
        y.store(0);
    } else {
        bar();
        y.store(0);  // release a lock before a long-running loop
        for() {...} // loop contains no atomics or volatiles
    }
    // A compiler can merge the stores into a y.store(0) here.
    

    Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.

    volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.


    Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). This is not sufficient, because N4455 says LLVM already implements or could easily implement several of the optimizations it mentioned.

    The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.

    Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
