如何使用c ++ 11 CAS实现ABA计数器? [英] How can I implement ABA counter with c++11 CAS?

查看:143
本文介绍了如何使用c ++ 11 CAS实现ABA计数器?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在基于此算法实现无锁队列,它使用计数器来解决ABA问题.但是我不知道如何用c ++ 11 CAS实现此计数器.例如,根据算法:

I am implementing a lock-free queue based on this algorithm, which uses a counter to solve the ABA problem. But I don't know how to implement this counter with c++11 CAS. For example, from the algorithm:

E9:    if CAS(&tail.ptr->next, next, <node, next.count+1>)

这是一个原子操作,这意味着如果tail.ptr->next等于next,让tail.ptr->next指向node,然后同时(原子上)生成next.count+1.但是,使用C ++ 11 CAS,我只能实现:

It is an atomic operation, meaning if tail.ptr->next is equal to next, let tail.ptr->next point to node and simultaneously (atomically) make next.count+1. However, using C++11 CAS, I can only implement:

std::atomic_compare_exchange_weak(&tail.ptr->next, next, node);

不能同时使next.count+1发生的

.

which cannot make next.count+1 simultaneously happen.

推荐答案

要通过单个原子操作一次原子地修改两件事,您需要将它们放在相邻的内存中,例如在两个成员的结构中.然后,您可以使用std::atomic<my_struct>来使gcc发出 lock cmpxchg16b 例如x86-64.

To atomically modify two things at once with a single atomic operation, you need to put them in adjacent memory, e.g. in a two-member struct. Then you can use std::atomic<my_struct> to get gcc to emit lock cmpxchg16b on x86-64, for example.

您不需要内联汇编,为此,值得避免一些C ++语法痛苦. https://gcc.gnu.org/wiki/DontUseInlineAsm .

You don't need inline asm for this, and it's worth a bit of C++ syntax pain to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm.

不幸的是,对于当前的编译器,您需要使用union来获得有效的代码,以便仅读取一对代码中的一个.即使只需要一个成员,对结构进行原子加载然后仅使用一个成员的明显"方式仍然会导致lock cmpxchg16b读取整个结构. (速度要慢得多,并且会使高速缓存行变脏,因此阅读器与其他阅读器竞争).我相信正常的64b指针加载仍将正确实现x86上的获取内存排序语义(以及原子性),但是即使是std::memory_order_relaxed,当前的编译器也不会进行这种优化,因此我们欺骗了他们并加入工会.

Unfortunately, with current compilers, you need to use a union to get efficient code for reading just one of the pair. The "obvious" way of doing an atomic load of the struct and then only using one member still leads to a lock cmpxchg16b to read the whole struct even though we only need one member. (Much slower, and dirties the cache line so readers contend with other readers). I'm confident that a normal 64b load of the pointer would still correctly implement the acquire memory-ordering semantics on x86 (as well as atomicity), but current compilers don't do that optimization even for std::memory_order_relaxed, so we trick them into it with a union.

(为此提交了 GCC错误80835 .TODO:相同如果这是个有用的主意的话.)

(submitted GCC bug 80835 about this. TODO: same for clang if this is a useful idea.)

清单:

  • 确保您的编译器正在生成有效的代码,以便在只读情况下仅加载一个成员,而不是该对中的lock cmpxchg16b.例如使用工会.
  • 确保您的编译器保证在编写不同的并集成员之后访问该并集的一个成员在该实现中具有明确定义的行为. 联盟类型联动在C99中是合法的(因此这在C11 stdatomic中应该可以很好地工作),但是在ISO C ++ 11中它是UB .但是,这在C ++的GNU方言中是合法的(由gcc,clang和ICC等支持).
  • 确保您的对象是16B对齐的,或者对于32位指针,是8B对齐的.更一般地,alignas(2*sizeof(void*))应该可以工作.错误对齐的lock指令在x86上可能非常慢,尤其是当它们越过缓存行边界时.如果对象未对齐,则clang3.8甚至将其编译为库调用.
  • 使用-mcx16 编译用于x86-64构建.最早的x86-64 CPU(AMD K8)不支持cmpxchg16b,但此后的所有版本都应支持.如果不使用-mcx16,则会得到一个库函数调用(可能使用全局锁).等价于32位的cmpxchg8b足够老,以至于现代编译器都支持它. (并且可以使用SSE,MMX甚至x87进行64b原子加载/存储,因此对于读取一个成员时,使用联合对于获得良好性能不太重要.)

  • Make sure your compiler is generating efficient code for loading just one member in the read-only case, not a lock cmpxchg16b of the pair. e.g. using a union.
  • Make sure your compiler guarantees that accessing one member of a union after writing a different union member has well-defined behaviour in that implementation. Union type-punning is legal in in C99 (so this should work well with C11 stdatomic), but it's UB in ISO C++11. However, it's legal in the GNU dialect of C++ (supported by gcc, clang, and ICC, among others).
  • Make sure your object is 16B-aligned, or 8B-aligned for 32-bit pointers. More generally, alignas(2*sizeof(void*)) should work. Misaligned locked instructions can be very slow on x86, especially if they cross a cache-line boundary. clang3.8 even compiles it to a library call if the object isn't aligned.
  • Compile with -mcx16 for x86-64 builds. cmpxchg16b was not supported by the earliest x86-64 CPUs (AMD K8), but should be on everything after that. Without -mcx16, you get a library function call (which probably uses a global lock). The 32-bit equivalent, cmpxchg8b, is old enough that modern compilers assume its support. (And can use SSE, MMX, or even x87 to do 64b atomic loads/stores, so using a union is somewhat less important for good performance when reading one member).

确保指针+ uintptr_t原子对象是无锁的.对于x32和32位ABI(8B对象),几乎可以保证这一点,但对于16B对象则不能保证.例如MSVC使用x86-64的锁.

Make sure that a pointer+uintptr_t atomic object is lock-free. This is pretty much guaranteed for x32 and 32-bit ABIs (8B object), but not for 16B objects. e.g. MSVC uses a lock for x86-64.

gcc7及更高版本将调用libatomic而不是内联lock cmpxchg16b,并且将从atomic_is_lock_free返回false(

gcc7 and later will call libatomic instead of inlining lock cmpxchg16b, and will return false from atomic_is_lock_free (for reasons including that it's so slow it's not what users expect is_lock_free to mean), but at least for now the libatomic implementation still uses lock cmpxchg16b on targets where that instruction is available. (It can even segfault for read-only atomic objects, so it's really not ideal. More importantly, readers contend with other readers for exclusive access to the cache line. That's why we're going to so much trouble to avoid lock cmpxchg16b for the read side here when we only want one 8-byte half.)

这是一个带有CAS重试循环的代码示例,该重试循环可编译为外观正确的asm,并且我认为对于允许联合类型修剪的实现,它不含UB或其他不安全的C ++.它是用C风格(非成员函数等)编写的,但是如果您编写成员函数,它将是相同的.

Here's an example of code with a CAS retry-loop that compiles to asm that looks right, and I think is free of UB or other unsafe C++ for implementations that allow union type punning. It's written in C style (non-member functions, etc.) but it would be the same if you wrote member functions.

请参阅与ASM输出的

See the code with asm output from gcc6.3 on the Godbolt compiler explorer. With -m32, it uses cmpxchg8b the same way 64-bit code uses cmpxchg16b. With -mx32 (32-bit pointers in long mode) it can simply use a 64-bit cmpxchg, and normal 64-bit integer loads to grab both members in one atomic load.

这是可移植的C ++ 11(联合类型绑定除外),没有特定于x86的内容.不过,对目标来说,只有有效可以CAS一个大小为两个指针的对象.正如在Godbolt上看到的那样,它可以编译为对__atomic_compare_exchange_16库函数的调用,该函数用于ARM/ARM64和MIPS64.

This is portable C++11 (except the union type-punning), with nothing x86-specific. It's only efficient on targets that can CAS an object the size of two pointers, though. e.g. it compiles to a call to a __atomic_compare_exchange_16 library function for ARM / ARM64 and MIPS64, as you can see on Godbolt.

它不能在MSVC上编译,其中atomic<counted_ptr>大于counted_ptr_separate,因此static_assert会捕获它.大概MSVC在原子对象中包括一个锁定成员.

It doesn't compile on MSVC, where atomic<counted_ptr> is larger than counted_ptr_separate, so the static_assert catches it. Presumably MSVC includes a lock member in the atomic object.

#include <atomic>
#include <stdint.h>

using namespace std;

struct node {
  // This alignas is essential for clang to use cmpxchg16b instead of a function call
  // Apparently just having it on the union member isn't enough.
  struct alignas(2*sizeof(node*)) counted_ptr {
    node *    ptr;
    uintptr_t count;  // use pointer-sized integers to avoid padding
  };

  // hack to allow reading just the pointer without lock-cmpxchg16b,
  // but still without any C++ data race
  struct counted_ptr_separate {
    atomic<node *>    ptr;
    atomic<uintptr_t> count_separate;  // var name emphasizes that accessing this way isn't atomic with ptr
  };

  static_assert(sizeof(atomic<counted_ptr>) == sizeof(counted_ptr_separate), "atomic<counted_ptr> isn't the same size as the separate version; union type-punning will be bogus");
  //static_assert(std::atomic<counted_ptr>{}.is_lock_free());

  union {  // anonymous union: the members are directly part of struct node
    alignas(2*sizeof(node*)) atomic<counted_ptr> next_and_count;
    counted_ptr_separate  next;
  };
  // TODO: write member functions to read next.ptr or read/write next_and_count

  int data[4];
};


// make sure read-only access is efficient.
node *follow(node *p) {   // good asm, just a mov load
  return p->next.ptr.load(memory_order_acquire);
}
node *follow_nounion(node *p) {  // really bad asm, using cmpxchg16b to load the whole thing
  return p->next_and_count.load(memory_order_acquire).ptr;
}


void update_next(node &target, node *desired)
{
  // read the old value efficiently to avoid overhead for the no-contention case
  // tearing (or stale data from a relaxed load) will just lead to a retry
  node::counted_ptr expected = {
      target.next.ptr.load(memory_order_relaxed),
      target.next.count_separate.load(memory_order_relaxed) };

  bool success;
  do {
    node::counted_ptr newval = { desired, expected.count + 1 };
    // x86-64: compiles to cmpxchg16b
    success = target.next_and_count.compare_exchange_weak(
                           expected, newval, memory_order_acq_rel);
    // updates exected on failure
  } while( !success );
}

从clang 4.0 -O3 -mcx16输出的asm为:

The asm output from clang 4.0 -O3 -mcx16 is:

update_next(node&, node*):
    push    rbx             # cmpxchg16b uses rbx implicitly so it has to be saved/restored
    mov     rbx, rsi
    mov     rax, qword ptr [rdi]     # load the pointer
    mov     rdx, qword ptr [rdi + 8] # load the counter
.LBB2_1:                        # =>This Inner Loop Header: Depth=1
    lea     rcx, [rdx + 1]
    lock
    cmpxchg16b      xmmword ptr [rdi]
    jne     .LBB2_1
    pop     rbx
    ret

gcc进行了一些笨拙的存储/重载,但逻辑基本相同.

gcc does some clunky store/reloads, but is basically the same logic.

follow(node*)会编译为mov rax, [rdi]/ret,因此,由于受到了联合黑客的攻击,对指针的只读访问应有的便宜.

follow(node*) compiles to mov rax, [rdi] / ret, so read-only access to the pointer is as cheap as it should be, thanks to the union hack.

它的确依赖于通过一个成员编写一个联合并通过另一个成员读取一个联合,从而仅使用该指针即可高效地读取指针,而无需使用lock cmpxchg16b .保证可以在GNU C ++(和ISO C99/C11)中工作,但不能在ISO C ++中工作.许多其他C ++编译器确实保证联合类型处理是有效的,但是即使没有这样做,它也可能仍然有效:我们一直在使用std::atomic加载,这些加载必须假定该值是异步修改的.因此,我们应该避免出现类似别名的问题,即在通过另一个指针(或并集成员)写入值后,仍然认为寄存器中的值仍然有效.但是,编译器认为编译器认为独立的事物可能会出现问题.

It does depend on writing a union through one member and reading it through another, to do efficient reads of just the pointer without using a lock cmpxchg16b. This is guaranteed to work in GNU C++ (and ISO C99/C11), but not in ISO C++. Many other C++ compilers do guarantee that union type-punning works, but even without that it would probably still work: We're always using std::atomic loads which have to assume that the value was modified asynchronously. So we should be immune from aliasing-like problems where values in registers are still considered live after writing the value through another pointer (or union member). Compile-time reordering of things the compiler thinks are independent could be a problem, though.

以原子方式在指针+计数器的原子cmpxchg之后仅读取指针仍应为您提供x86上的获取/释放语义,但是我不认为ISO C ++对此有任何说明.我们猜想,广泛的发行版存储区(作为compare_exchange_weak的一部分)将与大多数体系结构(如x86上的地址)的相同地址上的较窄负载同步,但是AFAIK C ++ std::atomic不保证有关类型- unning.

Atomically reading just the pointer after an atomic cmpxchg of the pointer+counter should still give you acquire/release semantics on x86, but I don't think ISO C++ says anything about it. I would guess that a wide release-store (as part of the compare_exchange_weak will synchronize with a narrower load from the same address on most architectures (like it does on x86), but AFAIK C++ std::atomic doesn't guarantee anything about type-punning.

与指针+ ABA计数器无关,但可能会出现在使用联合以允许访问较大原子对象子集的其他应用程序中:请勿使用联合允许原子存储仅包含指针或只是柜台.如果您关心与该对的获取负载同步,至少不会.甚至排序有序的x86也可以

Not relevant for pointer + ABA-counter, but could come up for other applications of using a union to allow access to subsets of larger atomic object: Don't use the union to allow atomic stores to just the pointer or just the counter. At least not if you care about synchronization with an acquire-load of the pair. Even strongly-ordered x86 can reorder a narrow store with a wider load that fully contains it. Everything is still atomic, but you get into weird territory as far as memory-ordering goes.

在x86-64上,16B原子加载需要一个lock cmpxchg16b(这是一个完整的内存屏障,可防止前面的窄存储在其后变得全局可见).但是,如果将其与32位指针(或32位数组索引)一起使用,则可能会很容易出现问题,因为两个半部都可以正常加载64b加载.而且我不知道如果您需要与其他线程同步而不仅仅是原子性,那么在其他体系结构上会遇到什么问题.

On x86-64, an atomic 16B load requires a lock cmpxchg16b (which is a full memory barrier, preventing a preceding narrow store from becoming globally visible after it). But you could easily have a problem if you were using this with 32-bit pointers (or 32-bit array indices), since both halves could be loaded with a regular 64b load. And I have no idea what problems you could see on other architectures, if you need synchronization with other threads and not just atomicity.

要详细了解std :: memory_order的获取和发布,请参见 查看全文

登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆