Genuinely test whether std::atomic is lock-free or not

Question

Since std::atomic::is_lock_free() may not genuinely reflect reality [ref], I'm considering writing a genuine runtime test instead. However, when I got down to it, I found that it's not the trivial task I thought it would be. I'm wondering whether there is some clever idea that could do it.

Recommended answer

Other than performance, the standard doesn't guarantee any way you can tell; that's more or less the point.
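For reference, the portable query that the question distrusts can be wrapped as below. This is only a minimal sketch: `lock_free_claim` is a hypothetical helper name, and the string it returns reports what the implementation claims, not what the generated asm actually does.

```cpp
#include <atomic>
#include <string>

// Hypothetical helper: reports what the implementation *claims* for T.
// It cannot prove the claim; only inspecting the asm can do that.
template <typename T>
std::string lock_free_claim(const std::string& type_name) {
    std::atomic<T> probe{T{}};
    return type_name + ": is_lock_free() = " +
           (probe.is_lock_free() ? "true" : "false");
}
```

Calling `lock_free_claim<int>("int")` yields a one-line report that can be logged next to the results of the runtime tests discussed in the answer.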

If you are willing to introduce some platform-specific UB, you could do something like cast an atomic<int64_t>* to a volatile int64_t* and see if you observe "tearing" when another thread reads the object. But that fails on 32-bit x86, where a lock-free int64_t is efficient with only small overhead (using SSE2 or x87), yet a volatile int64_t* will produce tearing because most compilers implement the access as two separate 4-byte stores.

If this test succeeds (i.e. the plain C++ type was naturally atomic with just volatile), that tells you any sane compiler will make it lock-free very cheaply. But if it fails, it doesn't tell you very much. A lock-free atomic for that type may be only slightly more expensive than the plain version for loads/stores, or the compiler may not make it lock-free at all.
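A minimal sketch of such a tearing test might look like the following. It simplifies the idea by using a plain volatile int64_t directly instead of casting from an atomic<int64_t>*, and the function name `observed_tearing` is hypothetical. The data race on `shared` is deliberate and is undefined behaviour in ISO C++, so the result only means something on a specific compiler and target.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Returns true if the reader ever saw a value the writer never stored.
// NOTE: the unsynchronized access to `shared` is intentional UB.
bool observed_tearing(long iterations) {
    alignas(8) volatile std::int64_t shared = 0;
    std::atomic<bool> stop{false};
    std::atomic<bool> torn{false};

    std::thread reader([&] {
        while (!stop.load(std::memory_order_relaxed)) {
            std::int64_t v = shared;  // may compile to two 4-byte loads
            if (v != 0 && v != -1) {
                torn.store(true, std::memory_order_relaxed);
            }
        }
    });

    for (long i = 0; i < iterations; ++i) {
        shared = 0;    // all-zeros pattern
        shared = -1;   // all-ones pattern
    }

    stop.store(true, std::memory_order_relaxed);
    reader.join();
    return torn.load(std::memory_order_relaxed);
}
```

A `false` result is only evidence, not proof: the reader may simply not have hit the window in which a torn value is visible.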

On any specific platform / target architecture, you can single-step your code in a debugger and see which asm instructions run (including stepping into libatomic function calls like __atomic_store_16). This is the only 100% reliable way.

(Fun fact: gcc7 with statically linked libatomic may always use locking for 16-byte objects on x86-64, because it doesn't have an opportunity to do runtime CPU detection at dynamic link time and use lock cmpxchg16b on CPUs that support it. That detection uses the same mechanism glibc uses to pick optimal memcpy / strchr implementations for the current system.)

You could portably look for a performance difference (e.g. scalability with multiple readers), but x86-64 lock cmpxchg16b doesn't scale (see footnote 1). Multiple readers contend with each other, unlike 8-byte and narrower atomic objects where pure asm loads are atomic and can be used. lock cmpxchg16b acquires exclusive access to a cache line before executing; abusing the side effect of atomically loading the old value on failure to implement .load() is much worse than an 8-byte atomic load, which compiles to just a regular load instruction.

That's part of the reason that gcc7 decided to stop returning true for is_lock_free() on 16-byte objects, as described in the GCC mailing list message about the change you're asking about.

Also note that clang on 32-bit x86 uses lock cmpxchg8b to implement std::atomic<int64_t>, just like for 16-byte objects in 64-bit mode. So you would see a lack of parallel read scaling with it, too. (https://bugs.llvm.org/show_bug.cgi?id=33109)

std::atomic<> implementations that use locking usually still don't make the object larger by including a lock byte or word in each object. Doing so would change the ABI, but lock-free vs. locking is already an ABI difference. The standard allows this, but weird hardware might need extra bytes in the object even when lock-free. Anyway, sizeof(atomic<T>) == sizeof(T) doesn't tell you anything either way. If it's larger, it's most likely that your implementation added a mutex, but you can't be sure without checking the asm.
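The size comparison can be expressed as a compile-time check. This is a sketch with a hypothetical helper name, `same_size_as_payload`; as noted above, a `true` result proves nothing about lock-freedom on its own, it only rules out a lock stored inside the object.

```cpp
#include <atomic>

// Hypothetical helper: true when std::atomic<T> carries no extra bytes,
// i.e. any lock the implementation uses must live outside the object.
template <typename T>
constexpr bool same_size_as_payload() {
    return sizeof(std::atomic<T>) == sizeof(T);
}

// Example use on a mainstream implementation (an assumption, not a guarantee
// made by the standard):
static_assert(same_size_as_payload<int>(),
              "std::atomic<int> is larger than int on this implementation");
```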

The normal mechanism is to use the address of the atomic object as a key for a global hash table of locks. Two objects aliasing / colliding and sharing the same lock causes extra contention, but is not a correctness problem. These locks are only taken/released from library functions, never while holding another such lock, so this can't create a deadlock.
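To illustrate the mechanism, here is a rough sketch of an address-keyed lock table. It is not libatomic's actual implementation: the table size, the hash, and every name in the `sketch` namespace are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>

namespace sketch {

constexpr std::size_t kLockCount = 64;  // hypothetical table size (power of two)

// Hash the object's address down to a table slot. Distinct objects may collide,
// which only adds contention.
inline std::size_t lock_index(const void* addr) {
    auto bits = reinterpret_cast<std::uintptr_t>(addr);
    return (bits >> 4) & (kLockCount - 1);
}

// One process-wide table of locks, shared by all "atomic" objects.
inline std::mutex& lock_for(const void* addr) {
    static std::mutex locks[kLockCount];
    return locks[lock_index(addr)];
}

// Each operation takes exactly one lock and releases it before returning,
// so two of these locks are never held at once and no deadlock is possible.
template <typename T>
T locked_load(const T* obj) {
    std::lock_guard<std::mutex> guard(lock_for(obj));
    return *obj;
}

template <typename T>
void locked_store(T* obj, const T& value) {
    std::lock_guard<std::mutex> guard(lock_for(obj));
    *obj = value;
}

}  // namespace sketch
```

Because the table lives in ordinary process memory, two processes that share only the page holding the object would each use their own table, which is what the cross-process test exploits.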

You could detect this by using shared memory between two different processes (so each process would have its own hash table of locks). See the related question "Is C++11 atomic<T> usable with mmap?"

  • Check that std::atomic<T> is the same size as T (so the lock isn't in the object itself).

  • Map a shared memory segment from two separate processes that don't otherwise share any of their address space. It doesn't matter if you map it to a different base address in each process.

  • Store patterns like all-ones and all-zeros from one process while reading from the other (and look for tearing). Same as what I suggested with volatile above.

  • Also test atomic increment: have each thread do 1G increments and check that the result is 2G every time. Even if pure load and pure store are naturally atomic (the tearing test), read-modify-write operations like fetch_add / operator++ need special support. See the related question "Can num++ be atomic for 'int num'?"
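The increment check from the last step could be sketched as follows. For brevity it uses two threads in one process and a caller-chosen iteration count rather than 1G; in the cross-process variant the counter would live in the shared mapping instead. The function name is hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Two threads each increment the same counter `per_thread` times.
// With a correctly atomic read-modify-write, the total is 2 * per_thread.
std::int64_t atomic_increment_total(std::int64_t per_thread) {
    std::atomic<std::int64_t> counter{0};

    auto work = [&counter, per_thread] {
        for (std::int64_t i = 0; i < per_thread; ++i) {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };

    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter.load(std::memory_order_relaxed);
}
```

Any result other than `2 * per_thread` would mean increments were lost, i.e. the read-modify-write was not atomic between the two writers.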

From the C++11 standard, the intent is that this should still be atomic for lock-free objects. It might also work for non-lock-free objects (if they embed the lock in the object), which is why you have to rule that out by checking sizeof().

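"To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state."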

If you see tearing between two processes, the object wasn't lock-free (at least not the way C++11 intended, and not the way you'd expect on normal shared-memory CPUs).
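A rough sketch of the two-process tearing test, assuming a POSIX system with `mmap`, `MAP_ANONYMOUS` and `fork` (Linux-style). Using `fork` is a shortcut: the child gets a private copy of everything except the shared mapping, so any per-process lock table is no longer shared. The function name and return convention are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns 1 if the reader saw a torn value, 0 if not, -1 on setup failure.
int cross_process_tearing(long iterations) {
    using Atom = std::atomic<std::int64_t>;
    static_assert(sizeof(Atom) == sizeof(std::int64_t),
                  "a lock may be embedded in the object; test is inconclusive");

    void* mem = mmap(nullptr, sizeof(Atom), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    Atom* shared = new (mem) Atom(0);

    pid_t child = fork();
    if (child < 0) {
        munmap(mem, sizeof(Atom));
        return -1;
    }
    if (child == 0) {
        // Child process: writer alternating all-zeros and all-ones.
        for (long i = 0; i < iterations; ++i) {
            shared->store(0, std::memory_order_relaxed);
            shared->store(-1, std::memory_order_relaxed);
        }
        _exit(0);
    }

    // Parent process: reader; polls until the child has exited.
    bool torn = false;
    int status = 0;
    while (waitpid(child, &status, WNOHANG) == 0) {
        for (int i = 0; i < 1000; ++i) {
            std::int64_t v = shared->load(std::memory_order_relaxed);
            if (v != 0 && v != -1) torn = true;
        }
    }
    munmap(mem, sizeof(Atom));
    return torn ? 1 : 0;
}
```

To test a wider type, the `Atom` alias would be changed (16-byte types may need linking against libatomic with GCC); the all-zeros / all-ones check works the same way for any trivially copyable payload whose two patterns are easy to tell apart.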

I'm not sure why address-free matters if the processes don't have to share any address space other than one page containing the atomic object (see footnote 2). (Of course, C++11 doesn't require that the implementation uses pages at all. Or maybe an implementation could put the hash table of locks at the top or bottom of each page? In that case, using a hash function that depended on address bits above the page offset would be totally silly.)

Anyway, this depends on a lot of assumptions about how computers work that are true on all normal CPUs, but which C++ doesn't make. If the implementation you care about is on a mainstream CPU like x86 or ARM under a normal OS, then this testing method should be fairly accurate and might be an alternative to just reading the asm. It's not very practical to do automatically at compile time, but unlike reading the asm, a test like this could be automated and put into a build script.

Footnote 1: 16-byte atomics on x86

No x86 hardware guarantees support for 16-byte atomic load/store with SSE instructions. In practice many modern CPUs do have atomic movaps load/store, but there are no guarantees of this in Intel/AMD manuals the way there are for 8-byte x87/MMX/SSE loads/stores on Pentium and later. And there is no way to detect which CPUs do or don't have atomic 128-bit operations (other than lock cmpxchg16b), so compiler writers can't safely use them.

See "SSE instructions: which CPUs can do atomic 16B memory operations?" for a nasty corner case: testing on K10 shows that aligned xmm load/store shows no tearing between threads on the same socket, but threads on different sockets experience rare tearing, because HyperTransport apparently only gives the minimum x86 atomicity guarantee of 8-byte objects. (I don't know whether lock cmpxchg16b is more expensive on a system like that.)

Without published guarantees from vendors, we can never be sure about weird microarchitectural corner cases, either. Lack of tearing in a simple test with one thread writing patterns and the other reading is pretty good evidence, but it's always possible that something could be different in some special case the CPU designers decided to handle differently than normal.

A pointer + counter struct where read-only access only needs the pointer can be cheap, but current compilers need union hacks to get them to do an 8-byte atomic load of just the first half of the object. See "How can I implement ABA counter with c++11 CAS?". For an ABA counter, you'd normally update it with a CAS anyway, so lack of a 16-byte atomic pure store is not a problem.

An ILP32 ABI (32-bit pointers) in 64-bit mode (like Linux's x32 ABI, or AArch64's ILP32 ABI) means pointer + integer can fit in only 8 bytes, but integer registers are still 8 bytes wide. This makes it much more efficient to use a pointer + counter atomic object than in full 64-bit mode, where a pointer is 8 bytes.

Footnote 2: address-free

I think the term "address-free" is a separate claim from not depending on any per-process state. As I understand it, it means that correctness doesn't depend on both threads using the same address for the same memory location. But if correctness also depends on them sharing the same global hash table (I don't know why storing the address of an object in the object itself would ever help), that would only matter if it was possible to have multiple addresses for the same object within the same process. That is possible on something like x86's real-mode segmentation model, where a 20-bit linear address space is addressed with 32-bit segment:offset. (Actual C implementations for 16-bit x86 exposed segmentation to the programmer; hiding it behind C's rules would be possible but not high performance.)

It's also possible with virtual memory: two mappings of the same physical page to different virtual addresses within the same process are possible but weird. They might or might not use the same lock, depending on whether the hash function uses any address bits above the page offset. (The low bits of an address, which represent the offset within a page, are the same for every mapping; i.e. virtual-to-physical translation for those bits is a no-op, which is why VIPT caches are usually designed to take advantage of that to get speed without aliasing.)
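The dependence on the hash function can be made concrete with a small sketch. The page size, table size, both hash functions, and all names in `alias_sketch` are assumptions chosen for illustration; neither hash is taken from a real library.

```cpp
#include <cstddef>
#include <cstdint>

namespace alias_sketch {

constexpr std::uintptr_t kPageSize = 4096;  // assumed page size
constexpr std::size_t kLockCount = 64;      // hypothetical lock-table size

// Uses only bits inside the page offset: every mapping of the same physical
// page picks the same lock for the same object.
constexpr std::size_t index_from_page_offset(std::uintptr_t addr) {
    return ((addr & (kPageSize - 1)) >> 4) % kLockCount;
}

// Mixes in bits above the page offset: two virtual addresses for the same
// object can end up on different locks.
constexpr std::size_t index_from_full_address(std::uintptr_t addr) {
    return ((addr >> 4) ^ (addr >> 12)) % kLockCount;
}

}  // namespace alias_sketch
```

For two hypothetical virtual addresses with the same page offset, such as 0x10001040 and 0x10205040, the first function gives the same slot while the second gives different slots, so only the first would keep a lock-based atomic correct across both mappings.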

So a non-lock-free object might be address-free within a single process, even if it uses a separate global hash table instead of adding a mutex to the atomic object. But this would be a very unusual situation; it's extremely rare to use virtual memory tricks to create two addresses for the same variable within the same process, which shares all of its address space between threads. Much more common would be atomic objects in shared memory between processes. (I may be misunderstanding the meaning of "address-free"; possibly it means "address-space free", i.e. lack of dependency on other addresses being shared.)
