Atomic double floating point or SSE/AVX vector load/store on x86_64


Problem description


Here (and in a few SO questions) I see that C++ doesn't support something like lock-free std::atomic<double> and can't yet support something like atomic AVX/SSE vector because it's CPU-dependent (though of the CPUs I know of today, ARM, AArch64 and x86_64 all have vector units).

But is there assembly-level support for atomic operations on doubles or vectors in x86_64? If so, which operations are supported (load, store, add, subtract, maybe multiply)? Which operations does MSVC++ 2017 implement lock-free in atomic<double>?

Solution

C++ doesn't support something like lock-free std::atomic<double>

Actually, C++11 std::atomic<double> is lock-free on typical C++ implementations, and does expose nearly everything you can do in asm for lock-free programming with float/double on x86 (e.g. load, store, and CAS are enough to implement anything: Why isn't atomic double fully implemented). Current compilers don't always compile atomic<double> efficiently, though.

C++11 std::atomic doesn't have an API for Intel's transactional-memory extensions (TSX) (for FP or integer). TSX could be a game-changer especially for FP / SIMD, since it would remove all overhead of bouncing data between xmm and integer registers. If the transaction doesn't abort, whatever you just did with double or vector loads/stores happens atomically.
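
As an illustration only, here is a minimal sketch of wrapping a 256-bit store in an RTM transaction. The helper name and structure are my own, not any standard API; it assumes you have already checked CPUID for RTM support, compile with -mavx -mrtm, and have a lock-based fallback for the abort path (TSX is disabled or absent on many recent CPUs, so the fallback is mandatory).

#include <immintrin.h>   // _xbegin / _xend / _XBEGIN_STARTED

// Hypothetical helper: attempt a 32-byte store inside an RTM transaction.
// If the transaction commits, the whole store becomes visible atomically.
static bool try_store_256_tsx(__m256d* dst, __m256d v) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        *dst = v;          // done inside the transaction
        _xend();           // commit: the 32B store is atomic to other cores
        return true;
    }
    return false;          // aborted; caller must fall back to taking a lock
}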

Some non-x86 hardware supports atomic add for float/double, and C++ p0020 is a proposal to add fetch_add and operator+= / -= template specializations to C++'s std::atomic<float> / <double>.
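
P0020 was eventually adopted for C++20, so with a new enough compiler you can write the fetch_add directly; on x86 it still compiles to a CAS retry loop rather than a single instruction. A minimal sketch:

#include <atomic>

std::atomic<double> total{0.0};

void add_sample(double x) {
    // C++20 / P0020: fetch_add is defined for the floating-point specializations.
    // On x86 this still expands to a lock cmpxchg retry loop under the hood.
    total.fetch_add(x, std::memory_order_relaxed);
}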

Hardware with LL/SC atomics instead of x86-style memory-destination instruction, such as ARM and most other RISC CPUs, can do atomic RMW operations on double and float without a CAS, but you still have to get the data from FP to integer registers because LL/SC is usually only available for integer regs, like x86's cmpxchg. However, if the hardware arbitrates LL/SC pairs to avoid/reduce livelock, it would be significantly more efficient than with a CAS loop in very-high-contention situations. If you've designed your algorithms so contention is rare, there's maybe only a small code-size difference between an LL/add/SC retry-loop for fetch_add vs. a load + add + LL/SC CAS retry loop.


x86 naturally-aligned loads and stores are atomic up to 8 bytes, even x87 or SSE. (For example movsd xmm0, [some_variable] is atomic, even in 32-bit mode). In fact, gcc uses x87 fild/fistp or SSE 8B loads/stores to implement std::atomic<int64_t> load and store in 32-bit code.

Ironically, compilers (gcc7.1, clang4.0, ICC17, MSVC CL19) do a bad job in 64-bit code (or 32-bit with SSE2 available), and bounce data through integer registers instead of just doing movsd loads/stores directly to/from xmm regs (see it on Godbolt):

#include <atomic>
std::atomic<double> ad;

void store(double x){
    ad.store(x, std::memory_order_release);
}
//  gcc7.1 -O3 -mtune=intel:
//    movq    rax, xmm0               # ALU xmm->integer
//    mov     QWORD PTR ad[rip], rax
//    ret

double load(){
    return ad.load(std::memory_order_acquire);
}
//    mov     rax, QWORD PTR ad[rip]
//    movq    xmm0, rax
//    ret

Without -mtune=intel, gcc likes to store/reload for integer->xmm. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and related bugs I reported. This is a poor choice even for -mtune=generic. AMD has high latency for movq between integer and vector regs, but it also has high latency for a store/reload. With the default -mtune=generic, load() compiles to:

//    mov     rax, QWORD PTR ad[rip]
//    mov     QWORD PTR [rsp-8], rax   # store/reload integer->xmm
//    movsd   xmm0, QWORD PTR [rsp-8]
//    ret

Moving data between xmm and integer registers brings us to the next topic:


Atomic read-modify-write (like fetch_add) is another story: there is direct support for integers with stuff like lock xadd [mem], eax (see Can num++ be atomic for 'int num'? for more details). For other things, like atomic<struct> or atomic<double>, the only option on x86 is a retry loop with cmpxchg (or TSX).

Atomic compare-and-swap (CAS) is usable as a lock-free building-block for any atomic RMW operation, up to the max hardware-supported CAS width. On x86-64, that's 16 bytes with cmpxchg16b (not available on some first-gen AMD K8, so for gcc you have to use -mcx16 or -march=whatever to enable it).

gcc makes the best asm possible for exchange():

double exchange(double x) {
    return ad.exchange(x); // seq_cst
}
    movq    rax, xmm0
    xchg    rax, QWORD PTR ad[rip]
    movq    xmm0, rax
    ret
  // in 32-bit code, compiles to a cmpxchg8b retry loop


void atomic_add1() {
    // ad += 1.0;           // not supported
    // ad.fetch_or(-0.0);   // not supported
    // have to implement the CAS loop ourselves:

    double desired, expected = ad.load(std::memory_order_relaxed);
    do {
        desired = expected + 1.0;
    } while( !ad.compare_exchange_weak(expected, desired) );  // seq_cst
}

    mov     rax, QWORD PTR ad[rip]
    movsd   xmm1, QWORD PTR .LC0[rip]
    mov     QWORD PTR [rsp-8], rax    # useless store
    movq    xmm0, rax
    mov     rax, QWORD PTR [rsp-8]    # and reload
.L8:
    addsd   xmm0, xmm1
    movq    rdx, xmm0
    lock cmpxchg    QWORD PTR ad[rip], rdx
    je      .L5
    mov     QWORD PTR [rsp-8], rax
    movsd   xmm0, QWORD PTR [rsp-8]
    jmp     .L8
.L5:
    ret

compare_exchange always does a bitwise comparison, so you don't need to worry about the fact that negative zero (-0.0) compares equal to +0.0 in IEEE semantics, or that NaN is unordered. This could be an issue if you try to check that desired == expected and skip the CAS operation, though. For new enough compilers, memcmp(&expected, &desired, sizeof(double)) == 0 might be a good way to express a bitwise comparison of FP values in C++. Just make sure you avoid false positives; false negatives will just lead to an unneeded CAS.
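
A sketch of that pattern (variable names are mine; skipping the CAS is purely an optimization, so a missed skip just costs one extra CAS):

#include <atomic>
#include <cstring>

std::atomic<double> ad;   // the same atomic<double> as in the snippets above

void scale(double factor) {
    double expected = ad.load(std::memory_order_relaxed);
    double desired;
    do {
        desired = expected * factor;
        // Bitwise compare, so -0.0 vs +0.0 and NaN don't fool the check.
        if (std::memcmp(&expected, &desired, sizeof(double)) == 0)
            return;        // nothing would change: skip the lock cmpxchg
    } while (!ad.compare_exchange_weak(expected, desired));
}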


Hardware-arbitrated lock or [mem], 1 is definitely better than having multiple threads spinning on lock cmpxchg retry loops. Every time a core gets access to the cache line but its cmpxchg fails, that's wasted throughput compared to integer memory-destination operations, which always succeed once they get their hands on a cache line.

Some special cases for IEEE floats can be implemented with integer operations. e.g. absolute value of an atomic<double> could be done with lock and [mem], rax (where RAX has all bits except the sign bit set). Or force a float / double to be negative by ORing a 1 into the sign bit. Or toggle its sign with XOR. You could even atomically increase its magnitude by 1 ulp with lock add [mem], 1. (But only if you can be sure it wasn't infinity to start with... nextafter() is an interesting function, thanks to the very cool design of IEEE754 with biased exponents that makes carry from mantissa into exponent actually work.)

There's probably no way to express this in C++ that will let compilers do it for you on targets that use IEEE FP. So if you want it, you might have to do it yourself with type-punning to atomic<uint64_t> or something, and check that FP endianness matches integer endianness, etc. etc. (Or just do it only for x86. Most other targets have LL/SC instead of memory-destination locked operations anyway.)
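
A sketch of that do-it-yourself approach, covering just the sign-bit tricks from the previous paragraph (helper names are hypothetical; assumes IEEE754 binary64 and sizeof(double) == sizeof(uint64_t)):

#include <atomic>
#include <cstdint>
#include <cstring>

// Keep the double's bits in an atomic integer so the sign-bit operations can
// become single lock and / lock or / lock xor instructions on x86 instead of
// needing a CAS loop.
std::atomic<uint64_t> dbits{0};
constexpr uint64_t SIGN_BIT = 1ull << 63;

void atomic_fabs()      { dbits.fetch_and(~SIGN_BIT, std::memory_order_relaxed); } // clear sign: |x|
void atomic_force_neg() { dbits.fetch_or(SIGN_BIT,   std::memory_order_relaxed); } // set sign bit
void atomic_flip_sign() { dbits.fetch_xor(SIGN_BIT,  std::memory_order_relaxed); } // x = -x

double atomic_read() {
    uint64_t bits = dbits.load(std::memory_order_relaxed);
    double d;
    std::memcpy(&d, &bits, sizeof d);   // bit-cast (std::bit_cast in C++20)
    return d;
}

When the return value of fetch_and / fetch_or / fetch_xor is unused, gcc and clang normally compile these to the memory-destination locked instructions rather than a cmpxchg loop.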


can't yet support something like atomic AVX/SSE vector because it's CPU-dependent

Correct. There's no way to detect when a 128b or 256b store or load is atomic all the way through the cache-coherency system. (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490). Even a system with atomic transfers between L1D and execution units can get tearing between 8B chunks when transferring cache-lines between caches over a narrow protocol. Real example: a multi-socket Opteron K10 with HyperTransport interconnects appears to have atomic 16B loads/stores within a single socket, but threads on different sockets can observe tearing.

But if you have a shared array of aligned doubles, you should be able to use vector loads/stores on them without risk of "tearing" inside any given double.
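
As a sketch only (racing on plain non-atomic doubles is formally a data race in ISO C++, so this relies on the practical per-element behaviour described here, not on any standard guarantee):

#include <immintrin.h>
#include <cstddef>

alignas(32) double shared[1024];   // written one double at a time by other threads

// One aligned 32-byte AVX load of 4 elements. In practice each individual
// double is read without tearing, but the 4 elements together are NOT one
// atomic snapshot, and the ISO C++ memory model calls this a data race.
__m256d read4(std::size_t i) {
    return _mm256_load_pd(&shared[i]);
}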

Per-element atomicity of vector load/store and gather/scatter?

I think it's safe to assume that an aligned 32B load/store is done with non-overlapping 8B or wider loads/stores, although Intel doesn't guarantee that. For unaligned ops, it's probably not safe to assume anything.

If you need a 16B atomic load, your only option is lock cmpxchg16b with desired=expected. If it succeeds, it replaces the existing value with itself. If it fails, you get the old contents. (Corner case: this "load" faults on read-only memory, so be careful which pointers you pass to a function that does this.) Also, the performance is of course horrible compared to an actual read-only load, which can leave the cache line in Shared state and isn't a full memory barrier.

16B atomic store and RMW can both use lock cmpxchg16b the obvious way. This makes pure stores much more expensive than regular vector stores, especially if the cmpxchg16b has to retry multiple times, but atomic RMW is already expensive.
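
In C++ terms both of these fall out of compare_exchange on a 16-byte type. A sketch with a hypothetical Pair struct (whether the compiler inlines lock cmpxchg16b or calls libatomic depends on the compiler version and on -mcx16):

#include <atomic>

struct alignas(16) Pair { double lo, hi; };   // hypothetical 16-byte payload
std::atomic<Pair> shared_pair;

// 16B "load": CAS with desired == expected. On success it rewrites the current
// value with itself; on failure 'expected' is updated with the current value.
// Either way we end up holding the current contents.
Pair load16() {
    Pair expected{};                                    // arbitrary guess
    shared_pair.compare_exchange_strong(expected, expected);
    return expected;
}

// Pure 16B store: CAS retry loop until our value gets in.
void store16(Pair v) {
    Pair expected = load16();
    while (!shared_pair.compare_exchange_weak(expected, v)) {
        // 'expected' now holds the value that beat us; just retry.
    }
}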

The extra instructions to move vector data to/from integer regs are not free, but also not expensive compared to lock cmpxchg16b.

# xmm0 -> rdx:rax, using SSE4
movq   rax, xmm0
pextrq rdx, xmm0, 1


# rdx:rax -> xmm0, again using SSE4
movq   xmm0, rax
pinsrq xmm0, rdx, 1
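
The same moves written with intrinsics instead of raw asm (the cast / extract / insert intrinsics below are standard Intel intrinsic names; the pextrq/pinsrq ones need SSE4.1):

#include <immintrin.h>
#include <cstdint>

// __m128d -> rdx:rax equivalent (movq + pextrq)
static void vec_to_ints(__m128d v, uint64_t& lo, uint64_t& hi) {
    __m128i bits = _mm_castpd_si128(v);            // reinterpret, no instruction
    lo = (uint64_t)_mm_cvtsi128_si64(bits);        // movq   rax, xmm0
    hi = (uint64_t)_mm_extract_epi64(bits, 1);     // pextrq rdx, xmm0, 1
}

// rdx:rax equivalent -> __m128d (movq + pinsrq)
static __m128d ints_to_vec(uint64_t lo, uint64_t hi) {
    __m128i bits = _mm_cvtsi64_si128((long long)lo);   // movq   xmm0, rax
    bits = _mm_insert_epi64(bits, (long long)hi, 1);   // pinsrq xmm0, rdx, 1
    return _mm_castsi128_pd(bits);
}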

In C++11 terms:

atomic<__m128d> would be slow even for read-only or write-only operations (using cmpxchg16b), even if implemented optimally. atomic<__m256d> can't even be lock-free.

alignas(64) atomic<double> shared_buffer[1024]; would in theory still allow auto-vectorization for code that reads or writes it, only needing to movq rax, xmm0 and then xchg or cmpxchg for atomic RMW on a double. (In 32-bit mode, cmpxchg8b would work.) You would almost certainly not get good asm from a compiler for this, though!


You can atomically update a 16B object, but atomically read the 8B halves separately. (I think this is safe with respect to memory-ordering on x86: see my reasoning at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835).

However, compilers don't provide any clean way to express this. I hacked up a union type-punning thing that works for gcc/clang: How can I implement ABA counter with c++11 CAS?. But gcc7 and later won't inline cmpxchg16b, because they're re-considering whether 16B objects should really present themselves as "lock-free". (https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html).
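
The rough shape of that union hack, as a sketch (not the exact code from the linked answer; reading the inactive union member is UB by ISO C++ rules but tolerated as an extension by gcc/clang, and the 16-byte CAS may end up as a libatomic call rather than an inlined cmpxchg16b on gcc7+):

#include <atomic>
#include <cstdint>

struct alignas(16) Node { uint64_t ptr; uint64_t seq; };   // hypothetical payload

union AbaSlot {
    std::atomic<Node> whole;                               // 16B CAS target for writers
    struct Halves { std::atomic<uint64_t> lo, hi; } half;  // independently loadable 8B halves
    AbaSlot() : whole(Node{0, 0}) {}
};

AbaSlot slot;

uint64_t read_seq() {                             // cheap 8-byte atomic load of one half
    return slot.half.hi.load(std::memory_order_acquire);
}

bool try_update(Node& expected, Node desired) {   // full 16-byte CAS for writers
    return slot.whole.compare_exchange_weak(expected, desired);
}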
