Does STLR(B) provide sequential consistency on ARM64?

Question

For stores performed on an object of an atomic data type (say, std::atomic<uint8_t>), GCC generates:

  • a MOV instruction for a release store (std::memory_order_release),
  • an XCHG instruction for a sequentially consistent store (std::memory_order_seq_cst),

when the target architecture is x86_64. However, when the target is ARM64 (AArch64), GCC generates the same instruction in both cases, namely STLRB. No other instructions (such as a memory barrier) are generated, and the same happens with Clang. Does this mean that this instruction, described as having store-release semantics, actually provides sequential consistency as well?
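The two kinds of store being compared can be written as the following minimal sketch. The names `flag`, `store_release`, `store_seq_cst` and `load_flag` are made up for illustration, and the instructions named in the comments are the ones reported in this question, not something the snippet itself verifies.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical shared variable, used only for illustration.
std::atomic<std::uint8_t> flag{0};

// Release store.
// Reported codegen: MOV on x86_64, STLRB on AArch64.
void store_release(std::uint8_t v) {
    flag.store(v, std::memory_order_release);
}

// Sequentially consistent store (the default memory order).
// Reported codegen: XCHG on x86_64, STLRB on AArch64.
void store_seq_cst(std::uint8_t v) {
    flag.store(v, std::memory_order_seq_cst);
}

// Helper that reads the value back.
std::uint8_t load_flag() {
    return flag.load(std::memory_order_seq_cst);
}
```

Compiling this for both targets with optimization enabled (as in the live demo linked below) should make it possible to compare the generated instructions side by side.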

For instance, if two threads running on two cores perform stores with STLRB to different memory locations, is the order of these two stores unique, such that all other threads are guaranteed to observe the same order?

I am asking because, according to this answer, with acquire loads, different threads may observe a different order of release stores. To observe the same order, sequential consistency is needed instead.

Live demo: https://godbolt.org/z/hajMKnd53

Recommended answer

Yup, stlr is store-release on its own, and ldar can't pass an earlier stlr (i.e. no StoreLoad reordering) - that interaction between them satisfies the part of the seq_cst requirements that acq / rel doesn't have. (ARMv8.3 ldapr is like ldar without that interaction, being only a plain acquire load, allowing more efficient acq_rel.)
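The "no StoreLoad reordering" requirement is what the classic store-buffer litmus test checks. Below is a minimal sketch of that test using seq_cst operations; the function name `count_both_zero` and the iteration count are assumptions made for illustration. With seq_cst, the C++ memory model forbids the outcome where both loads read 0. If the operations were only release stores and acquire loads, C++ would allow that outcome.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffer litmus test `iterations` times and returns how many
// runs ended with both threads reading 0 (the StoreLoad-reordered outcome).
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        // Thread 1: store x, then load y.
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        // Thread 2: store y, then load x.
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();

        // Forbidden under seq_cst: each load would have to pass the
        // earlier store in its own thread.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

A litmus loop like this can only demonstrate that a reordering happens, never prove that it cannot; here the guarantee comes from the memory model, and the loop merely illustrates it.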

So on ARMv8.3, the difference between seq_cst and acq / rel is on the load side. ARMv8 before 8.3 can't do acq / rel while still allowing StoreLoad reordering, so it's unfortunately slow if you acquire-load something else after a release-store. ARMv8.3 fixes that, making acq / rel as efficient as on x86.

On x86, everything is an acquire load or release store (so acq_rel is free), and the least-bad way to achieve sequential consistency is by doing a full barrier on seq_cst stores. (You want atomic loads to be cheap, and it's going to be common for code to use the default seq_cst memory order.)

("C/C++11 mappings to processors" discusses that tradeoff of wanting cheap loads, if you have to pick either load or store to attach the full barrier to.)

Separately, the IRIW litmus test (all threads agreeing on the order of independent stores) is guaranteed by the ARMv8 memory model even for release stores. The model is guaranteed to be "multicopy-atomic", which means that when a store becomes visible to any other core, it becomes visible to all other cores at the same time. This is sufficient for all cores to agree on a total order for all stores, to the limits of anything they can observe via two acquire loads.
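For reference, the IRIW (independent reads of independent writes) litmus test can be sketched as below. The function name `count_iriw_disagreements` and its parameters are made up for illustration. Two writers store to different locations, and two readers load both locations in opposite orders; a "disagreement" is a run in which each reader sees one store but not the other, in opposite ways. With seq_cst the C++ model forbids that outcome. With release stores and acquire loads the C++ model allows it on paper, even though, per the answer, ARMv8 hardware does not produce it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs IRIW `iterations` times with the given memory orders and returns how
// many runs had the two readers disagree on the order of the two stores.
// `load_mo` must be a valid order for loads (relaxed, acquire or seq_cst).
int count_iriw_disagreements(int iterations,
                             std::memory_order store_mo,
                             std::memory_order load_mo) {
    int disagreements = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

        std::thread w1([&] { x.store(1, store_mo); });  // independent writer 1
        std::thread w2([&] { y.store(1, store_mo); });  // independent writer 2
        std::thread ra([&] {                            // reader A: x then y
            r1 = x.load(load_mo);
            r2 = y.load(load_mo);
        });
        std::thread rb([&] {                            // reader B: y then x
            r3 = y.load(load_mo);
            r4 = x.load(load_mo);
        });
        w1.join(); w2.join(); ra.join(); rb.join();

        // Reader A saw x before y, while reader B saw y before x.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) ++disagreements;
    }
    return disagreements;
}
```

Calling it with `std::memory_order_release` and `std::memory_order_acquire` gives the variant the question asks about; a zero count from that variant is only an observation about the machine it ran on, not a guarantee from the language.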

In practical terms, that means stores only become visible by committing to L1d cache, which is coherent, and not, for example, by store-forwarding between logical cores sharing a physical core, which is the mechanism for IRIW reordering on the few POWER CPUs that can produce the effect in real life. ARMv8 originally allowed that on paper, but no ARM CPUs ever did. The memory model was strengthened to simply guarantee that no future CPU would be weird like that. See "Simplifying ARM Concurrency: Multicopy-Atomic Axiomatic and Operational Models for ARMv8" for details.

Note that this guarantee of all threads being able to agree on an order applies to all stores on ARM64, including relaxed. (In a machine with coherent shared memory, there are very few HW mechanisms that can create IRIW reordering, so it's only on rare ISAs that seq_cst has to actually do anything specific to prevent it.)

x86's TSO (Total Store Order) memory model has the required property right in the name. And yes, it's much stronger, basically program order plus a store buffer with store-forwarding. (So this allows StoreLoad reordering, and for a core to see its own stores before they're globally visible, but nothing else. Ignoring NT stores, and NT loads from WC memory such as video RAM...)
