Loads and stores reordering on ARM

Question

I'm not an ARM expert, but won't these stores and loads be subject to reordering on at least some ARM architectures?

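  #include <atomic>
  using namespace std;

  atomic<int> atomic_var;
  int nonAtomic_var;
  int nonAtomic_var2;

  void foo()
  {
          // Two relaxed stores to the same atomic object, in program order.
          atomic_var.store(111, memory_order_relaxed);
          atomic_var.store(222, memory_order_relaxed);
  }

  void bar()
  {
          // Two relaxed loads from the same atomic object, in program order.
          nonAtomic_var = atomic_var.load(memory_order_relaxed);
          nonAtomic_var2 = atomic_var.load(memory_order_relaxed);
  }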

I've had no success in making the compiler put memory barriers between them.

I tried the following (on an x64 host):

$ arm-linux-gnueabi-g++ -mcpu=cortex-a9 -std=c++11 -S -O1 test.cpp

I got:

_Z3foov:
          .fnstart
  .LFB331:
          @ args = 0, pretend = 0, frame = 0
          @ frame_needed = 0, uses_anonymous_args = 0
          @ link register save eliminated.
          movw    r3, #:lower16:.LANCHOR0
          movt    r3, #:upper16:.LANCHOR0
          mov     r2, #111
          str     r2, [r3]
          mov     r2, #222
          str     r2, [r3]
          bx      lr
          ;...
  _Z3barv:
          .fnstart
  .LFB332:
          @ args = 0, pretend = 0, frame = 0
          @ frame_needed = 0, uses_anonymous_args = 0
          @ link register save eliminated.
          movw    r3, #:lower16:.LANCHOR0
          movt    r3, #:upper16:.LANCHOR0
          ldr     r2, [r3]
          str     r2, [r3, #4]
          ldr     r2, [r3]
          str     r2, [r3, #8]
          bx      lr

Are loads and stores to the same location never reordered on ARM? I couldn't find such a restriction in the ARM docs.

I'm asking with regard to the C++11 standard, which states:

All modifications to any particular atomic variable occur in a total order that is specific to this one atomic variable.

Recommended answer

The total order for a single variable exists because of cache coherency (MESI): a store can't commit from the store buffer into L1d cache and become globally visible to other threads unless the core owns exclusive access to that cache line. (MESI Exclusive or Modified state.)

That C++ guarantee doesn't require any barriers to implement on any normal CPU architecture, because all normal ISAs have coherent caches, normally using a variant of MESI. This is why volatile happens to work as a legacy / UB version of mo_relaxed atomic on mainstream C++ implementations (but generally, don't do it). See also "When to use volatile with multi threading?" for more details.

(Some systems exist with two different kinds of CPU that share memory, e.g. microcontroller + DSP, but C++ std::thread won't start threads across cores that don't share a coherent view of that memory. So compilers only have to do code-gen for ARM cores in the same inner-shared coherency domain.)

For any given atomic object, a total order of modification by all threads will always exist (as guaranteed by the ISO C++ standard you quoted), but you don't know ahead of time what it's going to be unless you establish synchronization between threads.

For example, different runs of this program could have both loads go first, or one load, then both stores, then the other load.

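This total order (for a single variable) will be compatible with each thread's program order, but is otherwise an arbitrary interleaving of those program orders.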
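As a rough illustration of which outcomes are allowed, the sketch below runs the question's two functions in two threads and checks each result against the only possible modification order, 0 -> 111 -> 222. The helper names (rank_of, run_once, coherence_holds) are made up for this example, and passing such a test on one machine is of course not a proof; the guarantee comes from the standard, not from the experiment.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Position of a value in the modification order 0 -> 111 -> 222.
// Hypothetical helper, introduced only for this sketch.
int rank_of(int v) {
    assert(v == 0 || v == 111 || v == 222);
    return v == 0 ? 0 : (v == 111 ? 1 : 2);
}

// Runs the question's foo() and bar() bodies concurrently once and reports
// whether what the reader saw is consistent with the modification order.
bool run_once() {
    std::atomic<int> atomic_var{0};
    int first = 0, second = 0;

    std::thread writer([&] {
        atomic_var.store(111, std::memory_order_relaxed);
        atomic_var.store(222, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        first  = atomic_var.load(std::memory_order_relaxed);
        second = atomic_var.load(std::memory_order_relaxed);
    });
    writer.join();
    reader.join();

    // Once both threads have finished, the last store in program order wins.
    bool final_ok = atomic_var.load(std::memory_order_relaxed) == 222;
    // The second load must never see an older value than the first load.
    return final_ok && rank_of(second) >= rank_of(first);
}

// Repeats the experiment; returns true if no run violated coherence.
bool coherence_holds(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        if (!run_once()) return false;
    }
    return true;
}
```

Calling coherence_holds(1000) from main should return true on any conforming implementation. Which of the allowed pairs, such as (0, 0), (0, 222) or (111, 222), actually shows up varies from run to run.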

memory_order_relaxed only makes operations on that variable atomic; it provides no ordering with respect to anything else. The only ordering that's fixed at compile time is with respect to other accesses to the same atomic variable by this thread.
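To make "no ordering with respect to anything else" concrete, here is a minimal sketch of the usual way to order a plain variable against an atomic flag: a release store paired with an acquire load. The names payload, ready and publish_and_consume are invented for this example. With memory_order_relaxed on both sides, the read of payload would be a data race and 42 would not be guaranteed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the payload value the consumer read after it saw ready == true.
int publish_and_consume() {
    int payload = 0;                    // plain, non-atomic data
    std::atomic<bool> ready{false};     // flag that publishes the data
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // release: the write to payload above becomes visible to any
        // thread whose acquire load reads this store.
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // acquire: pairs with the release store in the producer.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the flag load, so this reads 42
    });
    producer.join();
    consumer.join();
    return seen;
}
```

On ARM this is the case where the compiler does have to emit barriers or acquire/release instructions, unlike the same-variable relaxed accesses in the question.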

Different threads will agree on the modification order for this variable, but might disagree on the global modification order for all objects. ARMv8 made the ARM memory model multi-copy-atomic, so this is impossible there (and probably no real earlier ARM violated that), but POWER does in real life allow two independent reader threads to disagree on the order of stores by two other independent writer threads. This is called IRIW reordering. See "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?"

The fact that IRIW reordering is possible when multiple variables are involved is (among other things) why it even needs to be said that a total modification order always exists for each individual variable separately.

For an all-thread total order to exist, you need all your atomic accesses to use seq_cst, which would involve barriers. Of course, that still wouldn't fully determine at compile time what that order will be; different timings on different runs will lead to acquire loads seeing a certain store or not.
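The IRIW shape can be sketched as a small litmus test. The version below uses seq_cst everywhere, so the two readers are never allowed to see the two independent stores in opposite orders. The function names are hypothetical, and because thread start-up dominates the timing this is a weak test that illustrates the pattern rather than stressing the hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One IRIW round: two independent writers, two independent readers.
// Returns true if the readers saw the two stores in opposite orders.
bool readers_disagree_once() {
    std::atomic<int> x{0}, y{0};
    int r1x = 0, r1y = 0, r2y = 0, r2x = 0;

    std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
    std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
    std::thread rd1([&] {
        r1x = x.load(std::memory_order_seq_cst);
        r1y = y.load(std::memory_order_seq_cst);
    });
    std::thread rd2([&] {
        r2y = y.load(std::memory_order_seq_cst);
        r2x = x.load(std::memory_order_seq_cst);
    });
    w1.join();
    w2.join();
    rd1.join();
    rd2.join();

    // Reader 1 saw x's store but not y's, while reader 2 saw y's store but
    // not x's. A single total order of seq_cst operations forbids this.
    return r1x == 1 && r1y == 0 && r2y == 1 && r2x == 0;
}

// Counts how many rounds showed the forbidden outcome.
int count_disagreements(int iterations) {
    int count = 0;
    for (int i = 0; i < iterations; ++i) {
        if (readers_disagree_once()) ++count;
    }
    return count;
}
```

With seq_cst, count_disagreements must return 0. If every access were weakened to acquire/release or relaxed, the C++ model would permit a nonzero count, although multi-copy-atomic hardware such as ARMv8 or x86 would not produce one in practice.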

Are loads and stores to the same location never reordered on ARM?

From within a single thread, no. If you do multiple stores to a memory location, the last one in program order will always appear as the last to other threads, i.e. once the dust settles, the memory location will have the value stored by the last store. Anything else would break the illusion of program order for threads reloading their own stores.

Some of the ordering guarantees in the C++ standard are even called "write-write coherence" and other kinds of coherence. ISO C++ doesn't explicitly require coherent caches (an implementation on an ISA that needs explicit flushing is possible), but such an implementation would not be efficient.

http://eel.is/c++draft/intro.races#19

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]

Most of the above is about modification order, not LoadLoad reordering.

That is a separate thing. C++ guarantees read-read coherence, i.e. that two reads of the same atomic object by the same thread happen in program order relative to each other.

http://eel.is/c++draft/intro.races#16

If a value computation A of an atomic object M happens before a value computation B of M, and A takes its value from a side effect X on M, then the value computed by B shall either be the value stored by X or the value stored by a side effect Y on M, where Y follows X in the modification order of M. [ Note: This requirement is known as read-read coherence. — end note ]

A "value computation" is a read (a load) of a variable. The final clause of the quote, which restricts B to the value stored by X or by some Y later in M's modification order, is the part that guarantees that later reads in the same thread can't observe earlier writes from other threads (earlier than a write they already saw).

That's one of the four coherence requirements that the note quoted earlier refers to.

The fact that compilers compile it to two plain ARM loads is proof enough that the ARM ISA also guarantees this. (Because we know for sure that ISO C++ requires it.)

I'm not familiar with the ARM manuals, but presumably the guarantee is in there somewhere.

See also "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models", a paper that goes into significant detail about which reorderings are and aren't allowed for various test cases.
