原子操作成本 [英] atomic operation cost

查看:103
本文介绍了原子操作成本的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

原子操作(比较和交换或原子加/减的任何操作)的成本是多少?它消耗多少个周期?它会暂停SMP或NUMA上的其他处理器,还是会阻止内存访问? 会刷新乱序CPU中的重排序缓冲区吗?

What is the cost of the atomic operation (any of compare-and-swap or atomic add/decrement)? How much cycles does it consume? Will it pause other processors on SMP or NUMA, or will it block memory accesses? Will it flush reorder buffer in out-of-order CPU?

对缓存有什么影响?

我对现代流行的CPU感兴趣:x86,x86_64,PowerPC,SPARC,Itanium.

I'm interested in modern, popular CPUs: x86, x86_64, PowerPC, SPARC, Itanium.

推荐答案

我过去几天一直在寻找实际数据,却一无所获. 但是,我进行了一些研究,将原子操作的成本与缓存未命中的成本进行了比较.

I have looked for actual data for the past days, and found nothing. However, I did some research, which compares the cost of atomic ops with the costs of cache misses.

在奔腾Pro之前(如文档中所述),x86 LOCK前缀(包括原子CAS的lock cmpxchg)的成本是内存访问(如高速缓存未命中),+停止其他处理器的内存操作, +与试图锁定总线的其他处理器的任何争用.但是,由于PentiumPro,对于普通的Writeback可缓存内存(除非您直接与硬件进行交谈,否则应用程序处理的所有内存)都不会阻塞所有内存操作,而是仅阻塞相关的缓存行(基于@osgx的答案).

The cost of the x86 LOCK prefix, (including lock cmpxchg for atomic CAS), before PentiumPro (as described in the doc), is a memory access (like a cache miss), + stopping memory operations by other processors, + any contention with other processors trying to LOCK the bus. However, since PentiumPro, for normal Writeback cacheable memory (all memory an app deals with, unless you talk directly with hardware), instead of blocking all memory operations, only the relevant cache line is blocked (based on the link in @osgx's answer).

即核心会延迟应答线路的MESI份额和RFO请求,直到完成实际lock ed操作的存储部分之后.这被称为高速缓存锁",并且仅影响该一个高速缓存行.其他内核可以同时加载/存储,甚至可以对其他行进行CASing.

i.e. the core delays answering MESI share and RFO requests for the line until after the store part of the actual locked operation. This is called a "cache lock", and only affects that one cache line. Other cores can be loading / storing or even CASing other lines at the same time.

实际上,如

Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer. (At least for the normal use-case where you do a pure load before the actual CAS.)

在讨论过多细节之前,我先说一下LOCKed操作的代价是一次缓存未命中+与同一缓存行上其他处理器的可能争用,而CAS +先前的负载(除互斥锁外几乎总是需要,您始终将CAS 0和1)置于两个高速缓存未命中的位置.

Before going into too much detail, I'll say that a LOCKed operation costs one cache miss + the possible contention with other processor on the same cacheline, while CAS + the preceding load (which is almost always required except on mutexes, where you always CAS 0 and 1) can cost two cache misses.

他解释说,单个位置上的负载+ CAS实际上可能会导致两次高速缓存未命中,例如负载链接/存储条件(请参阅后者).他的解释依赖于 MESI缓存一致性协议的知识.它对高速缓存行使用4种状态:M(已修改),E(已包含),S(已hared),I(无效)(因此称为MESI),下面在需要时进行了说明.解释了以下情况:

He explains that a load + CAS on a single location can actually cost two cache misses, like Load-Linked/Store-Conditional (see there for the latter). His explaination relies on knowledge of the MESI cache coherence protocol. It uses 4 states for a cacheline: M(odified), E(xclusive), S(hared), I(nvalid) (and therefore it's called MESI), explained below where needed. The scenario, explained, is the following:

  • LOAD导致高速缓存未命中-在共享状态下从内存中加载了相关的高速缓存行(即,仍然允许其他处理器将该高速缓存行保留在内存中;在此状态下不允许进行任何更改).如果该位置在内存中,则将跳过此缓存未命中. 可能的成本:1个高速缓存未命中.(如果高速缓存行处于共享",互斥"或已修改"状态,即数据位于此CPU的L1高速缓存中,则跳过此操作.)
  • 程序计算要存储的新值,
  • ,它运行原子CAS指令.
    • 它必须避免并发修改,因此它必须从其他CPU的高速缓存中删除高速缓存行的副本,以将高速缓存行移至独占"状态. 可能的成本:1个缓存未命中.如果它已经是专有拥有的,即处于独占"或已修改"状态,则不需要此操作.在这两种状态下,没有其他CPU拥有高速缓存行,但是在排他"状态下,尚未对其进行修改(尚未).
    • 进行此通信后,在我们的CPU的本地缓存中修改了该变量,这时该变量对于所有其他CPU都是全局可见的(因为它们的缓存与我们的缓存一致).最终将根据常规算法将其写入主存储器.
    • 其他尝试读取或修改该变量的处理器将首先必须以共享"或独占"模式获取该缓存行,并且这样做将与该处理器联系并接收缓存行的更新版本. 相反,锁定操作只能造成高速缓存未命中(因为高速缓存行将直接以独占"状态请求).
    • the LOAD causes a cache miss - the relevant cacheline is loaded from memory in Shared state (i.e. other processors are still allowed to keep that cacheline in memory; no changes are allowed in this state). If the location is in memory, this cache miss is skipped. Possible cost: 1 cache miss. (skipped if the cacheline is in Shared, Exclusive or Modified state, i.e. the data is in this CPU's L1 cache).
    • the program calculates the new values to store,
    • and it runs an atomic CAS instruction.
      • It has to avoid concurrent modification, so it must remove copies of the cacheline from the cache of other CPUs, to move the cacheline to the Exclusive state. Possible cost: 1 cache miss. This is not needed if it is already exclusively owned, i.e. in the Exclusive or Modified state. In both states, no other CPUs hold the cacheline, but in the Exclusive state it has not been modified (yet).
      • After this communication, the variable is modified in our CPU's local cache, at which point it is globally visible to all other CPUs (because their caches are coherent with ours). It will eventually be written to main memory according to usual algorithms.
      • Other processors trying to read or modify that variable will first have to obtain that cacheline in Shared or Exclusive mode, and to do so will contact this processor and receive the updated version of the cacheline. A LOCKed operation, instead, can only cost a cache miss (because the cacheline will be requested directly in Exclusive state).

      在所有情况下,缓存行请求都可能已由其他已经修改数据的处理器暂停.

      In all cases, a cacheline request can be stalled by other processors already modifying the data.

      这篇关于原子操作成本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆