What are the CPU-internal characteristics of CAS collision?

Problem description

I'm trying to understand the low-level mechanics of CAS on x86/x64 and I'd really appreciate some help/insight.

The reason I've been thinking about this is that I'm trying to reason about exponential backoff, and figure out in principle what the correct single unit of backoff delay should be.
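
For concreteness, this is roughly the retry-loop shape I have in mind - a CAS-based increment with exponential backoff, where backoff_unit is exactly the "single unit of delay" I'm trying to reason about (this is my own sketch using C11 atomics and the x86 pause hint, not code from any particular library):

#include <stdatomic.h>
#include <stdint.h>
#include <immintrin.h>   /* _mm_pause() - x86/x64 spin-wait hint */

/* Sketch only: a CAS-based increment that backs off exponentially
   after each failed CAS.  What should backoff_unit be, in principle? */
static void increment_with_backoff(_Atomic uint64_t *dest, unsigned backoff_unit)
{
    uint64_t old = atomic_load_explicit(dest, memory_order_relaxed);
    unsigned delay = backoff_unit;

    while (!atomic_compare_exchange_weak(dest, &old, old + 1)) {
        /* the CAS reported "not equal": another thread got in first;
           'old' has been refreshed with the value that is now there */
        for (unsigned i = 0; i < delay; i++)
            _mm_pause();
        delay *= 2;      /* exponential backoff (cap omitted for brevity) */
    }
}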

If I look at a lock-free freelist benchmark without exponential backoff, I see that as the number of threads increases, performance rapidly flatlines.

Release 7 Lock-Free Freelist Benchmark #1

   M
   N
   S
  L3U
L2U L2U
L1D L1D
L1I L1I
 P   P
L L L L  total ops, mean ops/sec per thread, standard deviation, scalability
0 0 0 1  310134488, 31013449,  0,     1.00
0 1 0 1  136313300, 6815665,   38365, 0.22
0 1 0 1  136401284, 6820064,   50706, 0.22
1 1 1 1  111134328, 2778358,   23851, 0.09
0 0 1 1  334747444, 16737372,  2421,  0.54
1 1 1 1  111105898, 2777647,   40399, 0.09

As we know, live-lock can occur, where each thread prevents the others from progressing.

My original - and I believe now mistaken - thought was that CAS was interfering with CAS. By that I mean, the CAS instruction itself would destructively collide with another CAS, should they occur concurrently. Both would fail. (Probably because, in the back of my mind, I was thinking about Ethernet.)

This 'obviously' explains the results - with all those CAS instructions operating concurrently, very few have a chance to execute fully before being destructively interrupted.

Having thought about it some more, I believe now this cannot be the case. The CAS instruction does not in fact HAVE a failure mode. It will tell you whether the destination is equal to the comparand. That's all. It doesn't come back and say "oh, sorry, bumped into someone else".
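
To spell out that mental model: my understanding is that an x86/x64 CAS (lock cmpxchg) behaves, as one indivisible step, like the following sequential C description - success or failure of the compare, and nothing else (the function is purely illustrative and is obviously not itself atomic):

#include <stdint.h>

/* Sequential model of what one CAS does, all as a single atomic step.
   There is no third outcome such as "collided with another CAS".      */
static int cas_model(uint64_t *dest, uint64_t *expected, uint64_t desired)
{
    if (*dest == *expected) {
        *dest = desired;      /* equal: the swap happens               */
        return 1;
    }
    *expected = *dest;        /* not equal: report the value we found  */
    return 0;                 /* the compare failed; nothing else did  */
}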

Destructive interference IS occurring, but it's occurring at a higher level, in the data structure algorithm itself. When we push to or pop from the freelist, we are actually TRYING to swap. We need the destination to be stable for long enough that we can read it, do whatever work we need to do, and then find it unchanged so we can complete our push/pop.

If other threads keep CASing, the destination isn't stable - it keeps changing - and we keep having to retry our operation.
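
To make the "higher level" concrete, a freelist push looks roughly like this (a simplified sketch with C11 atomics; ABA protection and memory ordering, which the real benchmark has to deal with, are omitted):

#include <stdatomic.h>

struct node { struct node *next; };

/* Push onto a lock-free freelist.  The CAS itself never "fails
   internally"; the algorithm retries whenever the head moved between
   our read and our CAS - that is where the destructive interference
   lives.                                                              */
static void freelist_push(struct node *_Atomic *head, struct node *n)
{
    struct node *old_head = atomic_load(head);
    do {
        n->next = old_head;   /* build on the head value we read       */
    } while (!atomic_compare_exchange_weak(head, &old_head, n));
          /* on failure old_head is refreshed and we go around again   */
}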

But now I'm confused.

What we see is that a single thread performs about 30 million push/pop operations. The destination has to be stable for the duration of one of these operations for the operation to succeed, so we see there are 30 million 'slots'. If we have two threads, then the maximum theoretical performance we can have is 15 million operations per thread; each thread using half the slots.

Now let's come back to CAS. CAS has no failure mode. So what happens when a second thread tries to CAS while another thread is already CASing? Well, the second thread will fail at the data structure level, since the swap couldn't occur, so it will retry the swap.

But now imagine we have LOTS of threads. The first thread to begin a CAS will succeed (assuming each CAS takes exactly the same time - not true, but that assumption doesn't change anything fundamental, so it's okay to reason with). All the others will fail.

But once the first thread has finished, the very next thread to read the new destination value will have its CAS succeed (and all the other threads, still executing their CASs or now beginning new CASs, will fail).

So why do we not see perfect scaling? Because every 'slot' should be getting used!

I think therefore I do not understand CAS properly.

Reading Intel's Architecture Software Developer's Manual, I find that if all the data is present in cache (which is the situation I'm interested in), the cache coherency protocol takes care of CAS.

Drepper in his white paper describes LL/SC and how it works using MESI.

It seems reasonable to me for CAS to operate in a similar manner.

Let's consider the two thread case. First thread begins its CAS. The cache line with the destination is in its cache and marked exclusive.

Second thread begins to CAS. The first core sends its cache line over to the second core and both cores have that cache line marked shared.

First thread completes the CAS and writes to the cache line (the write always occurs on x86/x64, even if the compare was false; it just writes the original value).

The act of writing marks the cache line as modified; an RFO (read-for-ownership) occurs, causing the second core to mark its copy of the cache line as invalid.

The second thread comes to complete its CAS and notices its cache line is invalid... and then what? I find it hard to believe the instruction is internally looped inside the CPU until it succeeds - although I wonder, because LL/SC on ARM requires you to write this loop yourself in your assembly. But the CAS instruction knows that the value of the destination has changed, so the results of its compare are invalid. But there's no error possible with CAS; it always returns true or false for the compare. But even if the instruction does loop until complete, I'd still expect perfect scaling. Each 'slot' should still be used.
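
For what it's worth, where the loop lives can be seen by compiling a single C11 strong compare-exchange for the two targets: on x86/x64 it becomes one lock cmpxchg, while on ARM (before ARMv8.1 added a native CAS) the compiler itself has to emit an LDREX/STREX retry loop, because a store-conditional can fail spuriously. This is just my own illustration of the point:

#include <stdatomic.h>
#include <stdint.h>

/* One strong compare-exchange.  x86/x64: a single lock cmpxchg.
   ARM with LL/SC: an LDREX/STREX loop emitted by the compiler,
   since the store-conditional may fail for reasons other than
   the value having changed.                                      */
static int one_cas(_Atomic uint64_t *dest, uint64_t expected, uint64_t desired)
{
    return atomic_compare_exchange_strong(dest, &expected, desired);
}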

So what happens? What is happening with CAS?

What I do see is that as the thread count rises, less and less work is done - all the available 'slots' are certainly not being used. Something is causing this. Is it destructive interference between CAS instructions? Or is it a large number of RFOs hogging the CPU->northbridge bus?

What I do note with considerable interest is that two threads on the same physical core scale perfectly. Something special and different is happening in that case - two threads on separate physical cores scale half as well. But it's not enough of a clue to explain it all.
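
One way to reproduce the two placements deliberately is to pin the threads explicitly; something along these lines (Linux-specific, and which core numbers share a physical core depends on the machine's topology):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one logical core, so the "two threads on
   one physical core" and "two threads on two physical cores" cases
   can be compared directly.                                          */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}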

Answer

What you're seeing here is the cost of moving the data between the L1 caches of the two physical cores. When only one core is used, the data sits in that core's L1 cache and each CAS operates at full speed on cached data. When two cores are active, on the other hand, each time a core succeeds in writing to the data, it invalidates the other core's copy, and the data has to be copied between the caches before the other core can do anything (generally, it will block waiting for the load that precedes the CAS to complete). This is much more expensive than the actual CAS (it needs to move the data up to the L3 cache at least and then back down to the other L1 cache), and it results in the slowdown you see, as the data ends up ping-ponging back and forth between the two L1 caches.
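
A simple way to convince yourself that the line transfer, not the CAS instruction, is the cost: give each thread its own counter on its own cache line and the "collision" disappears, even though the CAS instructions are identical (a sketch assuming 64-byte cache lines and GCC-style alignment attributes):

#include <stdatomic.h>
#include <stdint.h>

/* Shared case: both threads CAS the same line, so it ping-pongs
   between the two cores' L1 caches on every successful write.      */
static _Atomic uint64_t shared_counter;

/* Separated case: one counter per 64-byte cache line.  Each line
   stays resident in one core's L1, so throughput scales with the
   number of threads even though the CAS instructions are the same. */
struct padded_counter {
    _Atomic uint64_t value;
    char pad[64 - sizeof(uint64_t)];
};
static struct padded_counter per_thread[2] __attribute__((aligned(64)));

/* The same CAS-increment loop is used against either variant.      */
static void cas_increment(_Atomic uint64_t *p)
{
    uint64_t old = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak(p, &old, old + 1))
        ;   /* 'old' is refreshed on failure; retry                 */
}

Of course a freelist has only one head, so padding doesn't remove the contention there; it just separates the cost of the cache-line movement from the cost of the CAS instruction itself.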
