What happens when different CPU cores write to the same RAM address without synchronization?


Problem description


Let's assume that 2 cores are trying to write different values to the same RAM address (1 byte), at the same moment of time (plus-minus eta), and without using any interlocked instructions or memory barriers. What happens in this case and what value will be written to the main RAM? The first one wins? The last one wins? Undetermined behavior?

Solution

x86 (like every other mainstream SMP CPU architecture) has coherent data caches. It's impossible for two different caches (e.g. L1D of 2 different cores) to hold conflicting data for the same cache line.

The hardware imposes an order (by some implementation-specific mechanism to break ties in case two requests for ownership arrive in the same clock cycle from different cores). In most modern x86 CPUs, the first store won't be written to RAM, because there's a shared write-back L3 cache to absorb coherency traffic without a round-trip to memory.

Loads that appear after both the stores in the global order will see the value stored by whichever store went second.


(I'm assuming we're talking about normal (not NT) stores to cacheable memory regions (WB, not USWC, UC, or even WT). The basic idea would be the same in either case, though; one store would go first, the next would step on it. The data from the first store could be observed temporarily if a load happened to get between them in the global order, but otherwise the data from the store that the hardware chose to do second would be the long-term effect.)
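
A minimal C++ sketch of that scenario (names and values are illustrative, not from the original question): relaxed atomic stores are used so the program is free of data-race UB at the language level, yet on x86 each one compiles to a plain byte MOV, so this models two cores racing with ordinary unsynchronized stores.

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    // One shared byte; memory_order_relaxed compiles to a plain byte store
    // on x86, modeling the unsynchronized race described in the question.
    std::atomic<std::uint8_t> shared_byte{0};

    int main() {
        std::thread a([] { shared_byte.store(0xAA, std::memory_order_relaxed); });
        std::thread b([] { shared_byte.store(0xBB, std::memory_order_relaxed); });
        a.join();
        b.join();
        // The long-term value is whichever store's RFO won the cache-line
        // race: always exactly 0xAA or 0xBB, never a blend of the two.
        std::printf("final value: 0x%02X\n",
                    (unsigned)shared_byte.load(std::memory_order_relaxed));
    }

Run it in a loop (build with -pthread) and you may see either outcome win, but never a torn value.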

We're talking about a single byte, so the store can't be split across two cache lines, and thus every address is naturally aligned, so everything in "Why is integer assignment on a naturally aligned variable atomic on x86?" applies.
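
As a tiny illustration of that point, the C++ standard library can confirm at compile time that byte-sized accesses never need a lock (a sketch; the assertion holds on x86 and every mainstream platform):

    #include <atomic>
    #include <cstdint>

    // A 1-byte object is always naturally aligned and can never straddle a
    // cache-line boundary, so a plain load or store of it is one access.
    static_assert(std::atomic<std::uint8_t>::is_always_lock_free,
                  "byte-sized atomics compile to plain loads/stores");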


Coherency is maintained by requiring a core to acquire exclusive access to that cache line before it can modify it (i.e. make a store globally visible by committing it from the store queue to L1D cache).

This "acquiring exclusive access" stuff is done using (a variant of) the MESI protocol. Any given line in a cache can be Modified (dirty), Exclusive (owned by not yet written), Shared (clean copy; other caches may also have copies so an RFO (Read / Request For Ownership) is required before write), or Invalid. MESIF (Intel) / MOESI (AMD) add extra states to optimize the protocol, but don't change the fundamental logic that only one core can change a line at any one time.

If we cared about ordering of multiple changes to two different lines, then memory ordering and memory barriers would come into play. But none of that matters for this question about "which store wins" when the stores execute or retire in the same clock cycle.

When a store executes, it goes into the store queue. It can commit to L1D and become globally visible at any time after it retires, but not before; unretired instructions are treated as speculative and thus their architectural effects must not be visible outside the CPU core. Speculative loads have no architectural effect, only microarchitectural¹.

So if both stores become ready to commit at "the same time" (clocks are not necessarily synchronized between cores), one or the other will have its RFO succeed first and gain exclusive access, and make its store data globally visible. Then, soon after, the other core's RFO will succeed and update the cache line with its data, so its store comes second in the global store order observed by all other cores.

x86 has a total-store-order memory model where all cores observe the same order even for stores to different cache lines (except for always seeing their own stores in program order). Some weakly-ordered architectures like PowerPC would allow some cores to see a different total order from other cores, but this reordering can only happen between stores to different lines. There is always a single modification order for a single cache line. (Reordering of loads with respect to each other and other stores means that you have to be careful how you go about observing things on a weakly ordered ISA, but there is a single order of modification for a cache line, imposed by MESI).
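
C++ exposes this per-location guarantee directly: even memory_order_relaxed atomics have a single modification order per object that all threads agree on, which is the language-level reflection of MESI's single per-line order (a sketch with illustrative names):

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint8_t> x{0};

    // Two racing writers using the weakest ordering C++ offers:
    void writer_a() { x.store(1, std::memory_order_relaxed); }
    void writer_b() { x.store(2, std::memory_order_relaxed); }

    void reader() {
        std::uint8_t first  = x.load(std::memory_order_relaxed);
        std::uint8_t second = x.load(std::memory_order_relaxed);
        // Coherence: successive loads see values at non-decreasing positions
        // in x's one modification order. If that order turns out to be
        // 0 -> 1 -> 2, then no reader anywhere can observe 2 and then 1;
        // every core agrees on the same order for this one location.
        (void)first; (void)second;
    }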

Which one wins the race might depend on something as prosaic as the layout of the cores on the ring bus relative to which slice of shared L3 cache that line maps to. (Note the use of the word "race": this is the kind of race which "race condition" bugs describe. It's not always wrong to write code where two unsynchronized stores update the same location and you don't care which one wins, but it's rare.)

BTW, modern x86 CPUs have hardware arbitration for the case when multiple cores contend for atomic read-modify-write to the same cache line (and thus are holding onto it for multiple clock cycles to make lock add byte [rdi], 1 atomic), but regular loads/stores only need to own a cache line for a single cycle to execute a load or commit a store. I think the arbitration for locked instructions is a different thing from which core wins when multiple cores are trying to commit stores to the same cache line. Unless you use a pause instruction, cores assume that other cores aren't modifying the same cache line, and speculatively load early, and thus will suffer memory-ordering mis-speculation if it does happen. (What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)
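
For contrast, a hedged sketch of what the two kinds of access compile to (instruction selection is up to the compiler, but this is typical of GCC/Clang on x86-64):

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint8_t> counter{0};

    // Typically compiles to: lock add byte ptr [counter], 1
    // The lock prefix holds the cache line for the whole read-modify-write,
    // which is what the hardware arbitration above has to mediate.
    void contended_increment() {
        counter.fetch_add(1, std::memory_order_relaxed);
    }

    // A plain store is just a byte MOV; the core only needs ownership of the
    // line for the single cycle in which the store commits from the store
    // queue to L1D.
    void plain_store() {
        counter.store(0, std::memory_order_relaxed);
    }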

IDK if anything similar happens when two threads are both just storing without loading, but probably not, because stores aren't speculatively reordered and are decoupled from out-of-order execution by the store queue. Once a store instruction retires, the store is definitely going to happen, so OoO exec doesn't have to wait for it to actually commit. (And in fact it has to retire from the OoO core before it can commit, because that's how the CPU knows it's non-speculative; i.e. that no earlier instruction faulted or was a mispredicted branch.)


Footnotes:

  1. Spectre blurs that line by using a cache-timing attack to read microarchitectural state into the architectural state.

