Why flush the pipeline for Memory Order Violation caused by other logical processors?

Question

The Memory Order Machine Clear performance event is described by the vTune documentation as:

The memory ordering (MO) machine clear happens when a snoop request from another processor matches a source for a data operation in the pipeline. In this situation the pipeline is cleared before the loads and stores in progress are retired.

However, I don't see why that should be the case. There is no synchronisation order between loads and stores on different logical processors.
The processor could just pretend the snoop happened after all the current in-flight data operations are committed.

The issue is also described here:

A memory ordering machine clear gets triggered whenever the CPU core detects a "memory ordering conflict". Basically, this means that some of the currently pending instructions tried to access memory that we just found out some other CPU core wrote to in the meantime. Since these instructions are still flagged as pending while the "this memory just got written" event means some other core successfully finished a write, the pending instructions – and everything that depends on their results – are, retroactively, incorrect: when we started executing these instructions, we were using a version of the memory contents that is now out of date. So we need to throw all that work out and do it over. That’s the machine clear.

But that makes no sense to me: the CPU doesn't need to re-execute the loads in the load queue, as there is no total order for non-locked loads/stores.

I could see a problem if loads were allowed to be reordered:

;foo is 0
mov eax, [foo]    ;inst 1
mov ebx, [foo]    ;inst 2
mov ecx, [foo]    ;inst 3

If the execution order were 1 3 2, then a store like mov [foo], 1 landing between 3 and 2 would cause

eax = 0
ebx = 1
ecx = 0

which would indeed violate the memory ordering rules.
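
The scenario above can be replayed with a small toy model. This is only a sketch: the `memory` and `regs` dictionaries and the `load` helper are hypothetical stand-ins for the machine state, not a model of any real pipeline.

```python
# Toy model: three loads of the same location executed in the order 1, 3, 2,
# with a remote store becoming visible between the execution of loads 3 and 2.
memory = {"foo": 0}
regs = {}

def load(reg, addr):
    """Copy the current memory value into a register."""
    regs[reg] = memory[addr]

load("eax", "foo")   # inst 1 executes first
load("ecx", "foo")   # inst 3 executes early, out of program order
memory["foo"] = 1    # another core's `mov [foo], 1` becomes visible
load("ebx", "foo")   # inst 2 executes last

# Read in program order (eax, ebx, ecx), the value goes 0 -> 1 -> 0:
# the later load saw an older value than the earlier one.
print(regs["eax"], regs["ebx"], regs["ecx"])  # prints: 0 1 0
```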

But loads cannot be reordered with other loads, so why do Intel's CPUs flush the pipeline when a snoop request from another core matches the source of any in-flight load?
What erroneous situations is this behaviour preventing?

Recommended answer

Although the x86 memory ordering model does not allow loads to any memory type other than WC to be globally observable out of program order, the implementation actually allows loads to complete out of order. It would be very costly to stall issuing a load request until all previous loads have completed. Consider the following example:

load X
load Y
load Z

Assume that line X is not present in the cache hierarchy and has to be fetched from memory. However, both Y and Z are present in the L1 cache. One way to maintain the x86 load ordering requirement is by not issuing loads Y and Z until load X gets its data. However, this would stall all instructions that depend on Y and Z, resulting in a potentially massive performance hit.

Multiple solutions have been proposed and studied extensively in the literature. The one that Intel has implemented in all of its processors is to allow loads to be issued out of order and then check whether a memory ordering violation has occurred, in which case the violating load is reissued and all of its dependent instructions are replayed. But this violation can only occur when the following conditions are satisfied:

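  • A load has completed while a previous load in program order is still waiting for its data, and both loads are to a memory type that requires ordering.
  • Another physical or logical core has modified the line read by the later load, and the logical core that issued the loads has detected this change before the earlier load gets its data.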

When both of these conditions occur, the logical core detects a memory ordering violation. Consider the following example:

------           ------
core1            core2
------           ------
load rdx, [X]    store [Y], 1
load rbx, [Y]    store [X], 2
add  rdx, rbx
call printf

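Assume the initial state is:

  • [X] = [Y] = 0.
  • The cache line containing Y is already present in the L1D of core1, but X is not present in any of core1's private caches.
  • Line X is present in core2's L1D in a modifiable coherence state, and line Y is present in core2's L1D in a shareable state.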

According to the x86 strong ordering model, the only possible legal outcomes are 0, 1, and 3. In particular, the outcome 2 is not legal.
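
One way to check the set of legal outcomes is to enumerate every interleaving in which each core's operations stay in program order. The sketch below is a simplification that treats memory as a single shared dictionary; the names `CORE1`, `CORE2`, and `legal_sums` are made up for illustration.

```python
from itertools import combinations

CORE1 = [("load", "X"), ("load", "Y")]            # program order on core1
CORE2 = [("store", "Y", 1), ("store", "X", 2)]    # program order on core2

def legal_sums():
    """Sums core1 can print when each core's operations stay in program order."""
    sums = set()
    total = len(CORE1) + len(CORE2)
    # Choose which slots of the global order belong to core1.
    for slots in combinations(range(total), len(CORE1)):
        mem = {"X": 0, "Y": 0}
        ops1, ops2 = iter(CORE1), iter(CORE2)
        loaded = []
        for i in range(total):
            if i in slots:
                _, addr = next(ops1)
                loaded.append(mem[addr])      # core1 load reads current memory
            else:
                _, addr, value = next(ops2)
                mem[addr] = value             # core2 store becomes visible
        sums.add(sum(loaded))
    return sums

print(sorted(legal_sums()))  # prints: [0, 1, 3]
```

The sum 2 never appears: it would require core1 to see the new X (written second) together with the old Y (written first), even though core1 reads X before Y.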

The following sequence of events may occur:

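  • Core2 issues RFOs for both lines. The RFO for line X will complete quickly, but the RFO for line Y has to go all the way to the L3 to invalidate the line in core1's private caches. Note that core2 can only commit stores in order, so the store to line X waits until the store to line Y commits.
  • Core1 issues the two loads to the L1D. The load from line Y completes quickly, but the load from X requires fetching the line from core2's private caches. Note that the value of Y at this point is zero.
  • Line Y is invalidated in core1's private caches, and its state in core2 changes to a modifiable coherence state.
  • Core2 now commits both stores in order.
  • Line X is forwarded from core2 to core1.
  • Core1 loads from cache line X the value stored by core2, which is 2.
  • Core1 prints the sum of X and Y, which is 0 + 2 = 2. This is an illegal outcome. Essentially, core1 has loaded a stale value of Y.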
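
The event sequence above can be walked through in a toy model that compares a core which ignores invalidations with one whose load buffer reacts to them. This is a sketch under strong simplifying assumptions: `run` and its `snoop_load_buffer` flag are hypothetical, caches are reduced to a single dictionary, and only the reissue of the violating load is modelled.

```python
def run(snoop_load_buffer):
    """Replay the event sequence and return the sum core1 would print."""
    mem = {"X": 0, "Y": 0}
    completed = {}            # loads core1 has already executed
    pending = {"X"}           # older load still waiting for its line

    # Core1: the load from Y hits in the L1D and completes early.
    completed["Y"] = mem["Y"]

    # Core2: the RFO for line Y invalidates core1's copy.
    invalidated = "Y"
    if snoop_load_buffer and invalidated in completed and pending:
        del completed[invalidated]   # ordering violation: discard early result

    # Core2 commits both stores in program order.
    mem["Y"] = 1
    mem["X"] = 2

    # Line X arrives at core1 and the older load completes.
    completed["X"] = mem["X"]
    pending.discard("X")

    # A discarded load is reissued and now reads the current value.
    if "Y" not in completed:
        completed["Y"] = mem["Y"]

    return completed["X"] + completed["Y"]

print(run(snoop_load_buffer=False))  # prints: 2 (illegal outcome)
print(run(snoop_load_buffer=True))   # prints: 3 (legal outcome)
```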

To maintain the ordering of loads, core1's load buffer has to snoop all invalidations to lines resident in its private caches. When it detects that line Y has been invalidated while there are pending loads that precede the completed load from the invalidated line in program order, a memory ordering violation occurs and the load has to be reissued, after which it gets the most recent value. Note that if line Y has been evicted from core1's private caches before it gets invalidated and before the load from X completes, core1 may not be able to snoop the invalidation of line Y in the first place. So there needs to be a mechanism to handle this situation as well.

If core1 never uses one or both of the values loaded, a load ordering violation may occur, but it can never be observed. Similarly, if the values stored by core2 to lines X and Y are the same, a load ordering violation may occur, but it is impossible to observe. However, even in these cases, core1 would still unnecessarily reissue the violating load and replay all of its dependencies.
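
The equal-values case can be illustrated for the printed sum from the earlier example. The helper `sums` below is hypothetical; it assumes both locations start at 0, lists the value pairs core1 may legally read, and compares their sums with the sum produced by the stale read.

```python
def sums(value_y, value_x):
    """Return (set of legal sums, sum produced by the stale read of Y)."""
    # (X, Y) pairs core1 may legally read, given that core2 stores Y first
    # and then X, while core1 loads X first and then Y.
    legal_pairs = [(0, 0), (0, value_y), (value_x, value_y)]
    legal = {x + y for x, y in legal_pairs}
    stale = value_x + 0       # new X combined with the stale Y = 0
    return legal, stale

for value_y, value_x in [(1, 2), (5, 5)]:
    legal, stale = sums(value_y, value_x)
    # True means the stale result is indistinguishable from a legal one.
    print(value_y, value_x, stale in legal)
# prints:
# 1 2 False
# 5 5 True
```

With distinct values the stale sum falls outside the legal set, so the violation would be visible. With equal values the stale sum coincides with a legal one, which is why the reissue in that case is wasted work.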
