Why flush the pipeline for Memory Order Violation caused by other logical processors?

Question

The Memory Order Machine Clear performance event is described by the vTune documentation as:

The memory ordering (MO) machine clear happens when a snoop request from another processor matches a source for a data operation in the pipeline. In this situation the pipeline is cleared before the loads and stores in progress are retired.

However I don't see why that should be the case. There is no synchronisation order between loads and stores on different logical processors.
The processor could just pretend the snoop happened after all the current in-flight data operations are committed.

Another source describes it this way:

A memory ordering machine clear gets triggered whenever the CPU core detects a "memory ordering conflict". Basically, this means that some of the currently pending instructions tried to access memory that we just found out some other CPU core wrote to in the meantime. Since these instructions are still flagged as pending while the "this memory just got written" event means some other core successfully finished a write, the pending instructions – and everything that depends on their results – are, retroactively, incorrect: when we started executing these instructions, we were using a version of the memory contents that is now out of date. So we need to throw all that work out and do it over. That’s the machine clear.

But that makes no sense to me: the CPU doesn't need to re-execute the loads in the load queue, as there is no total order for non-locked loads/stores.

I could see a problem if loads were allowed to be reordered:

;foo is 0
mov eax, [foo]    ;inst 1
mov ebx, [foo]    ;inst 2
mov ecx, [foo]    ;inst 3

If the execution order were 1 3 2, then a store like mov [foo], 1 landing between 3 and 2 would cause

eax = 0
ebx = 1
ecx = 0

which would indeed violate the memory ordering rules.
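The scenario above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the helper names `run` and `is_legal` are hypothetical, memory is a plain dictionary, and the remote store is assumed to become visible immediately after a chosen load executes.

```python
# Toy model of the three loads from foo (hypothetical helper names).
REGISTERS = {1: "eax", 2: "ebx", 3: "ecx"}  # inst number -> destination register


def run(execution_order, store_after):
    """Execute the loads in `execution_order`; another core's store of 1 to
    foo becomes visible right after load number `store_after` has executed."""
    memory = {"foo": 0}
    regs = {}
    for inst in execution_order:
        regs[REGISTERS[inst]] = memory["foo"]
        if inst == store_after:
            memory["foo"] = 1  # the remote store lands here
    return regs


def is_legal(regs):
    """foo only ever goes from 0 to 1, so values read in program order
    (eax, ebx, ecx) must never go back from 1 to 0."""
    seen = [regs["eax"], regs["ebx"], regs["ecx"]]
    return seen == sorted(seen)


in_order = run([1, 2, 3], store_after=2)
reordered = run([1, 3, 2], store_after=3)
print(in_order, is_legal(in_order))
print(reordered, is_legal(reordered))
```

In this sketch the in-order run passes the check, while the 1 3 2 run produces the eax = 0, ebx = 1, ecx = 0 result shown above and fails it.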

But loads cannot be reordered with other loads, so why do Intel's CPUs flush the pipeline when a snoop request from another core matches the source of any in-flight load?
What erroneous situations is this behaviour preventing?

Recommended answer

Although the x86 memory ordering model does not allow loads to any memory type other than WC to be globally observable out of program order, the implementation actually allows loads to complete out of order. It would be very costly to stall issuing a load request until all previous loads have completed. Consider the following example:

load X
load Y
load Z

Assume that line X is not present in the cache hierarchy and has to be fetched from memory. However, both Y and Z are present in the L1 cache. One way to maintain the x86 load ordering requirement is by not issuing loads Y and Z until load X gets the data. However, this would stall all instructions that depend on Y and Z, resulting in a potentially massive performance hit.

Multiple solutions have been proposed and studied extensively in the literature. The one that Intel has implemented in all of its processors is to allow loads to be issued out of order and then check whether a memory ordering violation has occurred, in which case the violating load is reissued and all of its dependent instructions are replayed. But this violation can only occur when the following conditions are satisfied:

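  • A load has completed while a previous load in program order is still waiting for its data, and the two loads are to a memory type that requires ordering.
  • Another physical or logical core has modified the line read by the later load, and this change has been detected by the logical core that issued the loads before the earlier load gets its data.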

When both of these conditions occur, the logical core detects a memory ordering violation. Consider the following example:

------           ------
core1            core2
------           ------
load rdx, [X]    store [Y], 1
load rbx, [Y]    store [X], 2
add  rdx, rbx
call printf

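Assume that the initial state is:

  • [X] = [Y] = 0.
  • The cache line that contains Y is already present in the L1D of core1. But X is not present in the private caches of core1.
  • Line X is present in the L1D of core2 in a modifiable coherence state, and line Y is present in the L1D of core2 in a shareable state.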

According to the x86 strong ordering model, the only possible legal outcomes are 0, 1, and 3. In particular, the outcome 2 is not legal.
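One way to check the set of legal outcomes is to enumerate every interleaving of the two cores that keeps each core's program order. The sketch below is illustrative only: it treats memory as sequentially consistent, which is a simplification of the x86 model, but it should be adequate here because core1 performs only loads and core2 only stores, so no store is followed by a load on the same core. The names `interleavings` and `legal_sums` are hypothetical.

```python
# Enumerate in-order interleavings of the example (hypothetical helper names).
CORE1 = [("load", "X"), ("load", "Y")]          # program order on core1
CORE2 = [("store", "Y", 1), ("store", "X", 2)]  # program order on core2


def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def legal_sums():
    """Return the set of X + Y sums core1 can compute across all schedules."""
    sums = set()
    for schedule in interleavings(CORE1, CORE2):
        memory = {"X": 0, "Y": 0}
        loaded = []
        for op in schedule:
            if op[0] == "store":
                memory[op[1]] = op[2]
            else:
                loaded.append(memory[op[1]])
        sums.add(sum(loaded))
    return sums


outcomes = legal_sums()
print(sorted(outcomes))
```

Under these assumptions the enumeration yields 0, 1, and 3, and never 2: seeing X = 2 implies the earlier store to Y is already visible.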

The following sequence of events may occur:

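  • Core2 issues RFOs for both lines. The RFO for line X will complete quickly, but the RFO for line Y will have to go all the way to the L3 to invalidate the line in the private caches of core1. Note that core2 can only commit the stores in order, so the store to line X waits until the store to line Y commits.
  • Core1 issues the two loads to the L1D. The load from line Y completes quickly, but the load from line X requires fetching the line from core2's private caches. Note that the value of Y at this point is zero.
  • Line Y is invalidated from core1's private caches, and its state in core2 is changed to a modifiable coherence state.
  • Core2 now commits both stores in order.
  • Line X is forwarded from core2 to core1.
  • Core1 loads from cache line X the value stored by core2, which is 2.
  • Core1 prints the sum of X and Y, which is 0 + 2 = 2. This is an illegal outcome. Essentially, core1 has loaded a stale value of Y.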
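The sequence of events above can be replayed step by step in a small model. This is a sketch under loud assumptions: there are no real caches or coherence messages, memory is a dictionary holding the coherent values, and the function name `replay_without_check` is hypothetical. It only shows what core1 would compute if nothing checked the early load of Y.

```python
def replay_without_check():
    """Replay the event sequence with no ordering check on core1's loads."""
    memory = {"X": 0, "Y": 0}  # coherent values of the two lines
    core1 = {}

    # Core1 issues both loads. Y hits in its L1D and completes early,
    # while the load of X is still waiting for the line from core2.
    core1["rbx"] = memory["Y"]

    # Line Y is invalidated in core1, then core2 commits its stores in order.
    memory["Y"] = 1
    memory["X"] = 2

    # Line X is forwarded to core1 and the older load finally completes.
    core1["rdx"] = memory["X"]

    # add rdx, rbx
    return core1["rdx"] + core1["rbx"]


illegal_sum = replay_without_check()
print(illegal_sum)
```

The model returns 2, the one sum that the ordering rules forbid, because the value of Y was captured before core2's stores became visible.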

To maintain the ordering of loads, core1's load buffer has to snoop all invalidations to lines resident in its private caches. When it detects that line Y has been invalidated while there are pending loads that precede the completed load from the invalidated line in program order, a memory ordering violation occurs and the load has to be reissued, after which it gets the most recent value. Note that if line Y has been evicted from core1's private caches before it gets invalidated and before the load from X completes, it may not be able to snoop the invalidation of line Y in the first place. So there needs to be a mechanism to handle this situation as well.
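The snoop check described above might be modelled roughly as follows. This is a hedged sketch, not a description of Intel's actual hardware: `LoadEntry`, `LoadBuffer`, and `run_scenario` are hypothetical names, the buffer tracks whole lines by name, and reissuing a load is reduced to reading the dictionary again. The eviction case mentioned above is not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadEntry:
    line: str                    # cache line the load reads
    done: bool = False           # True once the load has obtained data
    value: Optional[int] = None


class LoadBuffer:
    """Toy load buffer; entries are kept in program order."""

    def __init__(self, program_order):
        self.entries = [LoadEntry(line) for line in program_order]
        self.replays = 0

    def complete(self, index, value):
        entry = self.entries[index]
        entry.done, entry.value = True, value

    def snoop_invalidate(self, line):
        """A completed load from `line` is a violation if any older load
        is still pending; such a load is marked for reissue."""
        for index, entry in enumerate(self.entries):
            older_pending = any(not o.done for o in self.entries[:index])
            if entry.line == line and entry.done and older_pending:
                entry.done, entry.value = False, None
                self.replays += 1


def run_scenario():
    memory = {"X": 0, "Y": 0}
    buffer = LoadBuffer(["X", "Y"])
    buffer.complete(1, memory["Y"])   # Y hits in L1D and completes early
    buffer.snoop_invalidate("Y")      # core2's RFO invalidates line Y
    memory["Y"] = 1                   # core2 commits its stores in order
    memory["X"] = 2
    buffer.complete(0, memory["X"])   # line X arrives from core2
    for index, entry in enumerate(buffer.entries):
        if not entry.done:            # reissue the violating load
            buffer.complete(index, memory[entry.line])
    return sum(e.value for e in buffer.entries), buffer.replays


total, replays = run_scenario()
print(total, replays)
```

With the check in place the model computes 3 after one reissue. The check keys on the line name rather than on the stored value, so in this sketch it would also fire when the new value happens to equal the old one.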

If core1 never uses one or both of the values loaded, a load ordering violation may occur, but it can never be observed. Similarly, if the values stored by core2 to lines X and Y are the same, a load ordering violation may occur, but is impossible to observe. However, even in these cases, core1 would still unnecessarily reissue the violating load and replay all of its dependencies.
