The inner workings of Spectre (v2)


Question

I have done some reading about Spectre v2 and obviously you get the non-technical explanations. Peter Cordes has a more in-depth explanation, but it doesn't fully address a few details. Note: I have never performed a Spectre v2 attack, so I do not have hands-on experience. I have only read up on the theory.

My understanding of Spectre v2 is that you make an indirect branch mispredict, for instance on if (input < data.size). If the Indirect Target Array (which I'm not too sure of the details of, i.e. why it is separate from the BTB structure), which is rechecked at decode for the RIPs of indirect branches, does not contain a prediction, then it will insert the new jump RIP (branch execution will eventually insert the target RIP of the branch). For now, though, it does not know the target RIP of the jump, so no form of static prediction will work. My understanding is that it will always predict not-taken for new indirect branches, and when port 6 eventually works out the jump target RIP and the prediction, it will roll back using the BOB, update the ITA with the correct jump address, and then update the local and global branch history registers and the saturating counters accordingly.

The hacker needs to train the saturating counters to always predict taken, which, I imagine, they do by running if (input < data.size) multiple times in a loop where input is set to something that is indeed less than data.size (catching errors accordingly), and then, on the final iteration of the loop, making input greater than data.size (1000, for instance); the indirect branch will be predicted taken and will jump to the body of the if statement, where the cache load takes place.
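As a rough illustration, the training pattern described above might look like the sketch below. All names here (victim, data, data_size, probe) are hypothetical, and the actual leak only happens speculatively on real, unmitigated hardware; architecturally, this C code never performs the out-of-bounds access because the bounds check fails on the final call.

```c
#include <stddef.h>
#include <stdint.h>

uint8_t data[16];
size_t  data_size = 16;
uint8_t probe[256 * 512];   /* attacker-observable cache footprint */

/* Hypothetical victim: a bounds check guarding a secret-dependent load. */
int victim(size_t input) {
    if (input < data_size) {                 /* the branch being trained */
        uint8_t secret = data[input];
        (void)probe[(size_t)secret * 512];   /* secret-dependent load */
        return 1;                            /* branch taken */
    }
    return 0;                                /* branch not taken */
}

/* Training: many in-bounds calls saturate the predictor toward "taken",
 * then a single out-of-bounds input triggers the misprediction. */
void train_and_trigger(void) {
    for (int i = 0; i < 30; i++)
        victim((size_t)(i % 16));            /* always in bounds */
    victim(1000);                            /* final, out-of-bounds call */
}
```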

The if statement contains secret = data[1000] (a particular memory address, data[1000], that holds secret data is targeted for loading from memory into the cache), which will be allocated to the load buffer speculatively. The preceding indirect branch is still in the branch execution unit, waiting to complete.

I believe the premise is that the load needs to be executed (assigned a line fill buffer) before the load buffers are flushed on the misprediction; once it has been assigned a line fill buffer, nothing can be done. It makes sense that there isn't a mechanism to cancel a line fill buffer allocation, because the line fill buffer would have to pend before storing to the cache after returning the data to the load buffer. This could cause line fill buffers to become saturated, because instead of a buffer being deallocated when required (kept around to speed up other loads to the same address, but deallocated when no other line fill buffers are available), it would not be able to deallocate until it receives some signal that a flush is not going to occur, meaning it has to stall waiting for the previous branch to execute instead of immediately making the line fill buffer available for the stores of the other logical core. This signalling mechanism could be difficult to implement, perhaps it didn't cross their minds (pre-Spectre thinking), and it would also introduce delay whenever branch execution takes long enough for the hanging line fill buffers to cause a performance impact, e.g. if data.size is purposefully flushed from the cache (CLFLUSH) before the final iteration of the loop, meaning branch execution could take up to 100 cycles.

I hope my thinking is correct, but I'm not 100% sure. If anyone has anything to add or correct, then please do.

Answer

Sometimes the term "BTB" is used collectively to refer to all of the buffers used by the branch prediction unit. However, there are actually multiple buffers, all of which are used every cycle to make target and direction predictions. In particular, the BTB is used to make predictions for direct branches, the ITB (indirect target buffer) is used to make predictions for indirect branches other than returns, and the RSB is used to make predictions for returns. The ITB is also called the IBTB or Indirect Target Array; all of these terms are used by different vendors and researchers. Typically, the BTB is used to make initial predictions for all kinds of branch instructions when the other buffers miss, but as the predictor learns more about the branches, the other buffers come into play. If multiple dynamic instances of the same indirect branch all have the same target, the BTB might also be used instead of the ITB. The ITB is much more accurate when the same branch has multiple targets and is designed specifically to deal with such branches. See: Branch prediction and the performance of interpreters — Don't trust folklore. The Pentium M was the first Intel processor to implement separate BTB and ITB structures; all later Intel Core processors have dedicated ITBs.

The Spectre V1 exploit is based on training the BTB using an attacker program so that when the victim executes a branch that aliases the same BTB entry, the processor is tricked into speculatively executing instructions (called the gadget) that leak information. The Spectre V2 exploit is similar, but is based on training the ITB instead. The crucial difference here is that in V1 the processor mispredicts the direction of the branch, while in V2 it mispredicts the target of the branch (and, in the case of a conditional indirect branch, the direction as well, because we want it to be taken). In programs that are interpreted, JIT-compiled, or make use of dynamic polymorphism, there can be many indirect branches (other than returns). A particular indirect branch may never be intended to go to some location, but by mistraining the predictor it can be made to jump anywhere we want. It is exactly for this reason that V2 is so powerful: no matter where the gadget is and no matter what the intentional control flows of the program are, you can pick one of the indirect branches and make it speculatively jump to the gadget.
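To illustrate why such call sites are so common, here is a minimal sketch of an indirect branch in C (a function-pointer table; all names are illustrative). The single indirect call instruction in dispatch has several legitimate architectural targets, and its predicted target is whatever the ITB has learned from previous executions:

```c
typedef int (*handler_t)(int);

static int benign(int x) { return x + 1; }  /* intended target */
static int gadget(int x) { return x * 2; }  /* attacker-chosen target */

static handler_t table[2] = { benign, gadget };

/* One indirect call site, multiple architectural targets: the predictor
 * supplies a target before table[which] has even been loaded, which is
 * what V2 mistraining exploits. */
int dispatch(int which, int x) {
    return table[which](x);
}
```

The same pattern appears in C++ virtual calls, interpreter dispatch loops, and JIT-compiled code, which is why V2 has such a large attack surface.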

Note that typically the linear address of the target of a static direct branch remains the same throughout the lifetime of the program. There is only one situation where this may not be the case: dynamic code modification. So at least in theory, a Spectre exploit can be developed based on target misprediction of direct branches.

Regarding reclamation of LFBs, I don't really understand what you're saying. When a load request that missed the L1D receives the data into an LFB, the data is immediately forwarded to the bypass interconnect of the pipeline. There needs to be a way to determine which load uop has requested this data, so the data returned must be tagged with the uop ID of the load. The sources of the uops in the RS that are waiting for the data are represented as the uop IDs of the loads. In addition, the ROB entry that holds the load uop needs to be marked as completed so that it can be retired, and, on pre-SnB microarchitectures, the returned data needs to be written into the ROB. If, on a pipeline flush, an outstanding load request in an LFB is not cancelled, and the load uop ID gets reused for some other uop, then when the data arrives it might be incorrectly forwarded to whatever new uops are currently in the pipeline, thereby corrupting the microarchitectural state. So there needs to be a way to ensure that this does not happen under any circumstances. It is quite possible to cancel outstanding load requests and speculative RFOs on a pipeline flush by simply marking all of the valid LFB entries as "cancelled", just so that the data is not returned to the pipeline. However, the data might still be fetched and filled into one or more levels of the cache hierarchy. Requests in the LFB are identified by line-aligned physical addresses. There can be other possible designs.
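The cancellation scheme sketched in the previous paragraph could be modelled like this. This is purely a toy model of the idea, not a description of real hardware: on a pipeline flush every valid entry is marked cancelled so that returning data is dropped rather than forwarded, while the fill itself may still complete into the cache.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LFBS 10

typedef struct {
    bool     valid;
    bool     cancelled;
    uint64_t line_addr;   /* line-aligned physical address */
    unsigned uop_id;      /* load uop the data must be forwarded to */
} lfb_entry;

static lfb_entry lfb[NUM_LFBS];

/* Allocate an LFB for a miss; -1 means all LFBs are busy and the
 * load is rejected and must be retried. */
int lfb_alloc(uint64_t line_addr, unsigned uop_id) {
    for (int i = 0; i < NUM_LFBS; i++)
        if (!lfb[i].valid) {
            lfb[i] = (lfb_entry){ true, false, line_addr, uop_id };
            return i;
        }
    return -1;
}

/* Pipeline flush: don't deallocate anything, just stop forwarding. */
void flush_pipeline(void) {
    for (int i = 0; i < NUM_LFBS; i++)
        if (lfb[i].valid)
            lfb[i].cancelled = true;
}

/* Data return: forward to the pipeline only if not cancelled; the
 * entry is freed either way once the fill completes. */
bool data_return(int i) {
    bool forward = lfb[i].valid && !lfb[i].cancelled;
    lfb[i].valid = false;   /* fill still completes into the cache */
    return forward;
}
```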

I decided to run an experiment to determine exactly when the LFBs get deallocated on Haswell. Here is how it works:

Outer Loop (10K iterations):
    Inner Loop (100 iterations):
        10 load instructions to different cache lines, most of which miss the L2.
        LFENCE.
        A sequence of IMULs to delay the resolution of the jump by 18 cycles.
        Jump to Inner.
    3 load instructions to different cache lines.
    LFENCE.
    Jump to Outer.

For this to work, hyperthreading and both L1 prefetchers need to be turned off to ensure that we own all 10 LFBs of the L1.

The LFENCE instructions ensure that we don't run out of LFBs when executing on a correctly predicted path. The key idea here is that the inner jump will be mispredicted once per outer iteration, so up to 10 loads of the inner iteration that are on the mispredicted path can be allocated in the LFBs. Note that the LFENCE prevents loads from later iterations from being allocated. After a few cycles, the inner branch will be resolved and a misprediction occurs. The pipeline is cleared and the frontend is resteered to fetch and execute the load instructions in the outer loop.

There are two possible outcomes:


  • The LFBs that have been allocated for the loads on the mispredicted path are immediately released as part of the pipeline clear operation and made available for other loads. In this case, there will be no stalls due to LFB unavailability (counted using L1D_PEND_MISS.FB_FULL).
  • The LFBs are released only when the loads get serviced irrespective of whether they were on a mispredicted path.

When there are three loads in the outer loop after the inner jump, the measured value of L1D_PEND_MISS.FB_FULL is about equal to the number of outer iterations, i.e. one request per outer-loop iteration. This means that when the three loads on the correct path get issued to the L1D, the loads from the mispredicted path are still occupying 8 LFB entries, resulting in an FB-full event for the third load. This suggests that loads in the LFBs only get deallocated when the load actually completes.

If I put fewer than three loads in the outer loop, there are basically no FB-full events. There is one thing I noticed: for every additional load in the outer loop beyond three, L1D_PEND_MISS.FB_FULL increases by about 20K instead of the expected 10K. I think what's happening is that when a load request of a load uop gets issued to the L1D for the first time and all the LFBs are in use, it gets rejected. Then, when an LFB becomes available, two loads pending in the load buffer are sent to the L1D; one will be allocated an LFB and the other will get rejected. So we get two FB-full events per additional load. However, when there are three loads in the outer loop, only the third one is waiting for an LFB, so we get one event per outer-loop iteration. Essentially, the load buffer cannot distinguish between having one LFB available or two; it only gets to know that at least one LFB is free, and so it attempts to send two load requests at the same time, since there are two load ports.
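The accounting in the previous paragraph can be captured in a toy model: one stall for the third load per outer iteration, plus two rejections for each load beyond the third. The numbers here come from the experiment above (10K outer iterations), not from any hardware specification.

```c
/* Toy model of L1D_PEND_MISS.FB_FULL counts for the experiment above:
 * fewer than three outer loads fit in the free LFBs, the third load
 * stalls once per iteration, and each further load is rejected twice
 * (issued with all LFBs busy, then re-issued alongside another pending
 * load when one LFB frees up). */
int expected_fb_full(int outer_iters, int outer_loads) {
    if (outer_loads < 3)
        return 0;
    return outer_iters * (1 + 2 * (outer_loads - 3));
}
```

With 10K outer iterations this gives 0 events for two loads, about 10K for three, and about 20K more for each additional load, matching the measurements described above.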
