Why did Intel change the static branch prediction mechanism over these years?


Question

From here I know Intel has implemented several static branch prediction mechanisms over the years:

  • 80486 age: Always-not-taken

  • Pentium4 age: Backwards Taken/Forwards Not-Taken

  • Newer CPUs like Ivy Bridge and Haswell have become increasingly intangible; see Matt G's experiment here.

And Intel seems not to want to talk about it any more, because the latest material I found in Intel's documentation was written about ten years ago.

I know static branch prediction is (far?) less important than dynamic prediction, but in quite a few situations the CPU will be completely lost and programmers (with the compiler) are usually the best guide. Of course these situations are usually not performance bottlenecks, because once a branch is frequently executed, the dynamic predictor will capture it.

Since Intel no longer clearly states the dynamic prediction mechanism in its documentation, GCC's __builtin_expect() can do nothing more than remove the unlikely branch from the hot path.

I am not familiar with CPU design and I don't know what exact mechanism Intel's static predictor uses nowadays, but I still feel the best thing Intel could do is clearly document where its CPU intends to go - forwards or backwards - when the dynamic predictor fails, because the programmer is usually the best guide at that time.
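
For reference, a minimal sketch of how the __builtin_expect() hint mentioned above is typically used with GCC; the likely/unlikely macro names and the checked_sum function are only illustrations, not anything from the question:

    #include <stddef.h>

    /* Conventional wrappers around GCC's __builtin_expect (the macro names are a
       common idiom, not part of the compiler itself). */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int checked_sum(const int *buf, size_t len)
    {
        if (unlikely(buf == NULL))      /* error path: laid out off the hot path */
            return -1;

        int sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }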

Update:
I find the topics you mentioned go somewhat beyond my understanding. They involve dynamic prediction mechanisms and CPU internals that I cannot learn in two or three days, so please allow me to step out of this discussion for a while and recharge.
Any answer is still welcome here and may help more people.

Answer

The primary reason static prediction is not favored in modern designs, to the point where it may not even be present, is that static prediction occurs too late in the pipeline compared to dynamic prediction. The basic problem is that branch directions and target locations need to be known before fetching can continue from them, but a static prediction can only be made after decode (which comes after fetch).

In more detail...

Briefly, during execution the CPU needs to fetch instructions from memory, decode those instructions and then execute them1. On a high-performance CPU, these stages will be pipelined, meaning that they will all generally be happening in parallel - but for different instructions at any given moment. You could read a bit about this on Wikipedia, but keep in mind that modern CPUs are more complex, generally with many more stages.

On a modern x86, with a complex-to-decode variable-length instruction set, there may be many pipeline "stages" involved simply in fetching and decoding instructions, perhaps a half-dozen or more. Such CPUs are also superscalar, capable of executing several instructions at once. This implies that when executing at peak efficiency, there will be many instructions in flight, in various stages of being fetched, decoded, executed and so on.

The effect of a taken branch is felt on the entire initial portion (usually called the front-end) of the pipeline: when you jump to a new address, you need to fetch from that new address, decode from that new address, etc. We say that a taken branch needs to redirect fetch. This puts certain restrictions on the information that branch prediction can use in order to perform efficiently.

Consider how static prediction works: it looks at the instruction and if it is a branch, compares its target to see if it is "forwards" or "backwards". All this must happen largely after decoding has occurred, since that's when the actual instruction is known. However, if a branch is detected and is predicted taken (e.g., a backwards jump), the predictor needs to redirect fetch, which is many pipeline stages earlier. By the time fetch gets redirected after decoding instruction N there are already many subsequent instructions that were fetched and decoded on the wrong (not taken) path. Those have to be thrown away. We say that a bubble is introduced in the front-end.
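
The classic rule itself is trivial to state; here is a minimal sketch of the backwards-taken/forwards-not-taken decision described above (just the rule, not Intel's actual hardware):

    #include <stdbool.h>
    #include <stdint.h>

    /* BTFNT static rule: a conditional branch whose target is at a lower address
       than the branch itself (a "backwards" branch, typically a loop back-edge) is
       predicted taken; a forwards branch is predicted not-taken. Note that both
       addresses are only known after the instruction has been decoded. */
    bool static_predict_taken(uintptr_t branch_addr, uintptr_t target_addr)
    {
        return target_addr < branch_addr;
    }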

The upshot of all of this is that even if static prediction is 100% correct, it is very inefficient in the taken branch case since the front-end pipelining is defeated. If there are 6 pipeline stages between fetch and the end of decode, every taken branch causes a 6-cycle bubble in the pipeline, with the generous assumption that the prediction itself and flushing bad-path instructions take "zero cycles".

Modern x86 CPUs, however, are able to execute taken branches at up to 1 every cycle, much better than the limit even for perfectly predicted static execution. To achieve this, the predictor usually cannot use information available after decoding. It must be able to redirect fetch every cycle and use only inputs available with a latency of one cycle after the last prediction. Essentially, this means the predictor is basically a self-contained process that uses only its own output as input for the next cycle's prediction.

This is the dynamic predictor on most CPUs. It predicts where to fetch from next cycle, and then based on that prediction it predicts where to fetch from the cycle after that, and so on. It doesn't use any information about the decoded instructions, but only past behavior of the branches. It does eventually get feedback from the execution units about the actual direction of the branch, and updates its predictions based on that, but this all happens essentially asynchronously, many cycles after the relevant instruction has passed through the predictor.
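
As a conceptual sketch only (a toy model, not any specific Intel implementation), the front end behaves like a loop that feeds its own output back in every cycle:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy stand-in for the predictor's internal tables (purely illustrative). */
    uint64_t predict_next_block(uint64_t fetch_block)
    {
        /* A real predictor would consult its BTB/history tables here; this toy
           model just predicts the next sequential 16-byte fetch block. */
        return fetch_block + 16;
    }

    int main(void)
    {
        uint64_t next = 0x1000;                 /* pretend reset address */
        for (int cycle = 0; cycle < 8; cycle++) {
            printf("cycle %d: fetch block %#llx\n", cycle, (unsigned long long)next);
            next = predict_next_block(next);    /* uses only the predictor's own state;
                                                   no decoded-instruction information */
        }
        return 0;
    }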

All of this serves to chip away at the usefulness of static prediction.

First, the prediction comes too late, so even when working perfectly it implies a bubble of 6-8 cycles on modern Intel for taken branches (indeed, these are observed figures from so-called "front-end resteers" on Intel). This dramatically changes the cost/benefit equation for making a prediction at all. When you have a dynamic predictor before fetch making a prediction, you more-or-less want to make some prediction and if it has even 51% accuracy it will probably pay off.

For static predictions, however, you need to have high accuracy if you ever want to make a "taken" prediction. Consider, for example, an 8-cycle front-end resteer cost, versus a 16 cycle "full mispredict" cost. Let's say in some program that cold backwards branches are taken twice as often as not taken. This should be a win for static branch prediction that predicts backwards-taken, right (compared to a default strategy of always "predicting"2 not-taken)?

Not so fast! If you assume an 8-cycle re-steer cost and a 16-cycle full mispredict cost, they end up having the same blended cost of 10.67 cycles - because even in the correctly predicted taken case there is an 8-cycle bubble, while the fall-through case has no corresponding cost in the no-static-prediction case.
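
Spelling out the arithmetic with the figures above (branches taken 2/3 of the time, 8-cycle resteer, 16-cycle full mispredict):

    static backwards-taken:  2/3 x 8  + 1/3 x 16 = 32/3 ≈ 10.67 cycles
    default not-taken:       2/3 x 16 + 1/3 x 0  = 32/3 ≈ 10.67 cycles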

Add to that the fact that the no-static-prediction case already gets the other half of static prediction correct (the forward-branches-not-taken case), and the utility of static prediction is not as large as one would imagine.

Why the change now? Perhaps because the front-end part of the pipeline has lengthened compared to the other parts, or because the increasing performance and memory of the dynamic predictors means that fewer cold branches are eligible for static prediction at all. The improving performance of dynamic predictors also means that the backwards-taken prediction becomes less valuable for cold branches, because loops (which are the reason for the backwards-taken rule) are more frequently remembered by the dynamic predictor.
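
For concreteness (a generic example, not code from the answer): the backwards-taken rule exists because the conditional jump a compiler emits at the bottom of a loop targets an earlier address, and it is taken on every iteration except the last.

    long sum_array(const long *a, long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++)    /* the loop's back-edge branch jumps backwards */
            sum += a[i];                /* and is taken on every iteration but the last */
        return sum;
    }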

The change could also be because of an interaction with dynamic prediction: one design for a dynamic predictor is not to use any branch prediction resources at all for a branch that is never observed to be taken. Since such branches are common, this can save a lot of history table and BTB space. However, such a scheme is inconsistent with a static predictor that predicts backwards branches as taken: if a backwards branch is never taken, you don't want the static predictor to pick up this branch and predict it as taken, messing up your strategy of saving resources for not-taken branches.
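
A toy sketch of the kind of scheme described here (no claim that any Intel part works exactly this way): predictor storage is only allocated the first time a branch is actually observed taken, so never-taken branches cost nothing.

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024
    static bool has_entry[TABLE_SIZE];   /* stands in for BTB/history storage */

    /* Predict taken only if this branch has been observed taken before; branches
       that are never taken consume no predictor resources at all. */
    bool predict_taken(uint64_t branch_addr)
    {
        return has_entry[branch_addr % TABLE_SIZE];
    }

    void update_predictor(uint64_t branch_addr, bool was_taken)
    {
        if (was_taken)
            has_entry[branch_addr % TABLE_SIZE] = true;   /* allocate on first taken */
    }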

1 ... and also then do more stuff like retire them - but what happens after execute mostly isn't important for our purposes here.

2 I put "predicting" in scare-quotes here because in a way it's not even predicting: not-taken is the default behavior of fetch and decode in the absence of any prediction to the contrary, so it's what you get if you don't put in any static prediction at all, and your dynamic predictor doesn't tell you otherwise.
