Why did Intel change the static branch prediction mechanism over these years?


Question

From here I know Intel implemented several static branch prediction mechanisms over the years:

  • 80486 age: Always-not-taken

  • Pentium4 age: Backwards Taken/Forwards Not-Taken

  • Newer CPUs like Ivy Bridge and Haswell have become increasingly intangible, see Matt G's experiment here.

And Intel seems not to want to talk about it any more, because the latest material I found in Intel's documentation was written about ten years ago.

I know static branch prediction is (far?) less important than dynamic prediction, but in quite a few situations the CPU will be completely lost, and programmers (with the compiler) are usually the best guide. Of course these situations are usually not performance bottlenecks, because once a branch is executed frequently, the dynamic predictor will capture it.

Since Intel no longer clearly states the static prediction mechanism in its documentation, GCC's __builtin_expect() can do nothing more than remove the unlikely branch from the hot path.

I am not familiar with CPU design and I don't know exactly what mechanism Intel uses nowadays for its static predictor, but I still feel the best approach for Intel would be to clearly document for its CPUs "where I plan to go when the dynamic predictor fails, forward or backward", because usually the programmer is the best guide at that time.

Update:
I found that the topics you mentioned gradually go beyond my knowledge. Some dynamic prediction mechanisms and CPU internals are involved which I cannot learn in two or three days. So please allow me to step out of your discussion for now and recharge.
Any answer is still welcome here and will perhaps help more people.

Answer

The primary reason why static prediction is not favored in modern designs, to the point of perhaps not even being present, is that static predictions occur too late in the pipeline compared to dynamic predictions. The basic issue is that branch directions and target locations must be known before fetching them, but static predictions can only be made after decode (which comes after fetch).

In more detail...

Briefly, execution involves fetching instructions from memory, decoding those instructions and then executing them1. On a high-performance CPU, these stages will be pipelined, meaning that they will all generally be happening in parallel - but for different instructions at any given moment. You could read a bit about this on Wikipedia, but keep in mind that modern CPUs are more complex, generally with many more stages.

On a modern x86, with a complex-to-decode variable-length instruction set, there may be many pipeline "stages" involved simply in fetching and decoding instructions, perhaps a half-dozen or more. Such CPUs are also superscalar, capable of executing several instructions at once. This implies that when executing at peak efficiency, there will be many instructions in flight, in various stages of being fetched, decoded, executed and so on.

The effect of a taken branch is felt on the entire initial portion (usually called the front-end) of the pipeline: when you jump to a new address, you need to fetch from that new address, decode from that new address, etc. We say that a taken branch needs to redirect fetch. This puts certain restrictions on the information that branch prediction can use in order to perform efficiently.

Consider how static prediction works: it looks at the instruction and if it is a branch, compares its target to see if it is "forwards" or "backwards". All this must happen largely after decoding has occurred, since that's when the actual instruction is known. However, if a branch is detected and is predicted taken (e.g., a backwards jump), the predictor needs to redirect fetch, which is many pipeline stages earlier. By the time fetch gets redirected after decoding instruction N there are already many subsequent instructions that were fetched and decoded on the wrong (not taken) path. Those have to be thrown away. We say that a bubble is introduced in the front-end.

The upshot of all of this is that even if static prediction is 100% correct, it is very inefficient in the taken branch case since the front-end pipelining is defeated. If there are 6 pipeline stages between fetch and the end of decode, every taken branch causes a 6-cycle bubble in the pipeline, with the generous assumption that the prediction itself and flushing bad-path instructions take "zero cycles".

Modern x86 CPUs, however, are able to execute up to 1 taken branch every cycle, much better than the limit even for perfectly predicted static execution. To achieve this, the predictor usually cannot use information available after decoding. It must be able to redirect fetch every cycle and use only inputs available with a latency of one cycle after the last prediction. Essentially, this means the predictor is basically a self-contained process that uses only its own output as input for the next cycle's prediction.

This is the dynamic predictor on most CPUs. It predicts where to fetch from next cycle, and then based on that prediction it predicts where to fetch from the cycle after that, and so on. It doesn't use any information about the decoded instructions, but only past behavior of the branches. It does eventually get feedback from the execution units about the actual direction of the branch, and updates its predictions based on that, but this all happens essentially asynchronously, many cycles after the relevant instruction has passed through the predictor.

All of this serves to chip away at the usefulness of static prediction.

First, the prediction comes too late, so even when working perfectly it implies a bubble of 6-8 cycles on modern Intel for taken branches (indeed, these are observed figures from so-called "front-end resteers" on Intel). This dramatically changes the cost/benefit equation for making a prediction at all. When you have a dynamic predictor before fetch making a prediction, you more-or-less want to make some prediction and if it has even 51% accuracy it will probably pay off.

For static predictions, however, you need to have high accuracy if you ever want to make a "taken" prediction. Consider, for example, an 8-cycle front-end resteer cost, versus a 16 cycle "full mispredict" cost. Let's say in some program that cold backwards branches are taken twice as often as not taken. This should be a win for static branch prediction that predicts backwards-taken, right (compared to a default strategy of always "predicting"2 not-taken)?

Not so fast! If you assume an 8-cycle re-steer cost and a 16-cycle full mispredict cost, they end up having the same blended cost of 10.67 cycles - because even in the correctly predicted taken case there is an 8-cycle bubble, but in the fall-through case there is no corresponding cost for the no-static-prediction case.

Add to that the fact that the no-static-prediction case already gets the other half of static prediction correct (the forward-branches-not-taken case), and the utility of static prediction is not as large as one would imagine.

Why the change now? Perhaps because the front-end part of the pipeline has lengthened compared to the other parts, or because the increasing performance and memory of the dynamic predictors means that fewer cold branches are eligible for static prediction at all. Increasing performance of the dynamic predictors also means that the backwards-taken prediction becomes less valuable for cold branches, because loops (which are the reason for the backwards-taken rule) are more frequently remembered by the dynamic predictor.

The change could also be because of an interaction with dynamic prediction: one design for a dynamic predictor is not to use any branch prediction resources at all for a branch that is never observed to be taken. Since such branches are common, this can save a lot of history table and BTB space. However, such a scheme is inconsistent with a static predictor that predicts backwards branches as taken: if a backwards branch is never taken, you don't want the static predictor to pick up this branch, predicting it as taken and so messing up your strategy of saving resources for not-taken branches.

1 ... and then also do more stuff like retire them - but what happens after execute mostly isn't important for our purposes here.

2 I put "predicting" in scare-quotes here because in a way it's not even predicting: not-taken is the default behavior of fetch and decode in the absence of any prediction to the contrary, so it's what you get if you don't put in any static prediction at all, and your dynamic predictor doesn't tell you otherwise.

