即使对于很短的if语句主体,分支预测错误也会刷新整个管道吗? [英] Does a branch misprediction flush the entire pipeline, even for very short if-statement body?

查看:130
本文介绍了即使对于很短的if语句主体,分支预测错误也会刷新整个管道吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我读过的所有内容似乎都表明分支预测错误总是导致整个管道被刷新,这意味着浪费了很多周期。我从没听过有人提到任何简短的if条件例外。

Everything I've read seems to indicate that a branch misprediction always results in the entire pipeline being flushed, which means a lot of wasted cycles. I never hear anyone mention any exceptions for short if-conditions.

在某些情况下,这似乎真的很浪费。例如,假设您有一个独立的if语句,该语句具有非常简单的主体,可编译为1条CPU指令。如果一条子句将被一条指令编译成条件跳转。如果CPU预测分支不会被占用,它将开始执行if-body指令,并可以立即开始执行以下指令。现在,一旦对if-condition的评估达到了流水线的末端(可能是12个周期之后),CPU现在就知道其预测是对还是错。如果预测错误,并且实际上已执行了分支,则CPU实际上只需要从管道中丢弃1条指令(if-body中的一条)。但是,如果它冲洗了整个管道,那么按照以下说明完成的所有工作也将被浪费掉,并且无缘无故地必须重复执行。

This seems like it would be really wasteful in some cases. For example, suppose you have a lone if-statement with a very simple body that is compiled down to 1 CPU instruction. The if-clause would be compiled into a conditional jump forward by one instruction. If the CPU predicts the branch to not be taken, then it will begin executing the if-body instruction, and can immediately begin executing the following instructions. Now, once evaluation of the if-condition has reached the end of the pipeline, which could be, say, 12 cycles later, the CPU now knows whether it's prediction was right or wrong. If it mispredicted, and the branch was actually taken, then the CPU really only has to discard 1 instruction from the pipeline (the one in the if-body). However, if it flushes the entire pipeline, then all the work that was done on the following instructions was wasted as well, and will have to be repeated for no reason. That's a lot of wasted cycles on a deeply pipelined architecture.

那么,现代的CPU是否有任何机制可以只丢弃短if体内的少量指令呢?还是真的冲洗了整个管道?如果是后者,那么我想使用条件移动指令会获得更好的性能。顺便说一句,有谁知道现代编译器是否擅长将简短的if语句转换为cmov指令?

So do modern CPUs have any mechanism to discard only the few instructions that are inside of a short if-body? Or does it really flush the entire pipeline? If it's the latter, then I suppose using a conditional move instruction would get better performance. As an aside, does anyone know if modern compilers are good at converting short if-statements into cmov instructions?

推荐答案

大多数通用处理器确实会在分支预测错误的情况下刷新管道。除了对分支预测的广泛研究(以及对分支预测的广泛研究)以外,条件分支对性能的负面影响还激发了积极执行的建议(在这两个路径中均执行,随后选择正确的路径)和动态预测(在其中确定分支阴影中的指令)。以及其他技术)。 ( Mark Smotherman关于急切执行的页面提供了一些详细信息和参考。我会添加Hyesoon Kim等人的重要论文 Wish Branches:结合条件分支和谓词进行自适应谓词执行,2005年。

Most general purpose processors do flush the pipeline on a branch misprediction. The negative performance impact of conditional branches has motivated proposals for eager execution (where both paths are executed and the correct path selected later) and dynamic predication (where instructions in the branch shadow are predicated) in addition to extensive research on branch prediction (as well as other techniques). (Mark Smotherman's page on eager execution provides some details and references. I would add Hyesoon Kim et al.'s "Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution", 2005, as a significant paper.)

IBM的POWER7似乎是第一个主流处理器实现比预取备用路径(即,渴望获取)更复杂的功能,并且仅处理单个指令情况。 (POWER7使用分支预测置信度估计来选择是使用谓词还是使用预测。)

IBM's POWER7 seems to be the first mainstream processor to implement anything more sophisticated than prefetching an alternate path (i.e., eager fetch), and it only handles the single instruction case. (POWER7 uses a branch prediction confidence estimate to choose whether to predicate or use prediction.)

急于执行具有明显的资源使用爆炸性问题。即使基于分支预测的置信度,推测深度和资源可用性(前端可用的信息)的选择性渴望,也可以更有效地推测单个路径的更深层次。发现多个路径的连接点并避免过多的冗余计算也会增加复杂性。 (理想情况下,与控制无关的操作将只执行一次,并且将优化连接和数据流,但是这种优化会增加复杂性。)

Eager execution has the obvious problem of exploding resource use. Even with selective eagerness based on branch prediction confidence, speculation depth, and resource availability (information available to the front-end), it can easily be more effective to speculate deeper down a single path. Discovering the joining points of multiple paths and avoiding excessive redundant computation can also add complexity. (Ideally, control independent operations would only be executed once and joining and data flow would be optimized, but such optimization adds complexity.)

对于深度流水线的有序处理器,将短前向分支预测为未采用,而仅在管道中向后刷新到实际采用的分支时才采用该分支的目标指令,这似乎很有吸引力。如果一次仅在管道中允许一个这样的分支(其他分支使用预测),则向每条指令添加一位可以控制该指令是转换为nop还是执行。 (如果只处理一条指令的分支情况,则允许在管道中进行多个分支可能并不是特别复杂。)

For a deeply pipelined in-order processor, it may seem attractive to predict short forward branches as not taken and only flush backward in the pipeline to the instruction targeted by the taken branch when the branch is actually taken. If only one such branch is allowed in the pipeline at a time (other branches uses prediction), adding a single bit to each instruction could control whether it is converted to a nop or executed. (If only the case of a single instruction being branched over is handled, allowing multiple branches in the pipeline might not be especially complex.)

这与annul-如果采用分支延迟时隙。如果未采用 ,则MIPS具有可能分支指令,该指令将被废止,并且在修订版2.62中将其标记为过时。尽管这样做的某些理由大概是将实现与接口分离以及恢复指令编码空间的愿望,但这一决定也暗示了该概念存在一些问题。

This would be similar to annul-if-taken branch delay slots. MIPS has "Branch Likely" instructions that annulled if not taken, and these are marked as obsolete in Revision 2.62. While some of the justification for such is presumably to separate implementation from interface and the desire to recover instruction encoding space, this decision also hints that the concept has some issues.

如果对所有短前向分支都执行了此操作,当正确预测分支被采用时,它将丢弃指令。 (请注意,如果采用的分支始终经历提取重定向延迟,则这种惩罚可能会更少,这在深度流水线处理器中的多周期指令高速缓存访​​问中更有可能发生。在这种情况下,就像没有分支一样进行获取与正确预测的采用分支具有相同的性能,但是,有人可能会争辩说处理器在特殊情况下使用了如此短的采用分支以最大程度地减少了获取气泡。)

If this was done for all short forward branches, it would throw away instructions when the branch was correctly predicted as taken. (Note that this penalty could be less if taken branches always experience a delay in fetch redirection, which would be more likely with a multi-cycle instruction cache access in a deeply pipelined processor. In that case, fetching as if there was no branch could have the same performance as a correctly predicted taken branch. However, one could argue that the processor special case such short taken branches to minimize such fetch bubbles.)

考虑一个标量流水线(每个周期的非分支指令等于1.0),在第八级末尾具有分支分辨率,并且在正确预测的已采用分支上没有取回重定向损失,从而处理单指令分支转换。假设对于这样的短前向分支(指令的2%,占30%的时间),分支预测器的准确度为75%(无方向偏差),其他分支的预测精度为93%(指令的18%)。对于被错误预测为已使用的短分支(该分支的17.5%;指令的0.35%),将保存八个周期;如果被错误预测为未采用的短分支将被保存七个周期(7.2%; 0.144%),而如果正确则将丢失一个周期。预测为接受(22.5%; 0.45%)。每条指令总共可以节省0.03358个周期。如果不进行这种优化,则每条指令的周期将为1.2758。

As an example consider a scalar pipeline (non-branch instructions per cycle equal to 1.0) with branch resolution at the end of the eighth stage and no fetch redirection penalty on correctly predicted taken branches, handling single-instruction branch-overs. Assume 75% branch predictor accuracy (unbiased by direction) for such short forward branches (2% of instructions, taken 30% of the time) and 93% accuracy for other branches (18% of instructions). Eight cycles would be saved for short branches that would be mispredicted as taken (17.5% of such branches; 0.35% of instructions), seven cycles when mispredicted as not taken (7.2%; 0.144%), and one cycle would be lost when correctly predicted as taken (22.5%; 0.45%). In total 0.03358 cycles per instruction would be saved. Without this optimization the cycles per instruction would be 1.2758.

(虽然上述数字仅是举例,但除了1.0 IPC对于非-分支指令。提供较小的循环高速缓存将减少错误预测的损失(并在短循环中节省功耗),因为指令高速缓存的访问可能是八个周期中的三个周期。添加高速缓存未命中的影响将进一步减少该分支的百分比提高

(While the above numbers are just for example, they are probably not far from reality except for the 1.0 IPC for non-branch instructions. Providing a small loop cache would reduce the misprediction penalty (and save power in short loops) because instruction cache access would probably be three of the eight cycles. Adding the effect of cache misses would further reduce the percentage improvement from this branch optimization. Avoiding the overhead for predicted "strongly taken" short branches might be worthwhile.)

为了使设计倾向于使用狭窄和较浅的管道,可以避免不必要的开销。并且更喜欢简单性(以降低设计,功耗和面积成本)。由于指令集可能在许多短分支情况下都支持无分支代码,因此进一步优化这一方面的动机就进一步降低。

In order designs tend to use narrow and shallower pipelines and prefer simplicity (for lower design, power, and area costs). Since the instruction set is likely to support branchless code for many short-branch cases, the incentive to optimize this aspect is further decreased.

对于乱序实现,由于处理器希望能够执行以后的非依赖性指令,因此必须确定潜在的分支指令。谓词引入了额外的数据依赖性,必须对其进行检查以进行调度。指令调度程序通常每条指令仅提供两个比较器并拆分条件移动(一条简单的指令仅具有三个数据流操作数:旧值,替代值和条件;谓词寄存器-寄存器add将具有四个操作数(有其他方法可以解决此问题,但是这个答案已经很长了。)

For out-of-order implementations, the potentially branched over instructions would have to be predicated since the processor would want to be able to execute later non-dependent instructions. Predication introduces an additional data dependency which must be checked for scheduling. It is common for instruction schedulers to provide only two comparators per instruction and to split a conditional move (a simple instruction with only three data flow operands: the old value, the alternative value, and the condition; a predicated register-register add would have four operands. (There are alternative ways of addressing this issue, but this answer is already long.)

乱序的实现在分支时也不会停止条件不可用。这是控制依赖和数据依赖之间的折衷。通过精确的分支预测,控制依赖非常便宜,但是数据依赖可以阻止等待数据操作数的前进。(当然,使用布尔值数据依赖性,价值预测变得更具吸引力。在某些情况下,使用谓词预测可能是合乎需要的,并且比使用动态成本和置信度估计的简单谓词具有优势。)

An out-of-order implementation would also not stall when a branch condition is not available. This is a tradeoff between a control dependency and a data dependency. With accurate branch prediction a control dependency is extremely inexpensive, but a data dependency can hold up forward progress waiting on data operands. (Of course, with a boolean data dependency, value prediction becomes somewhat more attractive. Using predicate prediction might be desirable in some cases and would have the advantage over simple predication of using dynamic cost and confidence estimates.)

(我也许是在说ARM选择放弃在64位AArch64中使用广泛的谓词。尽管其中很大一部分用于指令编码,但是高性能实现的谓词的好处可能相对较低。)

(It is perhaps telling that ARM chose to drop extensive predication in 64-bit AArch64. While a large part of this is for instruction encoding, the benefit of predication for high-performance implementations is presumably relatively low.)

无分支代码与分支代码的性能取决于分支的可预测性和其他因素(包括(如果采取的话,则将对重定向访存带来任何不利影响)),但是编译器很难确定分支的可预测性分店。即使简档数据通常也仅提供分支频率,该分支频率可以给出可悲的可预测性视图,因为这并不考虑使用局部或全局历史的分支预测器。编译器也不是很清楚数据可用性和其他动态方面的时机。如果条件晚于用于计算的操作数可用,则用数据依赖关系(预测)替换控制依赖关系(分支预测)可能会降低性能。无分支代码也可能会引入更多的实时值,从而可能增加寄存器溢出和填充开销。

The performance of branchless versus branching code depends on the predictability of the branch and other factors (including, if taken, any penalty for redirecting fetch), but it is difficult for the compiler to determine the predictability of a branch. Even profile data typically only provides branch frequencies which can give a pessimistic view of predictability since such does not account for the branch predictor using local or global history. A compiler is also not perfectly aware of timing of data availability and other dynamic aspects. If the condition is available later than the operands used for computation, then replacing a control dependence (branch prediction) with a data dependence (predication) could degrade performance. Branchless code may also introduce more live values, potentially adding register spill and fill overhead.

更复杂的是,大多数仅提供条件移动或选择指令的指令集无法提供有条件的商店。尽管可以通过有条件的移动来选择安全的,被忽略的存储位置来解决此问题,但这似乎没有什么吸引力。另外,条件移动指令通常比简单的算术指令更昂贵。加法和条件移动可能需要三个周期,而正确预测的分支和加法将花费零(如果加法分支了)或一个周期。

Complicating this further, most instruction sets that only provide conditional move or select instructions do not provide a conditional store. While this can be worked around by using conditional move to select a safe, ignored storage location, such seems an unattractive complication. In addition, conditional move instructions are often more expensive than simple arithmetic instructions; an addition and conditional move might take three cycles where a correctly predicted branch and addition would take zero (if addition is branched over) or one cycle.

更复杂的是分支预测器通常会忽略谓词操作。如果稍后保留的分支与已删除分支的条件相关,则该分支的误预测率可能会增加。 (可以使用谓词预测来保留此类已删除分支的预测器效果。)

A further complication is that predicated operations are generally ignored by the branch predictor. If a later retained branch correlates with the condition of the removed branch, the branch misprediction rate may increase for that later branch. (Predicate prediction could be used to retain the predictor effects of such removed branches.)

随着对向量化的日益重视,自分支以来,无分支代码的使用变得更加重要基于代码的代码限制了对整个向量使用操作的能力。

With the increased emphasis on vectorization, the use of branchless code becomes even more significant since branch-based code constrains the ability to use operations on an entire vector.

这篇关于即使对于很短的if语句主体,分支预测错误也会刷新整个管道吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆