Effects of branch prediction on performance?


Question


When I'm writing a tight loop that needs to be fast, I am often bothered by thoughts about how the processor's branch prediction is going to behave. For instance, I try my best to avoid having an if statement in the innermost loop, especially one whose outcome is not somewhat uniform (say, one that evaluates to true or false at random).

I tend to do that because of the somewhat common knowledge that the processor prefetches instructions, and if it turns out that a branch was mispredicted, the prefetched work is useless.

My question is: is this really an issue with modern processors? How good can branch prediction be expected to be?
What coding patterns can be used to make it better?

(For the sake of the discussion, assume that I am beyond the "premature optimization is the root of all evil" phase.)

Solution

Branch prediction is pretty darned good these days. But that doesn't mean the penalty of branches can be eliminated.

In typical code, you probably get well over 99% correct predictions, and yet the performance hit can still be significant. There are several factors at play in this.

One is simple branch latency. On a common PC CPU, that might be on the order of 12 cycles for a mispredict versus 1 cycle for a correctly predicted branch. For the sake of argument, let's assume that all your branches are correctly predicted; then you're home free, right? Not quite.

The simple existence of a branch inhibits a lot of optimizations. The compiler is unable to reorder code efficiently across branches. Within a basic block (that is, a block of code that is executed sequentially, with no branches, one entry point and one exit), it can reorder instructions as it likes, as long as the meaning of the code is preserved, because they'll all be executed sooner or later. Across branches, it gets trickier. We could move these instructions down to execute after this branch, but then how do we guarantee they get executed? Put them in both branches? That's extra code size, that's messy too, and it doesn't scale if we want to reorder across more than one branch.

Branches can still be expensive, even with the best branch prediction. Not just because of mispredicts, but because instruction scheduling becomes so much harder.

This also implies that rather than the number of branches, the important factor is how much code goes in the block between them. A branch on every other line is bad, but if you can get a dozen lines into a block between branches, it's probably possible to get those instructions scheduled reasonably well, so the branch won't restrict the CPU or compiler too much.

But in typical code, branches end up being essentially free, because there usually aren't that many of them clustered closely together in the performance-critical parts.

