How has CPU architecture evolution affected virtual function call performance?


Problem Description

Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz.

It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor viz instruction reordering, cache preloading, dependency interleaving, etc.

The downside was that any deviation from the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function through a pointer (!) and this was apparently normal.

This is not a negligible "don't worry about it" number! Bear in mind that "good design" normally means "factor your functions as much as possible" and "encode semantics in the data types" which often implies virtual interfaces.
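To make that concrete, here is a minimal sketch (the Shape/Circle types are invented purely for illustration) of the kind of call this question is about: a call through a virtual interface compiles to an indirect call through the vtable, while a call whose concrete type is known can be devirtualized into a direct call.

```cpp
#include <cstdio>

// Hypothetical types, invented purely for illustration.
struct Shape {
    virtual double area() const = 0;   // dispatched through the vtable
    virtual ~Shape() = default;
};

struct Circle final : Shape {
    double r;
    explicit Circle(double radius) : r(radius) {}
    double area() const override { return 3.14159 * r * r; }
};

// Indirect call: the target is loaded from the vtable at run time,
// so the CPU has to predict where the jump will go.
double polymorphic_area(const Shape& s) { return s.area(); }

// With the concrete (final) type visible, the compiler can devirtualize
// and even inline the call -- no indirect branch left to predict.
double monomorphic_area(const Circle& c) { return c.area(); }

int main() {
    Circle c{2.0};
    std::printf("%f %f\n", polymorphic_area(c), monomorphic_area(c));
}
```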

The trade-off is that code which doesn't perform such operations might get more than two instructions per cycle. These are numbers one wants to worry about when writing high-performance C++ code which is heavy on the object design and light on the number crunching.

I understand that the long-CPU-pipeline trend has been reversing as we enter the low-power era. Here's my question:

Does the latest generation of x86-compatible processors still suffer massive penalties for virtual function calls, bad branch predictions, etc.?

Recommended Answer

AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function

Hmm... so big...

There is an "indirect branch prediction" mechanism that helps predict a virtual function jump, if the same indirect jump was taken some time ago. There is still a penalty for the first and for mispredicted virtual function jumps.

Support varies from a simple "predict correctly if and only if the previous indirect branch went to exactly the same target" to very complex two-level predictors with tens or hundreds of entries that can detect periodic alternation between two or three target addresses for a single indirect jmp instruction.
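As a hedged illustration of where that difference shows up (the Op/Add1/Sub1 hierarchy below is invented for the example), here is a minimal C++ sketch of the two kinds of call sites: one whose indirect branch always hits the same override, and one that alternates between two overrides, which a plain "same target as last time" predictor mispredicts on every call.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical hierarchy, invented for the example.
struct Op {
    virtual int apply(int x) const = 0;
    virtual ~Op() = default;
};
struct Add1 : Op { int apply(int x) const override { return x + 1; } };
struct Sub1 : Op { int apply(int x) const override { return x - 1; } };

int run(const std::vector<const Op*>& ops) {
    int acc = 0;
    for (const Op* op : ops)
        acc = op->apply(acc);                    // one indirect call per iteration
    return acc;
}

int main() {
    Add1 add; Sub1 sub;

    // Monomorphic call site: the indirect branch always goes to Add1::apply,
    // so even a "same target as last time" predictor is right after the first call.
    std::vector<const Op*> same(1000, &add);

    // Alternating call site: Add1, Sub1, Add1, Sub1, ... A last-target-only
    // predictor mispredicts every call; a two-level/pattern predictor can learn
    // the period-2 alternation.
    std::vector<const Op*> alternating;
    for (int i = 0; i < 1000; ++i)
        alternating.push_back(i % 2 == 0 ? static_cast<const Op*>(&add)
                                         : static_cast<const Op*>(&sub));

    std::printf("%d %d\n", run(same), run(alternating));
}
```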

There was a lot of evolution here...

http://arstechnica.com/hardware/news/2006/04/core.ars/7

first introduced with the Pentium M: ... indirect branch predictor.

Indirect branch predictor

Because indirect branches load their branch targets from a register, instead of having them immediately available as is the case with direct branches, they're notoriously difficult to predict. Core's indirect branch predictor is a table that stores history information about the preferred target addresses of each indirect branch that the front end encounters. Thus when the front-end encounters an indirect branch and predicts it as taken, it can ask the indirect branch predictor to direct it to the address in the BTB that the branch will probably want.
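To make that mechanism concrete, here is a deliberately simplified toy model (not any real CPU's BTB design; all names and addresses below are made up) of a "last target" indirect predictor, showing why an alternating target sequence defeats it and why the history-based table described above helps.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Toy model of a "same target as last time" indirect-branch predictor.
// This is an illustrative simplification, not a description of any real BTB.
struct LastTargetPredictor {
    std::unordered_map<std::uint64_t, std::uint64_t> last;  // branch address -> last target seen

    bool predict_and_update(std::uint64_t branch, std::uint64_t actual) {
        auto it = last.find(branch);
        bool correct = (it != last.end() && it->second == actual);
        last[branch] = actual;                               // train on the real outcome
        return correct;
    }
};

int main() {
    LastTargetPredictor pred;
    const std::uint64_t call_site = 0x401000;  // one indirect call instruction (made-up address)

    // Target sequence A, B, A, B, ... -- e.g. two different vtable entries.
    std::vector<std::uint64_t> targets;
    for (int i = 0; i < 100; ++i)
        targets.push_back(i % 2 == 0 ? 0x500100 : 0x500200);

    int correct = 0;
    for (std::uint64_t t : targets)
        correct += pred.predict_and_update(call_site, t);

    // A last-target predictor gets essentially none of an alternating pattern
    // right; a history-based predictor like Core's could learn it.
    std::printf("correct predictions: %d / %zu\n", correct, targets.size());
}
```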

http://www.realworldtech.com/page.cfm?ArticleID=rwt051607033728&p=3

Indirect branch prediction was first introduced with Intel’s Prescott microarchitecture and later the Pentium M.

between 16-50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch misprediction is for many of the newer scripting or high level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect branch culprits include virtual functions (used in C++) and calls to function pointers.
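A minimal sketch of why interpreters show up in that list (the opcodes and handlers below are invented): a classic bytecode dispatch loop calls through a function-pointer table, so every interpreted instruction executes an indirect call whose target depends on the bytecode.

```cpp
#include <cstdio>
#include <vector>

// Toy bytecode interpreter, invented for illustration.
enum Opcode { OP_INC, OP_DEC, OP_DOUBLE, OP_COUNT };

using Handler = void (*)(long&);

void op_inc(long& acc)    { acc += 1; }
void op_dec(long& acc)    { acc -= 1; }
void op_double(long& acc) { acc *= 2; }

int main() {
    // Dispatch table: each opcode maps to a handler function.
    const Handler handlers[OP_COUNT] = { op_inc, op_dec, op_double };

    const std::vector<Opcode> program = { OP_INC, OP_INC, OP_DOUBLE, OP_DEC, OP_DOUBLE };

    long acc = 0;
    for (Opcode op : program)
        handlers[op](acc);  // indirect call: the target depends on the bytecode,
                            // exactly the kind of branch these predictors are for
    std::printf("%ld\n", acc);  // prints 6
}
```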

http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5

AMD has adopted some of these refinements; for instance adding indirect branch predictor arrays in Barcelona and later processors. However, the K8 has older and less accurate branch predictors than the Core 2.

http://www.agner.org/optimize/microarchitecture.pdf

3.12 Indirect jumps on older processors

Indirect jumps, indirect calls, and returns may go to a different address each time. The prediction method for an indirect jump or indirect call is, in processors older than PM and K10, simply to predict that it will go to the same target as last time it was executed.

and the same pdf, page 14

Indirect jump prediction

An indirect jump or call is a control transfer instruction that has more than two possible targets. A C++ program can generate an indirect jump or call with... a virtual function. An indirect jump or call is generated in assembly by specifying a register or a memory variable or an indexed array as the destination of a jump or call instruction. Many processors make only one BTB entry for an indirect jump or call. This means that it will always be predicted to go to the same target as it did last time. As object oriented programming with polymorphous classes has become more common, there is a growing need for predicting indirect calls with multiple targets. This can be done by assigning a new BTB entry for every new jump target that is encountered. The history buffer and pattern history table must have space for more than one bit of information for each jump incident in order to distinguish more than two possible targets. The PM is the first x86 processor to implement this method. The prediction rule on p. 12 still applies with the modification that the theoretical maximum period that can be predicted perfectly is m^n, where m is the number of different targets per indirect jump, because there are m^n different possible n-length subsequences. However, this theoretical maximum cannot be reached if it exceeds the size of the BTB or the pattern history table.
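As a worked example of that rule (numbers chosen purely for illustration): for an indirect call that rotates among m = 3 targets and a predictor keeping a history of n = 4 recent targets, there are 3^4 = 81 possible length-4 subsequences, so a repeating pattern with a period of up to 81 could in theory be predicted perfectly, provided the BTB and pattern history table are large enough to hold it.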

Agner's manual has a longer description of the branch predictors in many modern CPUs and of how each manufacturer's predictors have evolved (x86/x86_64).

There are also a lot of theoretical "indirect branch prediction" methods (search Google Scholar); even the wiki says a few words about it: http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_indirect_jumps

For the Atom, from Agner's microarchitecture manual:

Prediction of indirect branches

The Atom has no pattern predictor for indirect branches according to my tests. Indirect branches are predicted to go to the same target as last time.

So, for low power, indirect branch prediction is not so advanced. The same goes for the Via Nano:

Indirect jumps are predicted to go to the same target as last time.

I think the shorter pipelines of low-power x86 CPUs have a lower penalty, around 7-20 cycles.
