How has CPU architecture evolution affected virtual function call performance?




Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz.

It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor, viz. instruction reordering, cache preloading, dependency interleaving, etc.

The downside was that any deviation from the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function through a pointer (!) and this was apparently normal.

This is not a negligible "don't worry about it" number! Bear in mind that "good design" normally means "factor your functions as much as possible" and "encode semantics in the data types" which often implies virtual interfaces.

The trade-off is that code which doesn't perform such operations might get more than two instructions per cycle. These are numbers one wants to worry about when writing high-performance C++ code which is heavy on the object design and light on the number crunching.

I understand that the long-CPU-pipeline trend has been reversing as we enter the low-power era. Here's my question:

Does the latest generation of x86-compatible processors still suffer massive penalties for virtual function calls, bad branch predictions, etc?

Solution

AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function

Huh... so large...

There is an "indirect branch prediction" mechanism, which helps to predict a virtual function jump IF the same indirect jump was taken some time ago. There is still a penalty for the first and for mispredicted virtual function jumps.

Support varies from a simple scheme ("predicted correctly if and only if the previous indirect branch was exactly the same") to very complex two-level predictors with tens or hundreds of entries, which can detect a periodic alternation between 2-3 target addresses for a single indirect jmp instruction.

There was a lot of evolution here...

http://arstechnica.com/hardware/news/2006/04/core.ars/7

first introduced with the Pentium M: ... indirect branch predictor.

The indirect branch predictor

Because indirect branches load their branch targets from a register, instead of having them immediately available as is the case with direct branches, they're notoriously difficult to predict. Core's indirect branch predictor is a table that stores history information about the preferred target addresses of each indirect branch that the front end encounters. Thus when the front-end encounters an indirect branch and predicts it as taken, it can ask the indirect branch predictor to direct it to the address in the BTB that the branch will probably want.

http://www.realworldtech.com/page.cfm?ArticleID=rwt051607033728&p=3

Indirect branch prediction was first introduced with Intel’s Prescott microarchitecture and later the Pentium M.

between 16-50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch misprediction is for many of the newer scripting or high level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect branch culprits include virtual functions (used in C++) and calls to function pointers.

http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5

AMD has adopted some of these refinements; for instance adding indirect branch predictor arrays in Barcelona and later processors. However, the K8 has older and less accurate branch predictors than the Core 2.

http://www.agner.org/optimize/microarchitecture.pdf

3.12 Indirect jumps on older processors

Indirect jumps, indirect calls, and returns may go to a different address each time. The prediction method for an indirect jump or indirect call is, in processors older than PM and K10, simply to predict that it will go to the same target as last time it was executed.

and the same pdf, page 14

Indirect jump prediction

An indirect jump or call is a control transfer instruction that has more than two possible targets. A C++ program can generate an indirect jump or call with... a virtual function. An indirect jump or call is generated in assembly by specifying a register or a memory variable or an indexed array as the destination of a jump or call instruction. Many processors make only one BTB entry for an indirect jump or call. This means that it will always be predicted to go to the same target as it did last time. As object oriented programming with polymorphous classes has become more common, there is a growing need for predicting indirect calls with multiple targets. This can be done by assigning a new BTB entry for every new jump target that is encountered. The history buffer and pattern history table must have space for more than one bit of information for each jump incident in order to distinguish more than two possible targets. The PM is the first x86 processor to implement this method. The prediction rule on p. 12 still applies with the modification that the theoretical maximum period that can be predicted perfectly is m^n, where m is the number of different targets per indirect jump, because there are m^n different possible n-length subsequences. However, this theoretical maximum cannot be reached if it exceeds the size of the BTB or the pattern history table.

Agner's manual has a longer description of the branch predictors in many modern CPUs, and of how the predictors evolved across each manufacturer's CPUs (x86/x86_64).

There are also a lot of theoretical "indirect branch prediction" methods (look in Google Scholar); even the wiki says a few words about it: http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_indirect_jumps

For Atom, from Agner's microarchitecture manual:

Prediction of indirect branches

The Atom has no pattern predictor for indirect branches according to my tests. Indirect branches are predicted to go to the same target as last time.

So, for low-power parts, indirect branch prediction is not so advanced. The same is true of the Via Nano:

Indirect jumps are predicted to go to the same target as last time.

I think that the shorter pipelines of low-power x86 chips carry a lower misprediction penalty, around 7-20 ticks.
