Does profile-guided optimization done by compiler notably hurt cases not covered with profiling dataset?

Question

This question is not specific to C++. AFAIK certain runtimes like the Java RE can do profile-guided optimization on the fly; I'm interested in that too.

MSDN describes PGO like this:

  1. I instrument my program and run it under the profiler, then
  2. the compiler uses the data gathered by the profiler to automatically reorganize branching and loops so that branch misprediction is reduced, and frequently run code is packed compactly to improve its locality

Now, obviously, the profiling result will depend on the dataset used.

With normal manual profiling and optimization I'd find some bottlenecks, improve them, and likely leave all the other code untouched. PGO seems to improve often-run code at the expense of making rarely run code slower.

Now what if that slowed-down code is run often on another dataset that the program will see in the real world? Will the program's performance degrade compared to a program compiled without PGO, and how bad is the degradation likely to be? In other words, does PGO really improve my code's performance for the profiling dataset and possibly worsen it for other datasets? Are there any real examples with real data?

Answer

Disclaimer: I have not done more with PGO than read up on it and try it once with a sample project for fun. A lot of the following is based on my experience with the "non-PGO" optimizations and educated guesses. TL;DR below.

This page lists the optimizations done by PGO. Let's look at them one by one (grouped by impact):

Inlining: For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.

Register Allocation: Optimizing with profile data results in better register allocation.

Virtual Call Speculation: If a virtual call, or another call through a function pointer, frequently targets a certain function, profile-guided optimization can insert a conditionally executed direct call to the frequently targeted function, and that direct call can be inlined.

These apparently improve the prediction of whether some optimizations pay off. There is no direct tradeoff for non-profiled code paths.
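As a hand-written sketch of what virtual call speculation amounts to (all names here are hypothetical, and a real compiler compares vtable pointers and emits the guard inline rather than using `dynamic_cast`):

```cpp
struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};

struct Circle : Shape {
    int r;
    explicit Circle(int r) : r(r) {}
    int area() const override { return 3 * r * r; }  // crude pi == 3
};

// Without PGO: a plain indirect call through the vtable.
int area_generic(const Shape& s) { return s.area(); }

// With PGO (conceptually): the profile showed most calls target Circle,
// so a guarded direct call is inserted; the direct call can be inlined.
int area_speculated(const Shape& s) {
    if (const Circle* c = dynamic_cast<const Circle*>(&s))
        return c->Circle::area();  // direct, inlinable call
    return s.area();               // fallback: normal virtual dispatch
}
```

The common case pays one cheap compare; the rare case keeps the ordinary virtual dispatch.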

Basic Block Optimization: Basic block optimization lets commonly executed basic blocks that execute temporally close together within a given frame be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.

Function Layout: Based on the call graph and profiled caller/callee behavior, functions that tend to lie along the same execution path are placed in the same section.

Dead Code Separation: Code that is not called during profiling is moved to a special section appended at the end of the set of sections. This effectively keeps that section out of the often-used pages.

EH Code Separation: The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only under exceptional conditions.

All of this may reduce the locality of non-profiled code paths. In my experience, the impact would be noticeable or severe if such a code path has a tight loop that exceeds the L1 code cache (and maybe even thrashes L2). That sounds exactly like a path that should have been included in a PGO profile :)

Dead code separation can have a huge impact - both ways - because it can reduce disk access.

If you rely on exceptions being fast, you are doing it wrong.

Size/Speed Optimization: Functions where the program spends a lot of time can be optimized for speed.

The rule of thumb nowadays is to "optimize for size by default, and only optimize for speed where needed (and verify it helps)". The reason is again the code cache: in most cases, smaller code will also be faster code. So this, in a way, automates what you should be doing manually. Compared to a global speed optimization, this would slow down non-profiled code paths only in very atypical cases ("weird code" or a target machine with unusual cache behavior).
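The manual version of that size/speed decision can be sketched with per-function optimization attributes, assuming GCC/Clang (MSVC's analog is `#pragma optimize("t")`/`#pragma optimize("s")`; the function names here are hypothetical):

```cpp
// Cold path: favor size, which is friendlier to the code cache.
// (GCC honors the attribute; Clang may warn and ignore it.)
__attribute__((optimize("Os")))
int rare_validation(int x) { return x < 0 ? -1 : x; }

// Hot path: favor speed, as PGO would decide automatically.
__attribute__((optimize("O3")))
int hot_kernel(int x) { return x * x; }
```

PGO makes this choice per function from the profile instead of relying on the programmer's guess.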

Conditional Branch Optimization: With value probes, profile-guided optimizations can find whether a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions, where the optimizer can order the if/else so that either the if or the else block is placed first, depending on which block is more frequently true.

I would file that under "improved prediction", too, unless you feed the wrong PGO information.
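A sketch of the switch transformation described above (the names and values are hypothetical; the profile is assumed to have shown that `op == 0` dominates):

```cpp
// Generic dispatch, as written by the programmer.
int dispatch(int op) {
    switch (op) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        default: return -1;
    }
}

// What PGO effectively produces: the hot value is tested before the
// switch, so the common case takes one predictable compare-and-branch.
int dispatch_pgo(int op) {
    if (op == 0)
        return 10;          // hot value, pulled out of the switch
    switch (op) {           // cold values keep the generic dispatch
        case 1: return 20;
        case 2: return 30;
        default: return -1;
    }
}
```

Both versions are semantically identical; only the layout of the comparisons changes.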

The typical cases where this can pay off a lot are run-time parameter/range validation and similar paths that should never be taken in a normal execution.

The worst case:

if (x > 0) DoThis(); else DoThat();

in a relevant tight loop and profiling only the x > 0 case.

Memory Intrinsics: The expansion of intrinsics can be decided better if it can be determined whether an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.

Again, mostly better information, with a small possibility of penalizing untested data.

Example: this is all an "educated guess", but I think it's quite illustrative for the entire topic.

Assume you have a memmove that is always called on well-aligned, non-overlapping buffers with a length of 16 bytes.

A possible optimization is verifying these conditions and using inlined MOV instructions for this case, calling the general memmove (handling alignment, overlap and odd lengths) only when the conditions are not met.
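A sketch of that specialization (`fast_move16` is a hypothetical name; the compiler would emit the guard and the two load/store pairs inline rather than as a function):

```cpp
#include <cstdint>
#include <cstring>

void* fast_move16(void* dst, const void* src, std::size_t n) {
    const std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst);
    const std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src);
    const bool aligned    = d % 16 == 0 && s % 16 == 0;
    const bool no_overlap = d + n <= s || s + n <= d;
    if (n == 16 && aligned && no_overlap) {
        // Expected case: two 8-byte loads and stores instead of a call.
        std::uint64_t lo, hi;
        std::memcpy(&lo, src, 8);
        std::memcpy(&hi, static_cast<const char*>(src) + 8, 8);
        std::memcpy(dst, &lo, 8);
        std::memcpy(static_cast<char*>(dst) + 8, &hi, 8);
        return dst;
    }
    // Conditions not met: fall back to the general-purpose memmove.
    return std::memmove(dst, src, n);
}
```

The checks add a handful of instructions on every call, which is exactly the tradeoff discussed next.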

The benefits can be significant in a tight loop copying structs around, as you improve locality and reduce expected-path instructions, likely with more chances for pairing/reordering.

The penalty is comparatively small, though: in the general case without PGO, you would either always call the full memmove or inline the full memmove implementation. The optimization adds a few instructions (including a conditional jump) to something rather complex; I'd assume a 10% overhead at most. In most cases, these 10% will be below the noise due to cache access.

However, there is a very slight chance of significant impact if the unexpected branch is taken frequently and the additional instructions for the expected case, together with the instructions for the default case, push a tight loop out of the L1 code cache.

Note that you are already at the limits of what the compiler could do for you. The additional instructions can be expected to be a few bytes, compared to a few K of code cache. A static optimizer could meet the same fate, depending on how well it can hoist invariants - and how much you let it.

Conclusion:

  • Many of these optimizations are neutral.
  • Some optimizations can have a slight negative effect on non-profiled code paths.
  • The impact is usually much smaller than the possible gains.
  • Very rarely, a small impact can be emphasized by other pathological factors.
  • Few optimizations (namely, the layout of code sections) can have a large impact, but again the possible gains notably outweigh it.

My gut feeling would further claim that

  • A static optimizer, on the whole, would be at least equally likely to create a pathological case
  • It would be pretty hard to actually destroy performance even with bad PGO input.

At that level, I would be much more afraid of PGO implementation bugs/shortcomings than of failed PGO optimizations.
