G++ optimization beyond -O3/-Ofast


Problem description

The problem

We have a mid-sized program for a simulation task that we need to optimize. We have already done our best to optimize the source, to the limit of our programming skills, including profiling with Gprof and Valgrind.

When finally finished, we want to run the program on several systems probably for some months. Therefore we are really interested in pushing the optimization to the limits.

All systems will run Debian/Linux on relatively new hardware (Intel i5 or i7).

The question

What optimization options are available in recent versions of g++, beyond -O3/-Ofast?

We are also interested in costly minor optimizations that will pay off in the long run.

What we use right now

Right now we use the following g++ optimization options:


  • -Ofast: Highest "standard" optimization level. The included -ffast-math did not cause any problems in our calculations, so we decided to go for it, despite the non-compliance with the standard.
  • -march=native: Enables the use of all CPU-specific instructions.
  • -flto: Allows link-time optimization across different compilation units.

Answer

Most of the answers suggest alternative solutions, such as different compilers or external libraries, which would most likely bring a lot of rewriting or integration work. I will try to stick to what the question is asking, and focus on what can be done with GCC alone, by activating compiler flags or doing minimal changes to the code, as requested by the OP. This is not a "you must do this" answer, but more a collection of GCC tweaks that have worked out well for me and that you can give a try if they are relevant in your specific context.

Caveats regarding the original question

Before going into the details, a few warnings regarding the question, typically for people who will come along, read it and say "the OP is optimising beyond O3, I should use the same flags as he does!".



  • -march=native enables usage of instructions specific to a given CPU architecture, and that are not necessarily available on a different architecture. The program may not work at all if run on a system with a different CPU, or be significantly slower (as this also enables mtune=native), so be aware of this if you decide to use it. More information here.
  • -Ofast, as you stated, enables some non-standard-compliant optimisations, so it should be used with caution as well. More information here.

Some more GCC flags to try out



  • -Ofast enables -ffast-math, which in turn enables -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range. You can go even further with floating-point calculation optimisations by selectively adding extra flags such as -fno-signed-zeros, -fno-trapping-math and others. These are not included in -Ofast and can give some additional performance increases on calculations, but you must check whether they actually benefit you and don't break any calculations.
  • GCC also features a large number of other optimisation flags which aren't enabled by any "-O" option. They are listed as "experimental options that may produce broken code", so again, they should be used with caution, and their effects checked both by testing for correctness and by benchmarking. Nevertheless, I do often use -frename-registers; this option has never produced unwanted results for me and tends to give a noticeable performance increase (i.e. one that can be measured when benchmarking). This is the type of flag that is very dependent on your processor, though. -funroll-loops also sometimes gives good results (and also implies -frename-registers), but it depends on your actual code.

PGO

GCC has Profile-Guided Optimisation (PGO) features. There isn't a lot of precise GCC documentation about them, but nevertheless getting them to run is quite straightforward.



  • first compile your program with -fprofile-generate.
  • let the program run (the execution time will be significantly slower as the code is also generating profile information into .gcda files).
  • recompile the program with -fprofile-use. If your application is multi-threaded, also add the -fprofile-correction flag.

PGO with GCC can give amazing results and really significantly boost performance (I've seen a 15-20% speed increase on one of the projects I was recently working on). Obviously the issue here is to have some data that is sufficiently representative of your application's execution, which is not always available or easy to obtain.

GCC's parallel mode

GCC features a Parallel Mode, which was first released around the time GCC 4.2 came out.

Basically, it provides you with parallel implementations of many of the algorithms in the C++ Standard Library. To enable them globally, you just have to add the -fopenmp and the -D_GLIBCXX_PARALLEL flags to the compiler. You can also selectively enable each algorithm when needed, but this will require some minor code changes.

All the information about this parallel mode can be found here.

If you frequently use these algorithms on large data structures, and have many hardware thread contexts available, these parallel implementations can give a huge performance boost. I have only made use of the parallel implementation of sort so far, but to give a rough idea, I managed to reduce the sorting time from 14 to 4 seconds in one of my applications (testing environment: a vector of 100 million objects with a custom comparator function, on an 8-core machine).

Other tricks

Unlike the previous sections, this part does require some small changes in the code. The techniques are also GCC-specific (some of them work on Clang as well), so compile-time macros should be used to keep the code portable across other compilers. This section contains some more advanced techniques, and should not be used if you don't have some assembly-level understanding of what's going on. Also note that processors and compilers are pretty smart nowadays, so it may be tricky to get any noticeable benefit from the functions described here.



  • GCC builtins, which are listed here. Constructs such as __builtin_expect can help the compiler do better optimisations by providing it with branch-prediction information. Other constructs such as __builtin_prefetch bring data into a cache before it is accessed, and can help reduce cache misses.
  • function attributes, which are listed here. In particular, you should look into the hot and cold attributes; the former will indicate to the compiler that the function is a hotspot of the program, causing it to be optimised more aggressively and placed in a special subsection of the text section for better locality; the latter will optimise the function for size and place it in another special subsection of the text section.

I hope this answer will prove useful for some developers, and I will be glad to consider any edits or suggestions.
