Effective optimization strategies on modern C++ compilers


Question

I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots.

It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time.

I know that optimization is very compiler- and architecture-dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler."

Here are some concrete ideas I'm considering:


  • Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
  • Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
  • I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
  • How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
  • Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
  • One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?
  • On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?

Lastly, to nip certain kinds of answers in the bud:


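  • I understand that optimization has a cost in terms of complexity, reliability, and maintainability. For this particular application, the increased performance is worth these costs.
  • I understand that the best optimizations are often improvements to the high-level algorithms, and this has already been done.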

Recommended answer


Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?

I assume you're aware that the STL containers rely on copying the elements. In certain cases, this can be a significant loss. Store pointers and you may see an increase in performance if you do a lot of container manipulation. OTOH, it may reduce cache locality and hurt you. Another option is to use specialized allocators.

Certain containers (e.g. map, set, list) rely on lots of pointer manipulation. Although counterintuitive, it can often lead to faster code to replace them with vector. The resulting algorithm might go from O(1) or O(log n) to O(n), but due to cache locality it can be much faster in practice. Profile to be sure.
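As a rough illustration of swapping a node-based container for a vector, here is a minimal sketch of a map replacement that keeps key/value pairs sorted in one contiguous std::vector. The name FlatMap, the int/double types, and C++11 are assumptions made for the example, not something from the question. Lookups stay O(log n) via binary search, while inserts become O(n) because later elements have to shift.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical flat map: key/value pairs kept sorted by key in one vector.
class FlatMap {
public:
    // Insert or overwrite. O(n) in the worst case because later elements
    // shift, but the shift is a contiguous, cache-friendly move.
    void set(int key, double value) {
        auto it = std::lower_bound(items_.begin(), items_.end(), key, keyLess);
        if (it != items_.end() && it->first == key) {
            it->second = value;
        } else {
            items_.insert(it, std::make_pair(key, value));
        }
        // Debug-only invariant check; compiled out when NDEBUG is defined.
        assert(std::is_sorted(items_.begin(), items_.end()));
    }

    // Binary search, O(log n). Returns nullptr when the key is absent.
    const double* find(int key) const {
        auto it = std::lower_bound(items_.begin(), items_.end(), key, keyLess);
        return (it != items_.end() && it->first == key) ? &it->second : nullptr;
    }

    std::size_t size() const { return items_.size(); }

private:
    // Comparator in the (element, value) form that std::lower_bound expects.
    static bool keyLess(const std::pair<int, double>& item, int key) {
        return item.first < key;
    }

    std::vector<std::pair<int, double>> items_;
};
```

Whether this beats std::map depends on the element count and on the mix of inserts and lookups, so it only makes sense as a candidate to measure.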

You mentioned you're using priority_queue, which I would imagine pays a lot for rearranging the elements, especially if they're large. You can try switching the underlying container (maybe deque or specialized). I'd almost certainly store pointers - again, profile to be sure.
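One way to try the "store pointers" idea without owning raw pointers is to keep the large objects in a separate pool and let the priority_queue hold indices into it. The sketch below assumes C++11 and uses made-up names (Event, ByPriority, IndexQueue, makeQueue). It uses indices rather than pointers so that growing the pool does not invalidate queue entries; pointers would work the same way if the objects never move.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical large element; only `priority` matters for ordering.
struct Event {
    double priority;
    double payload[32];
};

// Orders indices by the priority of the Event each one refers to.
struct ByPriority {
    const std::vector<Event>* pool;
    bool operator()(std::size_t a, std::size_t b) const {
        return (*pool)[a].priority < (*pool)[b].priority;
    }
};

using IndexQueue =
    std::priority_queue<std::size_t, std::vector<std::size_t>, ByPriority>;

// Builds a max-heap of indices into `pool`. Heap operations then swap small
// indices instead of whole Event objects. `pool` must outlive the queue.
IndexQueue makeQueue(const std::vector<Event>& pool) {
    IndexQueue q(ByPriority{&pool});
    for (std::size_t i = 0; i < pool.size(); ++i) {
        q.push(i);
    }
    assert(q.size() == pool.size());
    return q;
}
```

The tradeoff mentioned above still applies: each comparison now follows an index into the pool, which can cost cache misses that the by-value heap avoided.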


Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?

Again, this may help a small amount, depending on the use case. You can avoid the heap allocation, but only if you don't need your array to outlive the stack frame that owns it... or you could reserve() the size in the vector so there is less copying on reallocation.
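Both options might look roughly like this. It is a sketch assuming C++11; kMaxItems, SmallBuffer and fillReserved are hypothetical names, and the bound of 16 is arbitrary.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxItems = 16;  // assumed small upper bound

// Option 1: fixed-capacity buffer. No heap allocation, but the storage
// lives only as long as the object (typically a stack frame).
struct SmallBuffer {
    std::array<double, kMaxItems> data{};
    std::size_t count = 0;

    void push(double x) {
        assert(count < kMaxItems);  // the caller must respect the bound
        data[count++] = x;
    }
};

// Option 2: keep std::vector but allocate once up front, so the
// push_back calls below never trigger a reallocation and copy.
void fillReserved(std::vector<double>& out, std::size_t n) {
    out.reserve(out.size() + n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<double>(i));
    }
}
```

Option 2 keeps the normal vector interface and still pays for one allocation; Option 1 removes the allocation entirely at the price of a hard capacity limit.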


I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?

You could look at the generated assembly to see if RVO is applied, but if you return a pointer or reference, you can be sure there's no copy. Whether this will help depends on what you're doing - e.g. you can't return references to temporaries. You can use arenas to allocate and reuse objects, so as not to pay a large heap penalty.
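As an alternative to reading assembly, one rough way to see whether a return path copies is to instrument a type and count copy constructions. The sketch below assumes C++11, where a returned local is moved if the copy is not elided. Tracer, makeTracer, returnsWithoutCopy and fillSquares are hypothetical names. The last function shows the pass-by-reference variant, which lets the caller reuse one buffer across calls.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical instrumented type: counts copy and move constructions.
struct Tracer {
    static int copies;
    static int moves;
    std::vector<double> data;

    explicit Tracer(std::size_t n) : data(n) {}
    Tracer(const Tracer& other) : data(other.data) { ++copies; }
    Tracer(Tracer&& other) noexcept : data(std::move(other.data)) { ++moves; }
};
int Tracer::copies = 0;
int Tracer::moves = 0;

Tracer makeTracer(std::size_t n) {
    Tracer t(n);
    return t;  // NRVO candidate; otherwise the move constructor is used
}

// Returns true when obtaining a Tracer by value made no deep copy.
bool returnsWithoutCopy() {
    const int before = Tracer::copies;
    Tracer t = makeTracer(1000);
    assert(t.data.size() == 1000);
    return Tracer::copies == before;
}

// Out-parameter variant: the caller owns `out`, so repeated calls reuse
// its existing capacity instead of allocating a fresh buffer each time.
void fillSquares(std::vector<double>& out, std::size_t n) {
    out.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<double>(i) * static_cast<double>(i);
    }
}
```

Comparing Tracer::moves before and after a call additionally tells you whether the compiler elided the move as well, which is what RVO proper would do.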


How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?

I've seen dramatic (seriously dramatic) speedups in this realm. I saw more improvements from this than I later saw from multithreading my code. Things may have changed in the five years since - only one way to be sure - profile.
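To make the loop-reordering point concrete, here is a minimal sketch assuming a row-major matrix stored in a single vector. Matrix, sumColumnFirst and sumRowFirst are names invented for the example. Both functions compute the same sum; they differ only in which index the inner loop walks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix: element (i, j) lives at a[i * cols + j].
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> a;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), a(r * c, 0.0) {}

    double& at(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return a[i * cols + j];
    }
    double at(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return a[i * cols + j];
    }
};

// Inner loop jumps `cols` elements per step: poor locality when the
// matrix is much larger than the cache.
double sumColumnFirst(const Matrix& m) {
    double s = 0.0;
    for (std::size_t j = 0; j < m.cols; ++j)
        for (std::size_t i = 0; i < m.rows; ++i)
            s += m.at(i, j);
    return s;
}

// Same result, loops swapped: the inner loop walks adjacent memory.
double sumRowFirst(const Matrix& m) {
    double s = 0.0;
    for (std::size_t i = 0; i < m.rows; ++i)
        for (std::size_t j = 0; j < m.cols; ++j)
            s += m.at(i, j);
    return s;
}
```

Some compilers can perform this interchange themselves in simple cases, so timing both versions on the real data sizes is the only reliable check.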


On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?




  • Use explicit on your single-argument constructors. Temporary object construction and destruction may be hidden in your code.
  • Be aware of hidden copy constructor calls on large objects. In some cases, consider replacing with pointers.
  • Profile, profile, profile. Tune areas that are bottlenecks.
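The first two points in the list above can be sketched as follows, assuming C++11; Samples, total and demo are hypothetical names. With explicit on the constructor, a call such as total(1000) fails to compile instead of silently building and destroying a 1000-element temporary. Taking the argument by const reference avoids a hidden copy of the large object.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical large object with a construction counter.
struct Samples {
    static int constructed;
    std::vector<double> values;

    // explicit: a bare size_t can no longer convert to Samples implicitly.
    explicit Samples(std::size_t n) : values(n, 1.0) { ++constructed; }
};
int Samples::constructed = 0;

static_assert(!std::is_convertible<std::size_t, Samples>::value,
              "explicit blocks the implicit size_t -> Samples conversion");

// Takes a const reference, so no copy of the underlying vector is made.
double total(const Samples& s) {
    double sum = 0.0;
    for (double v : s.values) {
        sum += v;
    }
    return sum;
}

// Usage sketch: one visible construction, then a by-reference call.
double demo() {
    Samples s(4);
    assert(Samples::constructed >= 1);
    return total(s);
}
```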
