Is the GCC loop unrolling flag really effective?

Question

In C, I have a task where I must do multiplication, inversion, transposition, addition, etc. on huge matrices allocated as 2-dimensional arrays (arrays of arrays).

I have found the gcc flag -funroll-all-loops. If I understand correctly, this will unroll all loops automatically without any effort by the programmer.

My questions:

a) Does gcc include this kind of optimization with the various optimization flags such as -O1, -O2, etc.?

b) Do I have to use any pragmas inside my code to take advantage of loop unrolling, or are loops identified automatically?

c) Why is this option not enabled by default if the unrolling increases the performance?

d) What are the recommended gcc optimization flags to compile the program in the best way possible? (I must run this program optimized for a single CPU family, the same as that of the machine where I compile the code; currently I use the -march=native and -O2 flags.)

Edit

It seems there is some controversy about unrolling, which in some cases may slow performance down. In my case there are various functions that simply do math operations in 2 nested for loops, iterating over a huge number of matrix elements. In this scenario, how could unrolling slow down or increase the performance?

Recommended answer

Why unroll loops?

Modern processors pipeline instructions. They like knowing what's coming next and make all sorts of fancy optimisations based on assumptions of which order the instructions should be executed.

At the end of a loop though, there are two possibilities! Either you go back to the top, or continue on. The processor makes an educated guess on which is going to happen. If it gets it right, everything is good. If not, it has to flush the pipeline and stall for a bit while it prepares for taking the other branch.

As you can imagine, unrolling a loop eliminates branches and the potential for those stalls, especially in cases where the odds are against a guess.

Imagine a loop of code that executes 3 times, then continues. If you assume (as the processor probably would) that you'll repeat the loop each time you reach its end, you'll be correct 2/3 of the time! 1/3 of the time, though, you'll stall.

On the other hand, imagine the same situation, but the code loops 3000 times. Here, unrolling probably gains you something only 1/3000 of the time.
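As a minimal sketch of the three-iteration case above, the rolled and hand-unrolled forms might look like this in C. The function names sum3_rolled and sum3_unrolled are made up for illustration and do not come from the original answer.

```c
#include <stddef.h>

/* Rolled form: the loop condition is a branch evaluated on every pass. */
int sum3_rolled(const int v[3])
{
    int total = 0;
    for (size_t i = 0; i < 3; i++)
        total += v[i];
    return total;
}

/* Fully unrolled by hand: same arithmetic, no loop branch left to predict. */
int sum3_unrolled(const int v[3])
{
    return v[0] + v[1] + v[2];
}
```

Both functions return the same value for the same input. With optimisation enabled, a compiler may well turn the first form into the second on its own.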

Part of the processor fanciness mentioned above involves loading the instructions from the executable in memory into the processor's onboard instruction cache (shortened to I-cache). The I-cache holds a limited number of instructions that can be accessed quickly, but the processor may stall when new instructions need to be loaded from memory.

Let's go back to the previous examples. Assume a reasonably small amount of code inside the loop takes up n bytes of I-cache. If we unroll the loop, it's now taking up n * 3 bytes. A bit more, but it'll probably fit in a single cache line just fine so your cache will be working optimally and not needing to stall reading from main memory.

The 3000-loop, however, unrolls to use a whopping n * 3000 bytes of I-cache. That's going to require several reads from memory, and probably push some other useful stuff from elsewhere in the program out of the I-cache.

As you can see, unrolling provides more benefits for shorter loops but ends up trashing performance if you're intending to loop a large number of times.
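Between the two extremes sits partial unrolling, where the loop body is repeated a small fixed number of times and a short remainder loop handles the leftovers. The sketch below is an illustration only: the helper name add_unroll4 and the factor of 4 are assumptions, and it does not claim to show what GCC actually emits.

```c
#include <stddef.h>

/* Hypothetical helper: dst[i] = a[i] + b[i], unrolled by a factor of 4. */
void add_unroll4(double *dst, const double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Main body: one loop branch per four elements. */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The code size grows by a small constant factor rather than by the iteration count, so the I-cache cost stays bounded while the number of loop branches drops to roughly a quarter.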

Usually, a smart compiler will take a decent guess about which loops to unroll but you can force it if you're sure you know better. How do you get to know better? The only way is to try it both ways and compare timings!
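One way to try it both ways, sketched under some assumptions: GCC 8 and later accept `#pragma GCC unroll n` immediately before a loop, so the same loop can be written once with the compiler's default choice and once with a forced factor. The function names, the factor of 8, and the timing helper time_sum are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <time.h>

/* Left to the compiler's own unrolling heuristics. */
long sum_default(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Asks GCC (8 or later) to unroll this loop 8 times; other compilers may
   ignore the pragma or warn about it. */
long sum_forced(const int *a, size_t n)
{
    long total = 0;
#pragma GCC unroll 8
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Hypothetical timing helper: runs fn `reps` times, stores the last result
   in *out, and returns the elapsed CPU time in seconds. */
double time_sum(long (*fn)(const int *, size_t), const int *a, size_t n,
                int reps, long *out)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        *out = fn(a, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call time_sum once with each variant on a realistically sized array and compare the returned seconds. It is also worth repeating the comparison with and without -funroll-loops on the command line. An optimiser may collapse repeated identical calls, so check that the timings scale with reps before trusting them.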

"Premature optimisation is the root of all evil" - Donald Knuth

Profile first, optimise later.
