How to ask GCC to completely unroll this loop (i.e., peel this loop)?

Views: 1697 Posted: 2016/8/18 14:51:30 Tags: c loops gcc hpc loop-unrolling

Question

Is there a way to instruct GCC (I am using version 4.8.4) to completely unroll the while loop in the function at the bottom, i.e., to peel this loop? The number of iterations of the loop is known at compile time: 58.

Let me first explain what I have tried.

By checking GAS output:

gcc -fpic -O2 -S GEPDOT.c

12 registers, XMM0 - XMM11, are used. If I pass the flag -funroll-loops to GCC:

gcc -fpic -O2 -funroll-loops -S GEPDOT.c

the loop is only unrolled two times. I checked the GCC optimization options. The GCC documentation says that -funroll-loops turns on -frename-registers as well, so when GCC unrolls a loop, its preferred choice for register allocation is to use "left over" registers. But there are only 4 left over, XMM12 - XMM15, so GCC can unroll only 2 times at best. Had there been 48 instead of 16 XMM registers available, GCC would have unrolled the while loop 4 times without trouble.

Yet I did another experiment. I first unrolled the while loop two times manually, obtaining a function GEPDOT_2. Then there is no difference at all between

gcc -fpic -O2 -S GEPDOT_2.c

and

gcc -fpic -O2 -funroll-loops -S GEPDOT_2.c

Since GEPDOT_2 already used up all registers, no unrolling is performed.

GCC does register renaming to avoid introducing potential false dependencies. But I know for sure that there is no such potential in my GEPDOT; even if there were, it would not matter. I tried unrolling the loop myself: unrolling 4 times is faster than unrolling 2 times, which in turn is faster than no unrolling. Of course I can manually unroll more times, but it is tedious. Can GCC do this for me? Thanks.

// C file "GEPDOT.c"
#include <emmintrin.h>

void GEPDOT (double *A, double *B, double *C) {
  __m128d A1_vec = _mm_load_pd(A); A += 2;
  __m128d B_vec = _mm_load1_pd(B); B++;
  __m128d C1_vec = A1_vec * B_vec;
  __m128d A2_vec = _mm_load_pd(A); A += 2;
  __m128d C2_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  __m128d C3_vec = A1_vec * B_vec;
  __m128d C4_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  __m128d C5_vec = A1_vec * B_vec;
  __m128d C6_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  __m128d C7_vec = A1_vec * B_vec;
  A1_vec = _mm_load_pd(A); A += 2;
  __m128d C8_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  int k = 58;
  /* can compiler unroll the loop completely (i.e., peel this loop)? */
  while (k--) {
    C1_vec += A1_vec * B_vec;
    A2_vec = _mm_load_pd(A); A += 2;
    C2_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C3_vec += A1_vec * B_vec;
    C4_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C5_vec += A1_vec * B_vec;
    C6_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C7_vec += A1_vec * B_vec;
    A1_vec = _mm_load_pd(A); A += 2;
    C8_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
  }
  C1_vec += A1_vec * B_vec;
  A2_vec = _mm_load_pd(A);
  C2_vec += A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  C3_vec += A1_vec * B_vec;
  C4_vec += A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  C5_vec += A1_vec * B_vec;
  C6_vec += A2_vec * B_vec;
  B_vec = _mm_load1_pd(B);
  C7_vec += A1_vec * B_vec;
  C8_vec += A2_vec * B_vec;
  /* [write-back] */
  A1_vec = _mm_load_pd(C); C1_vec = A1_vec - C1_vec;
  A2_vec = _mm_load_pd(C + 2); C2_vec = A2_vec - C2_vec;
  A1_vec = _mm_load_pd(C + 4); C3_vec = A1_vec - C3_vec;
  A2_vec = _mm_load_pd(C + 6); C4_vec = A2_vec - C4_vec;
  A1_vec = _mm_load_pd(C + 8); C5_vec = A1_vec - C5_vec;
  A2_vec = _mm_load_pd(C + 10); C6_vec = A2_vec - C6_vec;
  A1_vec = _mm_load_pd(C + 12); C7_vec = A1_vec - C7_vec;
  A2_vec = _mm_load_pd(C + 14); C8_vec = A2_vec - C8_vec;
  _mm_store_pd(C,C1_vec); _mm_store_pd(C + 2,C2_vec);
  _mm_store_pd(C + 4,C3_vec); _mm_store_pd(C + 6,C4_vec);
  _mm_store_pd(C + 8,C5_vec); _mm_store_pd(C + 10,C6_vec);
  _mm_store_pd(C + 12,C7_vec); _mm_store_pd(C + 14,C8_vec);
}
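
The hand-unrolled variant GEPDOT_2 mentioned earlier is not listed in the question. As a rough illustration of what unrolling by two means, here is a minimal scalar sketch. The function names dot_rolled and dot_unrolled2 are hypothetical, and the body is a plain dot product rather than the SSE2 kernel above, so this shows only the shape of the transformation, not the asker's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one multiply-add per iteration. */
double dot_rolled(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-unrolled by two: the body appears twice per iteration, with
   separate accumulators so the two multiply-adds are independent.
   Assumes n is even (58 is), so no remainder loop is needed. */
double dot_unrolled2(const double *a, const double *b, size_t n) {
    assert(n % 2 == 0);
    double sum0 = 0.0, sum1 = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        sum0 += a[i]     * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    return sum0 + sum1;
}
```

Each extra copy of the body needs its own accumulators, which is why register pressure grows with the unrolling factor.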

Update 1

Thanks to the comment by @user3386109, I would like to extend this question a little bit. @user3386109 raises a very good question. Actually I do have some doubt about the compiler's ability to allocate registers optimally when there are so many parallel instructions to schedule.

I personally think that a reliable way is to first code the loop body (which is key to HPC) in asm inline assembly, then duplicate it as many times as I want. I had an unpopular post earlier this year: inline assembly (http://stackoverflow.com/questions/35189619/inline-assembly-in-c-assembler-messages-error-unknown-pseudo-op). The code was a little different because the number of loop iterations, j, is a function argument and hence unknown at compile time. In that case I cannot fully unroll the loop, so I only duplicated the assembly code twice, and converted the loop into a label and jump. It turned out that the performance of my hand-written assembly is about 5% higher than that of the compiler-generated assembly, which might suggest that the compiler fails to allocate registers in the expected, optimal manner.

I was (and still am) a baby in assembly coding, so that served as a good case study for me to learn a little x86 assembly. But in the long run I am not inclined to code a large proportion of GEPDOT in assembly. There are three main reasons:

asm inline assembly has been criticized for not being portable, though I don't understand why. Perhaps because different machines have different registers clobbered?

Compilers are also getting better. So I would still prefer algorithmic optimization and better C coding habits to assist the compiler in generating good output;

The last reason is the most important. The number of iterations may not always be 58. I am developing a high performance matrix factorization subroutine. For a cache blocking factor nb, the number of iterations would be (nb-2). I am not going to pass nb as a function argument, as I did in the earlier post. It is a machine-specific parameter that will be defined as a macro. So the number of iterations is known at compile time, but may vary from machine to machine. Guess how much tedious work I would have to do in manual loop unrolling for a variety of nb. So if there is a way to simply instruct the compiler to peel a loop, that would be great.
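
The compile-time trip count described in the last point could be set up roughly as follows. This is a sketch, not the asker's code: the macro name NB, its default of 60 (chosen so that NB - 2 equals the 58 in the question), the constant INNER_ITERS, and the function sum_inner are all assumptions made for illustration.

```c
#include <assert.h>

/* Cache blocking factor; hypothetical default, meant to be overridden
   per machine, e.g. with -DNB=128 on the compiler command line. */
#ifndef NB
#define NB 60
#endif

/* The trip count is a compile-time constant, so the compiler is free to
   peel the loop completely if its heuristics and limits allow it. */
enum { INNER_ITERS = NB - 2 };

/* Sums the first INNER_ITERS elements of x. */
double sum_inner(const double *x) {
    assert(x != 0);
    double s = 0.0;
    for (int k = 0; k < INNER_ITERS; k++)
        s += x[k];
    return s;
}
```

Because the source never spells out 58, changing the blocking factor for another machine only requires a different -DNB value, not another round of manual unrolling.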

I would very much appreciate it if you could also share some experience in producing a high performance, yet portable library.

Recommended answer

Try tweaking the optimization parameters:

gcc -O3 -funroll-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=100

This should do the trick.
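
The two --param values raise limits that GCC applies when deciding whether to peel a loop completely: max-completely-peel-times bounds the iteration count and max-completely-peeled-insns bounds the size of the peeled code. As a source-level alternative that is not part of the answer above, newer GCC releases (8 and later, as far as I know) accept a per-loop unrolling pragma. The sketch below is only an assumption-laden illustration: the function dot58 and the macro ITERS are hypothetical, the kernel is a scalar stand-in for GEPDOT, and the asker's GCC 4.8.4 would not recognize the pragma.

```c
#include <assert.h>

#define ITERS 58  /* trip count from the question */

/* Accumulates ITERS products; assumes a and b each hold ITERS doubles. */
double dot58(const double *a, const double *b) {
    assert(a != 0 && b != 0);
    double sum = 0.0;
    /* Asks GCC 8 or newer for an unrolling factor of 58. Older compilers
       such as 4.8.4 do not know this pragma and ignore it (possibly with
       an unknown-pragma warning), so there the --param flags still apply. */
#pragma GCC unroll 58
    for (int k = 0; k < ITERS; k++)
        sum += a[k] * b[k];
    return sum;
}
```

Whether the loop really ends up fully unrolled is best confirmed by inspecting the -S output, as in the question.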
