How to ask GCC to completely unroll this loop (i.e., peel this loop)?
Problem description
Is there a way to instruct GCC (I am using version 4.8.4) to completely unroll the while loop in the function at the bottom, i.e., to peel the loop? The number of iterations is known at compile time: 58.
Let me first explain what I have tried.
By checking the GAS output:
gcc -fpic -O2 -S GEPDOT.c
12 registers, XMM0 - XMM11, are used. If I pass the flag -funroll-loops to GCC:
gcc -fpic -O2 -funroll-loops -S GEPDOT.c
the loop is only unrolled two times. I checked the GCC optimization options. The GCC documentation says that -funroll-loops also turns on -frename-registers, so when GCC unrolls a loop, its first choice for register allocation is to use "left over" registers. But only 4 are left over, XMM12 - XMM15, so GCC can unroll 2 times at best. Had there been 48 instead of 16 XMM registers available, GCC would unroll the while loop 4 times without trouble.
Yet I did another experiment. I first unrolled the while loop two times manually, obtaining a function GEPDOT_2. Then there is no difference at all between
gcc -fpic -O2 -S GEPDOT_2.c
and
gcc -fpic -O2 -funroll-loops -S GEPDOT_2.c
Since GEPDOT_2 already uses up all the registers, no unrolling is performed.
GCC does register renaming to avoid introducing potential false dependencies. But I know for sure that there is no such potential in my GEPDOT; even if there were, it would not matter. I tried unrolling the loop myself: unrolling 4 times is faster than unrolling 2 times, which in turn is faster than no unrolling. Of course I can manually unroll more times, but it is tedious. Can GCC do this for me? Thanks.
// C file "GEPDOT.c"
#include <emmintrin.h>

void GEPDOT (double *A, double *B, double *C) {
    __m128d A1_vec = _mm_load_pd(A); A += 2;
    __m128d B_vec = _mm_load1_pd(B); B++;
    __m128d C1_vec = A1_vec * B_vec;
    __m128d A2_vec = _mm_load_pd(A); A += 2;
    __m128d C2_vec = A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    __m128d C3_vec = A1_vec * B_vec;
    __m128d C4_vec = A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    __m128d C5_vec = A1_vec * B_vec;
    __m128d C6_vec = A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    __m128d C7_vec = A1_vec * B_vec;
    A1_vec = _mm_load_pd(A); A += 2;
    __m128d C8_vec = A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    int k = 58;
    /* can compiler unroll the loop completely (i.e., peel this loop)? */
    while (k--) {
        C1_vec += A1_vec * B_vec;
        A2_vec = _mm_load_pd(A); A += 2;
        C2_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        C3_vec += A1_vec * B_vec;
        C4_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        C5_vec += A1_vec * B_vec;
        C6_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        C7_vec += A1_vec * B_vec;
        A1_vec = _mm_load_pd(A); A += 2;
        C8_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
    }
    C1_vec += A1_vec * B_vec;
    A2_vec = _mm_load_pd(A);
    C2_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C3_vec += A1_vec * B_vec;
    C4_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C5_vec += A1_vec * B_vec;
    C6_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B);
    C7_vec += A1_vec * B_vec;
    C8_vec += A2_vec * B_vec;
    /* [write-back] */
    A1_vec = _mm_load_pd(C); C1_vec = A1_vec - C1_vec;
    A2_vec = _mm_load_pd(C + 2); C2_vec = A2_vec - C2_vec;
    A1_vec = _mm_load_pd(C + 4); C3_vec = A1_vec - C3_vec;
    A2_vec = _mm_load_pd(C + 6); C4_vec = A2_vec - C4_vec;
    A1_vec = _mm_load_pd(C + 8); C5_vec = A1_vec - C5_vec;
    A2_vec = _mm_load_pd(C + 10); C6_vec = A2_vec - C6_vec;
    A1_vec = _mm_load_pd(C + 12); C7_vec = A1_vec - C7_vec;
    A2_vec = _mm_load_pd(C + 14); C8_vec = A2_vec - C8_vec;
    _mm_store_pd(C, C1_vec); _mm_store_pd(C + 2, C2_vec);
    _mm_store_pd(C + 4, C3_vec); _mm_store_pd(C + 6, C4_vec);
    _mm_store_pd(C + 8, C5_vec); _mm_store_pd(C + 10, C6_vec);
    _mm_store_pd(C + 12, C7_vec); _mm_store_pd(C + 14, C8_vec);
}
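The manual unrolling described above can be made somewhat less tedious by letting the preprocessor replicate the loop body. The following is only a minimal sketch of that idea, not the poster's code: the macros REP2/REP4/REP8 and the function sum58_unrolled are hypothetical names, and a scalar sum stands in for the SSE2 kernel so the example stays portable.

```c
/* Hypothetical helper macros: paste a statement block a fixed number of times. */
#define REP2(body) body body
#define REP4(body) REP2(body) REP2(body)
#define REP8(body) REP4(body) REP4(body)

/* Sum 58 = 7 * 8 + 2 elements; the body is replicated by the preprocessor,
   so the compiler sees straight-line code inside each trip. */
double sum58_unrolled(const double *x)
{
    double s = 0.0;
    int k = 0;
    for (int i = 0; i < 7; ++i) {
        REP8(s += x[k]; ++k;)   /* 8 copies of the body per trip */
    }
    REP2(s += x[k]; ++k;)       /* remaining 2 iterations */
    return s;
}
```

The unroll factor is still fixed by hand here, so this reduces typing rather than replacing a compiler-driven peel.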
Update 1
Thanks to the comment by @user3386109, I would like to extend this question a little. @user3386109 raises a very good question. I actually do have some doubts about the compiler's ability to allocate registers optimally when there are so many parallel instructions to schedule.
I personally think that a reliable way is to first code the loop body (which is the key part for HPC) in asm inline assembly, then duplicate it as many times as I want. I had an unpopular post earlier this year: inline assembly (http://stackoverflow.com/questions/35189619/inline-assembly-in-c-assembler-messages-error-unknown-pseudo-op). The code there was a little different because the number of loop iterations, j, is a function argument and hence unknown at compile time. In that case I cannot fully unroll the loop, so I only duplicated the assembly code twice and converted the loop into a label and a jump. It turned out that my hand-written assembly performs about 5% better than the compiler-generated assembly, which might suggest that the compiler fails to allocate registers in the optimal manner we expect.
I was (and still am) a beginner in assembly coding, so that served as a good case study for me to learn a little x86 assembly. But in the long run I am not inclined to write a large proportion of GEPDOT in assembly. There are mainly three reasons:
- asm inline assembly has been criticized for not being portable, though I don't understand why. Perhaps because different machines have different registers clobbered?
- Compilers are also getting better. So I would still prefer algorithmic optimization and better C coding habits to help the compiler generate good output;
- The last reason is the most important. The number of iterations may not always be 58. I am developing a high performance matrix factorization subroutine. For a cache blocking factor nb, the number of iterations would be (nb-2). I am not going to pass nb as a function argument, as I did in the earlier post. It is a machine-specific parameter that will be defined as a macro. So the number of iterations is known at compile time, but may vary from machine to machine. Imagine how much tedious work manual loop unrolling would be for a variety of nb. So if there is a way to simply instruct the compiler to peel a loop, that would be great.
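To make the last point concrete, here is a minimal sketch of a kernel whose trip count comes from a macro, so it remains a compile-time constant that the compiler can in principle peel. The names NB, TRIPS and dot_block are assumptions made for illustration (the question only says nb will be a macro), and the scalar loop is a stand-in for the SSE2 kernel.

```c
/* Hypothetical machine-specific blocking factor; override with -DNB=... */
#ifndef NB
#define NB 60
#endif

/* Trip count known at compile time: NB - 2 (58 with the default above). */
enum { TRIPS = NB - 2 };

/* Scalar stand-in for the kernel: two independent accumulators per trip. */
double dot_block(const double *a, const double *b)
{
    double c0 = 0.0, c1 = 0.0;
    for (int k = 0; k < TRIPS; ++k) {
        c0 += a[2 * k]     * b[k];
        c1 += a[2 * k + 1] * b[k];
    }
    return c0 + c1;
}
```

Building with a different -DNB value changes the trip count without touching the source, which is the property the question wants to keep while still getting a fully unrolled loop.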
I would very much appreciate it if you could also share some experience in producing a high performance, yet portable, library.
Recommended answer
Try tweaking the optimization parameters:
gcc -O3 -funroll-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=100
That should do the trick.
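On newer compilers a per-loop request may be an alternative to global --param tuning. This is a minimal sketch that assumes GCC 8 or later, where the `#pragma GCC unroll n` directive exists; it is not available in the 4.8.4 used in the question. The function name sum58 is hypothetical, and the scalar loop only stands in for the SSE2 kernel.

```c
/* Sum 58 elements; the pragma asks GCC (8 or later) to unroll the loop
   that immediately follows by a factor of 58, i.e. completely.
   Compilers that do not know the pragma ignore it, possibly with a warning. */
double sum58(const double *x)
{
    double s = 0.0;
#pragma GCC unroll 58
    for (int k = 0; k < 58; ++k)
        s += x[k];
    return s;
}
```

Whether the loop was actually peeled is best confirmed by inspecting the assembly produced with -S, as done in the question.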