How to ask GCC to completely unroll this loop (i.e., peel this loop)?


Problem description

Is there a way to instruct GCC (I used version 4.8.4) to completely unroll the while loop in the function at the bottom, i.e., to peel this loop? The number of iterations is known at compile time: 58.

Let me first explain what I have tried.

By checking the GAS output of:

gcc -fpic -O2 -S GEPDOT.c

I see that 12 registers, XMM0 - XMM11, are used. If I pass the flag -funroll-loops to GCC:

gcc -fpic -O2 -funroll-loops -S GEPDOT.c

the loop is only unrolled two times. I checked the GCC optimization options. The GCC documentation says that -funroll-loops turns on -frename-registers as well, so when GCC unrolls a loop, its preferred choice for register allocation is to use "left over" registers. But only 4 are left over, XMM12 - XMM15, so GCC can unroll 2 times at best. Had there been 48 instead of 16 XMM registers available, I expect GCC would unroll the while loop 4 times without trouble.

Yet I did another experiment. I first unrolled the while loop two times manually, obtaining a function GEPDOT_2. Then there is no difference at all between the output of

gcc -fpic -O2 -S GEPDOT_2.c

gcc -fpic -O2 -funroll-loops -S GEPDOT_2.c

Since GEPDOT_2 already uses up all registers, no unrolling is performed.

GCC does register renaming to avoid introducing potential false dependencies. But I know for sure that there is no such potential in my GEPDOT; even if there were, it would not matter. I tried unrolling the loop myself: unrolling 4 times is faster than unrolling 2 times, which in turn is faster than no unrolling. Of course I can manually unroll more times, but it is tedious. Can GCC do this for me? Thanks.

// C file "GEPDOT.c"
#include <emmintrin.h>

void GEPDOT (double *A, double *B, double *C) {
  __m128d A1_vec = _mm_load_pd(A); A += 2;
  __m128d B_vec = _mm_load1_pd(B); B++;
  __m128d C1_vec = A1_vec * B_vec;
  __m128d A2_vec = _mm_load_pd(A); A += 2;
  __m128d C2_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  __m128d C3_vec = A1_vec * B_vec;
  __m128d C4_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  __m128d C5_vec = A1_vec * B_vec;
  __m128d C6_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  __m128d C7_vec = A1_vec * B_vec;
  A1_vec = _mm_load_pd(A); A += 2;
  __m128d C8_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  int k = 58;
  /* can compiler unroll the loop completely (i.e., peel this loop)? */
  while (k--) {
    C1_vec += A1_vec * B_vec;
    A2_vec = _mm_load_pd(A); A += 2;
    C2_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C3_vec += A1_vec * B_vec;
    C4_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C5_vec += A1_vec * B_vec;
    C6_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C7_vec += A1_vec * B_vec;
    A1_vec = _mm_load_pd(A); A += 2;
    C8_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
  }
  C1_vec += A1_vec * B_vec;
  A2_vec = _mm_load_pd(A);
  C2_vec += A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  C3_vec += A1_vec * B_vec;
  C4_vec += A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  C5_vec += A1_vec * B_vec;
  C6_vec += A2_vec * B_vec;
  B_vec = _mm_load1_pd(B);
  C7_vec += A1_vec * B_vec;
  C8_vec += A2_vec * B_vec;
  /* [write-back] */
  A1_vec = _mm_load_pd(C); C1_vec = A1_vec - C1_vec;
  A2_vec = _mm_load_pd(C + 2); C2_vec = A2_vec - C2_vec;
  A1_vec = _mm_load_pd(C + 4); C3_vec = A1_vec - C3_vec;
  A2_vec = _mm_load_pd(C + 6); C4_vec = A2_vec - C4_vec;
  A1_vec = _mm_load_pd(C + 8); C5_vec = A1_vec - C5_vec;
  A2_vec = _mm_load_pd(C + 10); C6_vec = A2_vec - C6_vec;
  A1_vec = _mm_load_pd(C + 12); C7_vec = A1_vec - C7_vec;
  A2_vec = _mm_load_pd(C + 14); C8_vec = A2_vec - C8_vec;
  _mm_store_pd(C,C1_vec); _mm_store_pd(C + 2,C2_vec);
  _mm_store_pd(C + 4,C3_vec); _mm_store_pd(C + 6,C4_vec);
  _mm_store_pd(C + 8,C5_vec); _mm_store_pd(C + 10,C6_vec);
  _mm_store_pd(C + 12,C7_vec); _mm_store_pd(C + 14,C8_vec);
  }
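
One way to reduce the tedium of manual unrolling mentioned above is to let the preprocessor repeat the loop body. The following is only a minimal sketch on a much simpler kernel than GEPDOT: the macros REPEAT_2 and REPEAT_4 and the function sum_unrolled4 are hypothetical names introduced for illustration, and the code assumes the element count is a multiple of 4.

```c
/* Hypothetical helper macros (not part of the original GEPDOT.c):
   REPEAT_4(x) pastes the statements in x four times. */
#define REPEAT_2(x) x x
#define REPEAT_4(x) REPEAT_2(x) REPEAT_2(x)

/* Sums n doubles. Assumption: n is a non-negative multiple of 4,
   because each pass of the while loop consumes four elements. */
double sum_unrolled4(const double *a, int n) {
  double s = 0.0;
  int i = 0;
  while (i < n) {
    /* The body is written once and expanded four times. */
    REPEAT_4(s += a[i]; i++;)
  }
  return s;
}
```

The same pattern could wrap the GEPDOT loop body, with the trip count divided by the repetition factor and any remainder handled separately. Whether this helps register allocation would have to be checked in the generated assembly.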


Update 1

Thanks to the comment by @user3386109, I would like to extend this question a little. @user3386109 raises a very good point. I do have some doubts about the compiler's ability to allocate registers optimally when there are so many parallel instructions to schedule.

I personally think that a reliable way is to first write the loop body (which is the performance-critical part) in asm inline assembly, then duplicate it as many times as I want. I had an unpopular post about inline assembly earlier this year: http://stackoverflow.com/questions/35189619/inline-assembly-in-c-assembler-messages-error-unknown-pseudo-op. The code there was a little different because the number of loop iterations, j, is a function argument and hence unknown at compile time. In that case I cannot fully unroll the loop, so I only duplicated the assembly code twice and converted the loop into a label and a jump. It turned out that my hand-written assembly performed about 5% better than the compiler-generated assembly, which might suggest that the compiler fails to allocate registers in the optimal manner we expect.

I was (and still am) a beginner in assembly coding, so that served as a good case study for me to learn a little x86 assembly. But in the long run I am not inclined to write a large proportion of GEPDOT in assembly. There are three main reasons:


  1. asm inline assembly has been criticized for not being portable, though I don't understand why. Perhaps because different machines have different registers clobbered?
  2. Compilers are also getting better. So I would still prefer algorithmic optimization and better C coding habits to help the compiler generate good output.
  3. The last reason is the most important. The number of iterations may not always be 58. I am developing a high-performance matrix factorization subroutine. For a cache blocking factor nb, the number of iterations would be (nb-2). I am not going to pass nb as a function argument, as I did in the earlier post. It is a machine-specific parameter that will be defined as a macro. So the number of iterations is known at compile time, but may vary from machine to machine. Imagine how much tedious work manual loop unrolling would be for a variety of nb values. So if there is a way to simply instruct the compiler to peel a loop, that would be great.
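
The third point, a trip count fixed by a macro, can be sketched as follows. This is a simplified illustration under stated assumptions, not the GEPDOT kernel itself: the macro NB, its default of 60, and the function dot_nb are hypothetical. The `#pragma GCC unroll` line is a loop-specific unrolling request that exists only in GCC 8 and later, so it is not available in the 4.8.4 release used in the question, and the guard simply skips it on older compilers.

```c
#ifndef NB
#define NB 60   /* hypothetical cache blocking factor; override with -DNB=... */
#endif

/* Dot product over the NB - 2 iterations described in the list above.
   The trip count is a compile-time constant, so the compiler is free
   to unroll the loop completely if its heuristics allow it. */
double dot_nb(const double *a, const double *b) {
  double acc = 0.0;
  int k;
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#pragma GCC unroll 128   /* a request to GCC 8+; ignored on older versions */
#endif
  for (k = 0; k < NB - 2; k++)
    acc += a[k] * b[k];
  return acc;
}
```

Building with a different blocking factor only needs a different -DNB=... value; the source stays the same. Whether the loop really ends up fully peeled should be confirmed in the generated assembly.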

I would very much appreciate it if you could also share some experience in producing a high-performance yet portable library.

Recommended answer

Try tweaking the optimization parameters:

gcc -O3 -funroll-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=100

This should do the trick.
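
To check what these parameters do on a given GCC version, a small stand-alone file can be compiled with and without them and the two assembly outputs compared. The sketch below is an assumption-laden illustration: the file name peel_test.c, the macro TRIPS, and the function scale_add are hypothetical, and the result may differ between GCC releases.

```c
/* peel_test.c: hypothetical file name, used only for this illustration.
 *
 * Compile both ways and compare the assembly:
 *   gcc -O3 -S peel_test.c -o plain.s
 *   gcc -O3 -funroll-loops --param max-completely-peeled-insns=1000
 *       --param max-completely-peel-times=100 -S peel_test.c -o peeled.s
 * (the second command is meant to be typed on one line)
 *
 * If the loop was peeled completely, peeled.s is expected to contain
 * straight-line code for it instead of a backward jump.
 */
#define TRIPS 58   /* same trip count as in the question */

/* c[k] += a[k] * b for a trip count known at compile time. */
void scale_add(double *c, const double *a, double b) {
  int k;
  for (k = 0; k < TRIPS; k++)
    c[k] += a[k] * b;
}
```

If the loop is still present in peeled.s, raising the two --param values further is one thing to try, since they cap the size of the peeled code and the number of peeled iterations.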
