Assembly code redundancy in optimized C code


Question

I'm trying to learn about vectorization by studying simple C code compiled with gcc at -O3. More specifically, I want to see how well compilers vectorize. It is a personal journey towards being able to verify gcc -O3 performance on more complex computations. I understand that conventional wisdom says compilers are better than people, but I never take such wisdom for granted.

In my first simple test, though, I'm finding some of the choices gcc makes quite strange and, quite honestly, grossly negligent in terms of optimization. I'm willing to assume that the compiler is being purposeful and knows something about the CPU (an Intel i5-2557M in this case) that I do not. But I need some confirmation from knowledgeable people.

My simple test code (segment) is:

int i;
float a[100];

for (i=0;i<100;i++) a[i]= (float) i*i;

The resulting assembly code (segment) that corresponds to the for-loop is as follows:

.L6:                        ; loop starts here
    movdqa  xmm0, xmm1      ; copy packed integers in xmm1 to xmm0
.L3:
    movdqa  xmm1, xmm0      ; wait, what!?  WHY!?  this is redundant.
    cvtdq2ps    xmm0, xmm0  ; convert integers to float
    add rax, 16             ; increment memory pointer for next iteration
    mulps   xmm0, xmm0      ; square all four packed floats in xmm0
    paddd   xmm1, xmm2      ; add 4 to each packed integer index
    movaps  XMMWORD PTR [rax-16], xmm0   ; store result 
    cmp rax, rdx            ; test loop termination
    jne .L6                 

I understand all the steps, and computationally, all of it makes sense. What I don't understand, though, is why gcc chooses to include in the loop a step that loads xmm1 from xmm0 right after xmm0 was loaded from xmm1, i.e.

 .L6:
        movdqa  xmm0, xmm1      ; loop starts here
 .L3:
        movdqa  xmm1, xmm0      ; grrr! 

This alone makes me question the sanity of the optimizer. Obviously, the extra MOVDQA does not disturb the data, but at face value, it would seem grossly negligent on the part of gcc.

Earlier in the assembly code (not shown), xmm0 and xmm2 are initialized to values meaningful for vectorization, so obviously, on entry to the loop, the code has to skip the first MOVDQA. But why doesn't gcc simply rearrange the code like this?

.L3:
        movdqa  xmm1, xmm0     ; initialize xmm1 PRIOR to loop
.L6:
        movdqa  xmm0, xmm1     ; loop starts here 

Or even better, simply initialize xmm1 instead of xmm0 and drop the MOVDQA xmm1, xmm0 step altogether!

I am prepared to believe that the CPU is smart enough to skip the redundant step or something like that, but how can I trust gcc to fully optimize complex code if it can't even get this simple code right? Or can someone provide a sound explanation that would give me faith that gcc -O3 is good stuff?

Recommended answer

I'm not 100% sure, but it looks like your loop destroys xmm0 by converting it to float, so you have to keep the integer value in xmm1 and then copy it over to another register (in this case xmm0).

Whilst compilers are known to sometimes issue unnecessary instructions, I can't really see how this is the case in this instance.
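To make the register roles a bit more concrete, here is a rough SSE2 intrinsics sketch of the same loop. The function name `fill_squares_sse2` and the requirement that `n` be a multiple of 4 are my own assumptions for illustration, not something taken from the compiler output. The point is that the integer index vector has to outlive the int-to-float conversion, because it is incremented and reused on the next iteration. Whether that costs an extra register copy is up to the register allocator.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: fills a[0..n-1] with (float)i * i, four lanes at a time.
   Assumes n is a non-negative multiple of 4. */
void fill_squares_sse2(float *a, int n)
{
    assert(n >= 0 && n % 4 == 0);

    __m128i idx = _mm_setr_epi32(0, 1, 2, 3);  /* integer indices (the role xmm1 plays) */
    const __m128i step = _mm_set1_epi32(4);    /* per-iteration increment (the role xmm2 plays) */

    for (int i = 0; i < n; i += 4) {
        __m128 f = _mm_cvtepi32_ps(idx);       /* cvtdq2ps: convert to float, idx stays intact */
        f = _mm_mul_ps(f, f);                  /* mulps: square each lane */
        _mm_storeu_ps(a + i, f);               /* unaligned store so any pointer works;
                                                  the compiler's loop used aligned movaps */
        idx = _mm_add_epi32(idx, step);        /* paddd: advance the indices by 4 */
    }
}
```

Calling `fill_squares_sse2(a, 100)` on a `float a[100]` should give the same values as the scalar loop in the question.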

If you want xmm0 (or xmm1) to remain an integer, then don't cast the first i to float. Perhaps what you wanted to do is:

 for (i=0;i<100;i++) 
    a[i]= (float)(i*i);
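For what it's worth, the two spellings differ only in where the conversion happens: the cast binds tighter than the multiplication, so `(float) i*i` means `((float)i) * i`. Here is a small sketch for checking that they agree over the range used in the question. The helper names are made up for illustration, and the bound in the assert assumes a 32-bit int.

```c
#include <assert.h>

/* (float) i*i parses as ((float)i) * i: convert first, then multiply as float. */
float square_cast_first(int i) { return (float) i * i; }

/* (float)(i*i): multiply as int, then convert the product to float. */
float square_cast_last(int i) { return (float)(i * i); }

/* Hypothetical helper: counts the indices in [0, n) where the two forms differ. */
int count_mismatches(int n)
{
    assert(n >= 0 && n <= 46340);  /* keeps i*i inside int range (assumes 32-bit int) */

    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        if (square_cast_first(i) != square_cast_last(i))
            mismatches++;
    }
    return mismatches;
}
```

With `n = 100`, as in the question, `count_mismatches` returns 0, so the choice between the two forms only affects which instructions the compiler picks, not the values stored in the array.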

But on the other hand, gcc 4.9.2 doesn't seem to emit the extra movdqa for the original code:

g++ -S -O3 floop.cpp

.L2:
    cvtdq2ps    %xmm1, %xmm0
    mulps   %xmm0, %xmm0
    addq    $16, %rax
    paddd   %xmm2, %xmm1
    movaps  %xmm0, -16(%rax)
    cmpq    %rbp, %rax
    jne .L2

Nor does clang (3.7.0 from about 3 weeks ago):

 clang++ -S -O3 floop.cpp


    movdqa  .LCPI0_0(%rip), %xmm0   # xmm0 = [0,1,2,3]
    xorl    %eax, %eax
    .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movd    %eax, %xmm1
    pshufd  $0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
    paddd   %xmm0, %xmm1
    cvtdq2ps    %xmm1, %xmm1
    mulps   %xmm1, %xmm1
    movaps  %xmm1, (%rsp,%rax,4)
    addq    $4, %rax
    cmpq    $100, %rax
    jne .LBB0_1
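As far as I can tell, clang takes a different route here: instead of carrying a vector of indices from one iteration to the next, it rebuilds the indices every time by broadcasting the scalar counter and adding the constant [0,1,2,3]. A rough intrinsics sketch of that idea follows. The function name `fill_squares_broadcast` and the multiple-of-4 assumption are mine, not clang's.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: same result as the scalar loop, but the index vector
   is recomputed from the scalar counter on every iteration.
   Assumes n is a non-negative multiple of 4. */
void fill_squares_broadcast(float *a, int n)
{
    assert(n >= 0 && n % 4 == 0);

    const __m128i lanes = _mm_setr_epi32(0, 1, 2, 3);  /* the [0,1,2,3] constant */

    for (int i = 0; i < n; i += 4) {
        /* movd + pshufd broadcast i to all lanes, paddd adds the lane offsets */
        __m128i idx = _mm_add_epi32(_mm_set1_epi32(i), lanes);
        __m128 f = _mm_cvtepi32_ps(idx);               /* cvtdq2ps */
        f = _mm_mul_ps(f, f);                          /* mulps */
        _mm_storeu_ps(a + i, f);                       /* unaligned store; clang used movaps */
    }
}
```

This has no loop-carried vector register at all, at the price of a few more instructions per iteration.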

The code that I compiled:

extern int printf(const char *, ...);

int main()
{
    int i;
    float a[100];

    for (i=0;i<100;i++)
        a[i]= (float) i*i;

    for (i=0; i < 100; i++)
        printf("%f\n", a[i]);
}

(I added the printf to stop the compiler from getting rid of ALL the code.)
