For loop performance difference, and compiler optimization

Problem description




I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what happens when setting the optimization flags on.


Jerry Coffin's answer explained what happens when the optimization flags are set for this example. What remains unanswered is why superCalculationA runs slower than superCalculationB, when B performs one extra memory reference and one extra addition per iteration. Nemo's post shows the assembler output. I confirmed this by compiling with the -S flag on my PC, a 2.9 GHz Sandy Bridge (i5-2310) running Ubuntu 12.04 64-bit, as suggested by Matteo Italia.


I was experimenting with for-loops performance when I stumbled upon the following case.

I have the following code that does the same computation in two different ways.

#include <cstdint>
#include <chrono>
#include <cstdio>

using std::uint64_t;

uint64_t superCalculationA(int init, int end)
{
    uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

uint64_t superCalculationB(int init, int todo)
{
    uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}

int main()
{
    const uint64_t answer = 500000110500000000;

    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    double elapsed;

    std::printf("=====================================================\n");

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationA(111, 1000000111);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationB(111, 1000000000);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    if (ret1 == answer)
    {
        std::printf("The first method, i.e. superCalculationA, succeeded.\n");
    }
    if (ret2 == answer)
    {
        std::printf("The second method, i.e. superCalculationB, succeeded.\n");
    }

    std::printf("=====================================================\n");

    return 0;
}

Compiling this code with

g++ main.cpp -o output --std=c++11

leads to the following result:

=====================================================
Elapsed time: 2.859 s | 2859.441 ms | 2859440.968 us
Elapsed time: 2.204 s | 2204.059 ms | 2204059.262 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

My first question is: why is the second loop running 23% faster than the first?

On the other hand, if I compile the code with

g++ main.cpp -o output --std=c++11 -O1

the results improve a lot,

=====================================================
Elapsed time: 0.318 s | 317.773 ms | 317773.142 us
Elapsed time: 0.314 s | 314.429 ms | 314429.393 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

and the difference in time almost disappears.

But I could not believe my eyes when I set the -O2 flag,

g++ main.cpp -o output --std=c++11 -O2

and got this:

=====================================================
Elapsed time: 0.000 s | 0.000 ms | 0.328 us
Elapsed time: 0.000 s | 0.000 ms | 0.208 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

So, my second question is: what is the compiler doing when I set the -O1 and -O2 flags that leads to this gigantic performance improvement?

I checked the Optimize Options page of Using the GNU Compiler Collection (GCC), but that did not clarify things.


By the way, I am compiling this code with g++ (GCC) 4.9.1.


EDIT to confirm Basile Starynkevitch's assumption

I edited the code, now main looks like this:

int main(int argc, char **argv)
{
    int start = atoi(argv[1]);
    int end   = atoi(argv[2]);
    int delta = end - start + 1;

    std::chrono::time_point<std::chrono::high_resolution_clock> t_start, t_end;
    double elapsed;

    std::printf("=====================================================\n");

    t_start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationB(start, delta);
    t_end = std::chrono::high_resolution_clock::now();
    elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    t_start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationA(start, end);
    t_end = std::chrono::high_resolution_clock::now();
    elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    std::printf("Results were %s\n", (ret1 == ret2) ? "the same!" : "different!");
    std::printf("=====================================================\n");

    return 0;
}

These modifications really increased computation time, both for -O1 and -O2; both now give me around 620 ms. This proves that -O2 was really doing some computation at compile time.

I still do not understand what these flags do to improve performance; -Ofast does even better, at about 320 ms.

Also notice that I have changed the order in which functions A and B are called, to test Jerry Coffin's assumption. Compiling this code with no optimizer flags still gives me around 2.2 s for B and 2.8 s for A, so I figure it is not a cache thing. Just to reinforce that I am not talking about optimization in the first case (the one with no flags): I just want to know what makes the second loop run faster than the first.

Solution

EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.


It appears that the performance difference in the -O0 case is due to processor pipelining.

First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:

superCalculationA(int, int):
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)    # init
    movl    %esi, -24(%rbp)    # end
    movq    $0, -8(%rbp)       # total = 0
    movl    -20(%rbp), %eax    # copy init to register rax
    movl    %eax, -12(%rbp)    # i = [rax]
    jmp .L7
.L8:
    movl    -12(%rbp), %eax    # copy i to register rax
    cltq
    addq    %rax, -8(%rbp)     # total += [rax]
    addl    $1, -12(%rbp)      # i++
.L7:
    movl    -12(%rbp), %eax    # copy i to register rax
    cmpl    -24(%rbp), %eax    # [rax] < end
    jl  .L8
    movq    -8(%rbp), %rax
    popq    %rbp
    ret

superCalculationB(int, int):
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)    # init
    movl    %esi, -24(%rbp)    # todo
    movq    $0, -8(%rbp)       # total = 0
    movl    -20(%rbp), %eax    # copy init to register rax
    movl    %eax, -12(%rbp)    # i = [rax]
    jmp .L11
.L12:
    movl    -12(%rbp), %eax    # copy i to register rax
    cltq
    addq    %rax, -8(%rbp)     # total += [rax]
    addl    $1, -12(%rbp)      # i++
.L11:
    movl    -20(%rbp), %edx    # copy init to register rdx
    movl    -24(%rbp), %eax    # copy todo to register rax
    addl    %edx, %eax         # [rax] += [rdx]  (so [rax] = init+todo)
    cmpl    -12(%rbp), %eax    # i < [rax]
    jg  .L12
    movq    -8(%rbp), %rax
    popq    %rbp
    ret


In both functions, the stack layout looks like this:

Addr Content

24   end/todo
20   init
16   <empty>
12   i
08   total
04   
00   <base pointer>

(Note that total is a 64-bit int and so occupies two 4-byte slots.)

These are the key lines of superCalculationA():

    addl    $1, -12(%rbp)      # i++
.L7:
    movl    -12(%rbp), %eax    # copy i to register rax
    cmpl    -24(%rbp), %eax    # [rax] < end

The stack address -12(%rbp) (which holds the value of i) is written to in the addl instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA() to be slower than superCalculationB().

You might be curious why superCalculationB() doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA(), the comparison i<end is performed by reading i from a register, while in superCalculationB(), the comparison i<init+todo is performed by reading i from the stack.

To demonstrate that this is just an artifact, let's replace

for (int i = init; i < end; i++)

with

for (int i = init; end > i; i++)

in superCalculationA(). The generated assembly then looks the same, with just the following change to the key lines:

    addl    $1, -12(%rbp)      # i++
.L7:
    movl    -24(%rbp), %eax    # copy end to register rax
    cmpl    -12(%rbp), %eax    # i < [rax]

Now i is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:

=====================================================
Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3; in that case, the compiler orders the instructions so as to minimize pipeline stalls, and we don't need to worry about whether to write i < end or end > i.
