在不同的优化级别在gcc / g ++的访问本地与全局变量的速度 [英] Speed of accessing local vs. global variables in gcc/g++ at different optimization levels

查看：123 发布时间：2016/8/18 22:26:01 c++ c optimization gcc g++

本文介绍了在不同的优化级别在gcc / g ++的访问本地与全局变量的速度的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我发现，在不同的gcc编译器优化级别访问本地或在循环中的全局变量时，给出完全不同的结果。这让我感到惊讶的原因是，如果访问一个类型的变量要比访问另一个更优化的，我想gcc的优化将利用这个事实。
来到这里的两个例子（在C ++中，但相应的C语言给出几乎相同的时序）：

I found that different compiler optimization levels in gcc give quite different results when accessing a local or a global variable in a loop. The reason this surprised me is that if access to one type of variable is more optimizable than access to another, I would think gcc optimization would exploit that fact. Here come two examples (in C++ but their C counterparts give practically the same timings):

    global = 0;
    for (int i = 0; i < SIZE; i++)
        global++;

它采用了全局变量全球长，与

    long tmp = 0;
    for (int i = 0; i < SIZE; i++)
        tmp++;
    global = tmp;

目前优化级别-O0定时基本上等于（如我所期望的），在-O1它有点快，但还是相等的，但是从-O2使用全局变量的版本要快得多（一个因子7左右）。

At optimization level -O0 the timing is essentially equal (as I would expect), at -O1 it is somewhat faster but still equal, but from -O2 the version using the global variable is much faster (a factor 7 or so).

在另一方面，在以下code片段哪里开始点尺寸的SIZE字节块

On the other hand, in the following code fragments where start points to a block of bytes of size SIZE:

    global = 0;
    for (const char* p = start; p < start + SIZE; p++)
        global += *p;

与

    long tmp = 0;
    for (const char* p = start; p < start + SIZE; p++)
        tmp += *p;
    global = tmp;

在这里，在-O0时序接近，尽管使用的局部变量的版本稍微快一些，这似乎不是太奇怪，因为也许它会被存储在寄存器中，而全球不会。然后在-O1和更高使用本地变量的版本是相当快（50％以上，或1.5倍）。正如前面所说，这让我吃惊，因为我会认为海合会这将是对我来说使用本地变量（在生成的优化code）分配给全局的以后容易。

Here at -O0 the timings are close, though the version using the local variable is slightly faster, which doesn't seem too surprising, as maybe it will be stored in a register, whereas global wouldn't. Then at -O1 and higher the version using a local variable is considerably faster (more than 50% or 1.5 times). As remarked before, this surprises me, because I would think that for gcc it would be as easy as for me to use a local variable (in the generated optimized code) to assign to the global one later on.

所以我的问题是：什么是关于全局和局部变量，使得该GCC只能执行某些优化，以一类，而不是其他。

So my question is: what is it about global and local variables that makes that gcc can only perform certain optimizations to one type, not the other?

有些细节可能会或可能不会是相关的：我用RHEL4运行两个单核处理器和4GB内存的机器上的gcc / G ++ 3.4.5版本。我用大小的值，这是一个preprocessor宏，是1000000000.字节在第二个例子中的块是动态分配的。

Some details that may or may not be relevant: I used gcc/g++ version 3.4.5 on a machine running RHEL4 with two single core processors and 4GB RAM. The value I used for SIZE, which is a preprocessor macro, was 1000000000. The block of bytes in the second example was dynamically allocated.

下面是用于优化级别0-4一些定时输出（在与上述相同的顺序）：

Here are some timing outputs for optimization levels 0 to 4 (in the same order as above):

$ ./st0
Result using global variable: 1000000000 in 2.213 seconds.
Result using local variable:  1000000000 in 2.210 seconds.
Result using global variable: 0 in 3.924 seconds.
Result using local variable:  0 in 3.710 seconds.
$ ./st1
Result using global variable: 1000000000 in 0.947 seconds.
Result using local variable:  1000000000 in 0.947 seconds.
Result using global variable: 0 in 2.135 seconds.
Result using local variable:  0 in 1.212 seconds.
$ ./st2
Result using global variable: 1000000000 in 0.022 seconds.
Result using local variable:  1000000000 in 0.552 seconds.
Result using global variable: 0 in 2.135 seconds.
Result using local variable:  0 in 1.227 seconds.
$ ./st3
Result using global variable: 1000000000 in 0.065 seconds.
Result using local variable:  1000000000 in 0.461 seconds.
Result using global variable: 0 in 2.453 seconds.
Result using local variable:  0 in 1.646 seconds.
$ ./st4
Result using global variable: 1000000000 in 0.063 seconds.
Result using local variable:  1000000000 in 0.468 seconds.
Result using global variable: 0 in 2.467 seconds.
Result using local variable:  0 in 1.663 seconds.

修改
这是为前两个片段与开关-O2，其中该差是最大的情况下所产生的组件。对于据我了解，它看起来像编译器中的一个bug：0x3b9aca00 SIZE是十六进制，0x80496dc必须是全球性的地址。
我用一个较新的编译器检查，这不会再发生。在第二对片段的区别在于然而类似

EDIT This is the generated assembly for the first two snippets with switch -O2, the case where the difference is largest. For as far as I understand, it looks like a bug in the compiler: 0x3b9aca00 is SIZE in hexadecimal, 0x80496dc must be the address of global. I checked with a newer compiler, and this doesn't happen anymore. The difference in the second pair of snippets is similar however.

    void global1()
    {
        int i;
        global = 0;
        for (i = 0; i < SIZE; i++)
            global++;
    }

    void local1()
    {
        int i;
        long tmp = 0;
        for (i = 0; i < SIZE; i++)
            tmp++;
        global = tmp;
    }

    080483d0 <global1>:
     80483d0:   55                      push   %ebp
     80483d1:   89 e5                   mov    %esp,%ebp
     80483d3:   c7 05 dc 96 04 08 00    movl   $0x0,0x80496dc
     80483da:   00 00 00 
     80483dd:   b8 ff c9 9a 3b          mov    $0x3b9ac9ff,%eax
     80483e2:   89 f6                   mov    %esi,%esi
     80483e4:   83 e8 19                sub    $0x19,%eax
     80483e7:   79 fb                   jns    80483e4 <global1+0x14>
     80483e9:   c7 05 dc 96 04 08 00    movl   $0x3b9aca00,0x80496dc
     80483f0:   ca 9a 3b 
     80483f3:   c9                      leave  
     80483f4:   c3                      ret    
     80483f5:   8d 76 00                lea    0x0(%esi),%esi

    080483f8 <local1>:
     80483f8:   55                      push   %ebp
     80483f9:   89 e5                   mov    %esp,%ebp
     80483fb:   b8 ff c9 9a 3b          mov    $0x3b9ac9ff,%eax
     8048400:   48                      dec    %eax
     8048401:   79 fd                   jns    8048400 <local1+0x8>
     8048403:   c7 05 dc 96 04 08 00    movl   $0x3b9aca00,0x80496dc
     804840a:   ca 9a 3b 
     804840d:   c9                      leave  
     804840e:   c3                      ret    
     804840f:   90                      nop

最后这里是采用-O3剩余的片段，现在GCC 4.3.3产生的code（虽然老版本似乎产生类似code）。它看起来像确实global2（..）编译为在循环，其中local2（..）使用一个寄存器的每次迭代访问全局存储器位置的功能。为什么使用寄存器反正GCC不会优化全球版本现在还不清楚给我。这只是一个缺乏的功能，否则真的会导致可执行文件的不可接受的行为？

Finally here is the code of the remaining snippets, now generated by gcc 4.3.3 using -O3 (though the old version seems to generate similar code). It looks like indeed global2(..) compiles to a function accessing the global memory location in every iteration of the loop, where local2(..) uses a register. It is still not clear to me why gcc wouldn't optimize the global version using a register anyway. Is this just a lacking feature, or would it really lead to unacceptable behaviour of the executable?

    void global2(const char* start)
    {
        const char* p;
        global = 0;
        for (p = start; p < start + SIZE; p++)
            global += *p;
    }

    void local2(const char* start)
    {
        const char* p;
        long tmp = 0;
        for (p = start; p < start + SIZE; p++)
            tmp += *p;
        global = tmp;
    }

    08048470 <global2>:
     8048470:   55                      push   %ebp
     8048471:   31 d2                   xor    %edx,%edx
     8048473:   89 e5                   mov    %esp,%ebp
     8048475:   8b 4d 08                mov    0x8(%ebp),%ecx
     8048478:   c7 05 24 a0 04 08 00    movl   $0x0,0x804a024
     804847f:   00 00 00 
     8048482:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
     8048488:   0f be 04 11             movsbl (%ecx,%edx,1),%eax
     804848c:   83 c2 01                add    $0x1,%edx
     804848f:   01 05 24 a0 04 08       add    %eax,0x804a024
     8048495:   81 fa 00 ca 9a 3b       cmp    $0x3b9aca00,%edx
     804849b:   75 eb                   jne    8048488 <global2+0x18>
     804849d:   5d                      pop    %ebp
     804849e:   c3                      ret    
     804849f:   90                      nop    

    080484a0 <local2>:
     80484a0:   55                      push   %ebp
     80484a1:   31 c9                   xor    %ecx,%ecx
     80484a3:   89 e5                   mov    %esp,%ebp
     80484a5:   31 d2                   xor    %edx,%edx
     80484a7:   53                      push   %ebx
     80484a8:   8b 5d 08                mov    0x8(%ebp),%ebx
     80484ab:   90                      nop    
     80484ac:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
     80484b0:   0f be 04 13             movsbl (%ebx,%edx,1),%eax
     80484b4:   83 c2 01                add    $0x1,%edx
     80484b7:   01 c1                   add    %eax,%ecx
     80484b9:   81 fa 00 ca 9a 3b       cmp    $0x3b9aca00,%edx
     80484bf:   75 ef                   jne    80484b0 <local2+0x10>
     80484c1:   5b                      pop    %ebx
     80484c2:   89 0d 24 a0 04 08       mov    %ecx,0x804a024
     80484c8:   5d                      pop    %ebp
     80484c9:   c3                      ret    
     80484ca:   8d b6 00 00 00 00       lea    0x0(%esi),%esi

感谢。

在不同的优化级别在gcc / g ++的访问本地与全局变量的速度 [英] Speed of accessing local vs. global variables in gcc/g++ at different optimization levels

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

在不同的优化级别在gcc / g ++的访问本地与全局变量的速度 [英] Speed of accessing local vs. global variables in gcc/g++ at different optimization levels

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭