Speed of accessing local vs. global variables in gcc/g++ at different optimization levels

Question

I found that different gcc optimization levels give quite different results when accessing a local or a global variable in a loop. This surprised me: if access to one kind of variable is more optimizable than access to the other, I would expect gcc's optimizer to exploit that fact. Here are two examples (in C++, but their C counterparts give practically the same timings):

    global = 0;
    for (int i = 0; i < SIZE; i++)
        global++;

This uses a global variable global of type long. Compare it with:

    long tmp = 0;
    for (int i = 0; i < SIZE; i++)
        tmp++;
    global = tmp;

At optimization level -O0 the timings are essentially equal (as I would expect). At -O1 both versions are somewhat faster but still equal. From -O2 on, however, the version using the global variable is much faster (by a factor of 7 or so at -O3, and by even more at -O2).

On the other hand, consider the following code fragments, where start points to a block of SIZE bytes:

    global = 0;
    for (const char* p = start; p < start + SIZE; p++)
        global += *p;

    long tmp = 0;
    for (const char* p = start; p < start + SIZE; p++)
        tmp += *p;
    global = tmp;

Here the timings at -O0 are close, though the version using the local variable is slightly faster. That doesn't seem too surprising, since the local may be kept in a register whereas global would not be. At -O1 and higher, the version using the local variable is considerably faster (roughly 1.5 times or more). As remarked before, this surprises me, because I would think it would be as easy for gcc as it is for me to use a local variable in the generated optimized code and assign it to the global one afterwards.

So my question is: what is it about global and local variables that lets gcc perform certain optimizations on one kind but not on the other?

Some details that may or may not be relevant: I used gcc/g++ version 3.4.5 on a machine running RHEL4 with two single-core processors and 4GB RAM. The value I used for SIZE, which is a preprocessor macro, was 1000000000. The block of bytes in the second example was dynamically allocated.

Here are some timing outputs for optimization levels 0 to 4 (in the same order as above):

    $ ./st0
    Result using global variable: 1000000000 in 2.213 seconds.
    Result using local variable:  1000000000 in 2.210 seconds.
    Result using global variable: 0 in 3.924 seconds.
    Result using local variable:  0 in 3.710 seconds.
    $ ./st1
    Result using global variable: 1000000000 in 0.947 seconds.
    Result using local variable:  1000000000 in 0.947 seconds.
    Result using global variable: 0 in 2.135 seconds.
    Result using local variable:  0 in 1.212 seconds.
    $ ./st2
    Result using global variable: 1000000000 in 0.022 seconds.
    Result using local variable:  1000000000 in 0.552 seconds.
    Result using global variable: 0 in 2.135 seconds.
    Result using local variable:  0 in 1.227 seconds.
    $ ./st3
    Result using global variable: 1000000000 in 0.065 seconds.
    Result using local variable:  1000000000 in 0.461 seconds.
    Result using global variable: 0 in 2.453 seconds.
    Result using local variable:  0 in 1.646 seconds.
    $ ./st4
    Result using global variable: 1000000000 in 0.063 seconds.
    Result using local variable:  1000000000 in 0.468 seconds.
    Result using global variable: 0 in 2.467 seconds.
    Result using local variable:  0 in 1.663 seconds.

EDIT: This is the generated assembly for the first pair of snippets with -O2, the case where the difference is largest. As far as I understand it, this looks like a bug in the compiler: 0x3b9aca00 is SIZE in hexadecimal, and 0x80496dc must be the address of global. I checked with a newer compiler, and this no longer happens. The difference in the second pair of snippets is similar, however.

    void global1()
    {
        int i;
        global = 0;
        for (i = 0; i < SIZE; i++)
            global++;
    }

    void local1()
    {
        int i;
        long tmp = 0;
        for (i = 0; i < SIZE; i++)
            tmp++;
        global = tmp;
    }

    080483d0 <global1>:
     80483d0:   55                      push   %ebp
     80483d1:   89 e5                   mov    %esp,%ebp
     80483d3:   c7 05 dc 96 04 08 00    movl   $0x0,0x80496dc
     80483da:   00 00 00 
     80483dd:   b8 ff c9 9a 3b          mov    $0x3b9ac9ff,%eax
     80483e2:   89 f6                   mov    %esi,%esi
     80483e4:   83 e8 19                sub    $0x19,%eax
     80483e7:   79 fb                   jns    80483e4 <global1+0x14>
     80483e9:   c7 05 dc 96 04 08 00    movl   $0x3b9aca00,0x80496dc
     80483f0:   ca 9a 3b 
     80483f3:   c9                      leave  
     80483f4:   c3                      ret    
     80483f5:   8d 76 00                lea    0x0(%esi),%esi

    080483f8 <local1>:
     80483f8:   55                      push   %ebp
     80483f9:   89 e5                   mov    %esp,%ebp
     80483fb:   b8 ff c9 9a 3b          mov    $0x3b9ac9ff,%eax
     8048400:   48                      dec    %eax
     8048401:   79 fd                   jns    8048400 <local1+0x8>
     8048403:   c7 05 dc 96 04 08 00    movl   $0x3b9aca00,0x80496dc
     804840a:   ca 9a 3b 
     804840d:   c9                      leave  
     804840e:   c3                      ret    
     804840f:   90                      nop    

Finally, here is the code of the remaining snippets, now generated by gcc 4.3.3 using -O3 (though the old version seems to generate similar code). It does look like global2(..) compiles to a function that accesses the global memory location in every iteration of the loop, whereas local2(..) uses a register. It is still not clear to me why gcc wouldn't optimize the global version to use a register anyway. Is this just a missing optimization, or would it really lead to unacceptable behaviour of the executable?

    void global2(const char* start)
    {
        const char* p;
        global = 0;
        for (p = start; p < start + SIZE; p++)
            global += *p;
    }

    void local2(const char* start)
    {
        const char* p;
        long tmp = 0;
        for (p = start; p < start + SIZE; p++)
            tmp += *p;
        global = tmp;
    }

    08048470 <global2>:
     8048470:   55                      push   %ebp
     8048471:   31 d2                   xor    %edx,%edx
     8048473:   89 e5                   mov    %esp,%ebp
     8048475:   8b 4d 08                mov    0x8(%ebp),%ecx
     8048478:   c7 05 24 a0 04 08 00    movl   $0x0,0x804a024
     804847f:   00 00 00 
     8048482:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
     8048488:   0f be 04 11             movsbl (%ecx,%edx,1),%eax
     804848c:   83 c2 01                add    $0x1,%edx
     804848f:   01 05 24 a0 04 08       add    %eax,0x804a024
     8048495:   81 fa 00 ca 9a 3b       cmp    $0x3b9aca00,%edx
     804849b:   75 eb                   jne    8048488 <global2+0x18>
     804849d:   5d                      pop    %ebp
     804849e:   c3                      ret    
     804849f:   90                      nop    

    080484a0 <local2>:
     80484a0:   55                      push   %ebp
     80484a1:   31 c9                   xor    %ecx,%ecx
     80484a3:   89 e5                   mov    %esp,%ebp
     80484a5:   31 d2                   xor    %edx,%edx
     80484a7:   53                      push   %ebx
     80484a8:   8b 5d 08                mov    0x8(%ebp),%ebx
     80484ab:   90                      nop    
     80484ac:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
     80484b0:   0f be 04 13             movsbl (%ebx,%edx,1),%eax
     80484b4:   83 c2 01                add    $0x1,%edx
     80484b7:   01 c1                   add    %eax,%ecx
     80484b9:   81 fa 00 ca 9a 3b       cmp    $0x3b9aca00,%edx
     80484bf:   75 ef                   jne    80484b0 <local2+0x10>
     80484c1:   5b                      pop    %ebx
     80484c2:   89 0d 24 a0 04 08       mov    %ecx,0x804a024
     80484c8:   5d                      pop    %ebp
     80484c9:   c3                      ret    
     80484ca:   8d b6 00 00 00 00       lea    0x0(%esi),%esi

Thanks.

Recommended answer

A local variable tmp whose address is not taken cannot be pointed to by the pointer p, and the compiler can optimize accordingly. It is much more difficult to infer that a global variable global is not pointed to, unless it's static, because the address of that global variable could be taken in another compilation unit and passed around.

If reading the assembly indicates that the compiler forces itself to load from memory more often than you would expect, and you know that the aliasing it worries about cannot exist in practice, you can help it by copying the global variable into a local variable at the top of the function and using only the local in the rest of the function.

Finally, note that if pointer p had been of another type, the compiler could have invoked "strict aliasing rules" to optimize regardless of its inability to infer that p does not point to global. But because lvalues of type char are often used to observe the representation of other types, there is an allowance for this kind of alias, and the compiler cannot take this shortcut in your example.
