是gcc的ASM挥发性相当于gfortran默认设置递归? [英] Is gcc's asm volatile equivalent to the gfortran default setting for recursions?

查看:417
本文介绍了是gcc的ASM挥发性相当于gfortran默认设置递归?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我只是用递归函数玩弄于 C ++ Fortran语言,我意识到,一个简单的递归函数 Fortran语言几乎快两倍,它的等效 C ++ 功能。现在,进入这之前,我知道有类似的问题在这里,具体有:


  1. Why不加评论装配导致产生code这样彻底的改变?

  2. 工作的 ASM 挥发性( :内存)

  3. 相当于在汇编中gfortran
  4. 挥发性

不过,我有点更加具体和困惑的是,Fortran编译器似乎是在做什么,你可以实现 ASM挥发性 GCC 。为了给你一些背景,让我们考虑下面的递归 Fibonacci数实施

的Fortran code:

 模块的测试
隐无
私人的
公共FIB包含!斐波那契功能
整数递归函数FIB(N)结果(R)
    整数,意图(中)::ñ
    如果(正2)然后
        R =正
    其他
        R = FIB(N-1)+ FIB(N-2)
    万一
最终的功能!斐波那契函数结束
前端模块斐波纳契程序
使用测试,只:FIB
隐无
整数:: R,I
整数:: N = 1e09
实(8)::开始,结束,cum_timecum_time = 0
做I = 1,N
    调用CPU_TIME(开始)
    R = FIB(20)
    电话CPU_TIME(完成)
    cum_time = cum_time +(完成 - 开始)
    如果(cum_time> 0.5)退出
ENDDO打印*,我,'运行,平均时间是',cum_time / I / 1E-06,我们
程序结束

与编译:

  gfortran -O3 -march =本地

C ++ code:

 的#include<&iostream的GT;
#包括LT&;&计时GT;
使用命名空间std;//蛋白原功能
INT FIB(const int的N)
{
    INT R;
    如果(正2)
        R = N;
    其他
        R = FIB(N-1)+ FIB(N-2);
    返回ř;
} // FIB结束模板< typename的T,typename的参数... args>
双timeit(T(* FUNC)(参数...),参数数量参数... args)
{
    双计数器= 1.0;
    双mean_time = 0.0;
    为(自动ITER = 0;&ITER LT; 1e09 ++ ITER){
        的std ::时辰:: time_point<的std ::时辰:: SYSTEM_CLOCK>开始,结束;
        开始=的std ::时辰:: SYSTEM_CLOCK ::现在();        FUNC(参数...);        最终=的std ::时辰:: SYSTEM_CLOCK ::现在();
        的std ::时辰::持续时间LT;双> ELAPSED_SECONDS =年底启动;        mean_time + = elapsed_seconds.count();
        反++;        如果(mean_time> 0.5){
            mean_time / =计数器;
            性病::法院LT&;<的static_cast<长整型>(计数器)
            << 运行时,平均时间是
            << mean_time / 1.0E-06<< \\ XC2 \\ xB5s<<的std :: ENDL;
            打破;
        }
    }
    返回mean_time;
}诠释主(){
    timeit(FIB,20);
    返回0;
}

与编译:

  G ++ -O3 -march =本地

时间:

 的Fortran:24991运行时,平均时间为20.087我们
C ++:12355运行时,平均时间为40.471微秒

所以 gfortran 是两次相比, gcc作为快速有。看着组装code,我得到

大会(Fortran语言):

  .L28:
    CMPL $ 1,%r13d
    JLE .L29
    莱亚尔-8(%RBX),EAX%
    MOVL%ecx中,12(%RSP)
    MOVL%eax中,48(%RSP)
    leaq 48(%RSP),%RDI
    莱亚尔-9(%RBX),EAX%
    MOVL%eax中,16(%RSP)
    调用__bench_MOD_fib
    leaq 16(%RSP),%RDI
    MOVL%eax中,%r13d
    调用__bench_MOD_fib
    MOVL 12(%RSP),%ecx中
    ADDL%eax中,%r13d

组件(C ++):

  .L28:
    MOVL 72(%RSP),EDX%
    CMPL $ 1,EDX%
    MOVL%EDX,EAX%
    JLE .L33
    subl $ 3%EAX
    MOVL $ 0 52(%RSP)
    MOVL%EAX,ESI%
    MOVL%eax中,96(%RSP)
    MOVL 92(%RSP),EAX%
    SHRL%EAX
    MOVL%eax中,128(%RSP)
    ADDL%EAX,EAX%
    subl%EAX,ESI%
    MOVL%EDX,EAX%
    subl $ 1,%eax中
    MOVL%ESI,124(%RSP)
    MOVL%eax中,76(%RSP)

这两个组装codeS是由几乎相似的块/标签重复了一遍又一遍。正如你所看到的Fortran装配使得两次调用 FIB 的功能,而在C ++组件, GCC 可能已展开所有递归调用这可能需要更多的堆栈 PUSH / POP 和尾巴跳跃。

现在,如果我只是把一只内联汇编注释在C ++ code像这样

修改C ++ code:

  //蛋白原功能
INT FIB(const int的N)
{
    INT R;
    如果(正2)
        R = N;
    其他
        R = FIB(N-1)+ FIB(N-2);
    ASM();
    返回ř;
} // FIB结束

生成的程序集code,更改

大会(C ++修改):

  .L7:
    CMPL $ 1,EDX%
    JLE .L17
    莱亚尔-4(%RBX),%r13d
    莱亚尔-5(%RBX),EDX%
    CMPL $ 1,%r13d
    JLE .L19
    莱亚尔-5(%RBX),%r14d
    CMPL $ 1,%r14d
    JLE .L55
    莱亚尔-6(%RBX),%r13d
    MOVL%r13d,EDI%
    调用_Z3fibi
    莱亚尔-7(%RBX),EDI%
    MOVL%eax中,%r15d
    调用_Z3fibi
    MOVL%r13d,EDI%
    ADDL%eax中,%r15d

您现在可以看到两次调用 FIB 功能。时序他们给了我

时间:

 的Fortran:24991运行时,平均时间为20.087我们
C ++:25757运行时,平均时间为19.412微秒

我知道 ASM 的无输出效果和 ASM挥发性是燮preSS激进的编译器的优化但在这种情况下, GCC 认为这是太聪明,但最终产生在首位效率较低code。

所以,问题是


  • 为什么 GCC 看不到这种优化,当 gfortan 显然可以吗?

  • 内联组装线必须是return语句之前。把它放在其他地方,它不会有任何效果。为什么呢?

  • 这种行为是具体的编译器?例如,您可以模拟天生铿锵/ MSVC?
  • 相同的行为
  • 是否有更安全的方式,使递归更快 C C ++ (不依靠内联汇编或iteration-风格编码)?可变参数模板吧?

更新


  • 上面显示的结果都与 GCC 4.8.4 。我也试图与 GCC 4.9.2 GCC 5.2 ,我也得到相同的结果。
  • 编译它
  • 这个问题也可以复制(固定?)如果不是放在 ASM 我宣布输入参数为volatile即(挥发性INT N) 而不是(const int的N),虽然这将导致一个稍微有点慢运行时我的机器上。

  • 凯驰已经提到的,我们可以改为通过 -fno-优化同胞-calls 标记来解决这个问题。由于该标志是在 -O2 激活水平和超越,甚至与 -O1 编译修复这个问题。

  • 我已经运行在同样的例子铛3.5.1 -O3 -march =本地键,虽然情况是不相同的,似乎还生成 ASM 更快code。

锵时间:

 铛++ W / O的ASM:8846运行时,平均时间为56.4555微秒
铛++与ASM:10427运行时,平均时间为47.8991微秒


解决方案

请参阅接近这个答案如何让GCC所生成的快速程序结束的黑体字。阅读答复四个问题的答案。

您的第一个问题假设 gfortran 是能够看到一个优化的可能性, GCC 看不到。事实上,情况正好相反。 GCC 确定的东西,它被认为是一个优化的可能性,而 gfortran 错过了。唉, GCC 是错误的,它采用的优化原来是你的系统上的100%的速度损失(对矿井可比性)。

要解决你的第二个问题: ASM 语句prevented,使得内部改造 GCC 看假的优化方法可行。如果没有 ASM 语句,你的code得到了转化为(有效):

  INT FIB(const int的N)
{
    如果(正2)
        返回N;
    其他
        返回FIB(N-1)+ FIB(N-2);
}

包含递归调用return语句会触发兄弟叫优化那你pessimizes code。包括汇编语句prevents两端移动返回指令。

目前,我只手头的gcc,所以我不能尝试其他的编译器来回答你的证据的第三个问题的行为,但这似乎绝对依赖编译器。你碰到了一个怪癖(或错误,但你怎么称呼它)的GCC,它会产生不好的code,而试图去优化它。不同的编译器的优化是非常不同的,所以它很可能其他一些编译器不误优化code像 GCC 一样。而另一方面,$ C优化$ C的转换是一个精心研究的话题,大多数编译器正在实施类似的方法来优化,所以它可能是另一个编译步入同样的陷阱为 GCC

要解决你的最后一个问题:这是不是一个关于C / C ++与FORTAN问题,而是一个关于 GCC 这弄乱这个例子程序(并有可能类似的生产问题程式)。所以没有办法的做递归更快 C ++ 的,但有一种方法可以加快这个例子中的 GCC 的,通过禁用有问题的优化: -fno-优化同胞通话 这结果(我的系统,在一个单一的测试运行上)在更快的code不仅仅是插入 ASM 语句。

I was just playing around with recursive functions in C++ and Fortran and I realised that a simple recursive function in Fortran is almost twice as fast as its equivalent C++ function. Now, before getting into this, I know there are similar questions here, specifically:

  1. Why does adding assembly comments cause such radical change in generated code?
  2. Working of asm volatile ("" : : : "memory")
  3. Equivalent to asm volatile in gfortran

However, I am a bit more specific and puzzled as the Fortran compiler seems to be doing what you can achieve with asm volatile in gcc. To give you some context let's consider the following recursive Fibonacci number implementation:

Fortran Code:

module test
implicit none
private
public fib

contains

! Fibonacci function
integer recursive function fib(n) result(r)
    integer, intent(in) :: n
    if (n < 2) then
        r = n
    else
        r = fib(n-1) + fib(n-2)
    end if
end function  ! end of Fibonacci function
end module

program fibonacci
use test, only: fib
implicit none
integer :: r,i 
integer :: n = 1e09
real(8) :: start, finish, cum_time

cum_time=0
do i= 1,n 
    call cpu_time(start)
    r = fib(20)
    call cpu_time(finish) 
    cum_time = cum_time + (finish - start)
    if (cum_time >0.5) exit
enddo  

print*,i,'runs, average elapsed time is', cum_time/i/1e-06, 'us' 
end program

Compiled with:

gfortran -O3 -march=native

C++ Code:

#include <iostream>
#include <chrono>
using namespace std;

// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    return r;
} // end of fib

template<typename T, typename ... Args>
double timeit(T (*func)(Args...), Args...args)
{
    double counter = 1.0;
    double mean_time = 0.0;
    for (auto iter=0; iter<1e09; ++iter){
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();

        func(args...);

        end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;

        mean_time += elapsed_seconds.count();
        counter++;

        if (mean_time > 0.5){
            mean_time /= counter;
            std::cout << static_cast<long int>(counter)
            << " runs, average elapsed time is "
            << mean_time/1.0e-06 << " \xC2\xB5s" << std::endl; 
            break;
        }
    }
    return mean_time;
}

int main(){
    timeit(fib,20);
    return 0;
}

Compiled with:

g++ -O3 -march=native

Timing:

Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 12355 runs, average elapsed time is 40.471 µs

So gfortran is twice as fast there compared to gcc. Looking at the assembly code, I get

Assembly (Fortran):

.L28:
    cmpl    $1, %r13d
    jle .L29
    leal    -8(%rbx), %eax
    movl    %ecx, 12(%rsp)
    movl    %eax, 48(%rsp)
    leaq    48(%rsp), %rdi
    leal    -9(%rbx), %eax
    movl    %eax, 16(%rsp)
    call    __bench_MOD_fib
    leaq    16(%rsp), %rdi
    movl    %eax, %r13d
    call    __bench_MOD_fib
    movl    12(%rsp), %ecx
    addl    %eax, %r13d

Assembly (C++):

.L28:
    movl    72(%rsp), %edx
    cmpl    $1, %edx
    movl    %edx, %eax
    jle .L33
    subl    $3, %eax
    movl    $0, 52(%rsp)
    movl    %eax, %esi
    movl    %eax, 96(%rsp)
    movl    92(%rsp), %eax
    shrl    %eax
    movl    %eax, 128(%rsp)
    addl    %eax, %eax
    subl    %eax, %esi
    movl    %edx, %eax
    subl    $1, %eax
    movl    %esi, 124(%rsp)
    movl    %eax, 76(%rsp)

Both assembly codes are made up of almost similar blocks/labels repeated over and over again. As you can see the Fortran assembly makes two calls to fib function whereas in the C++ assembly, gcc has probably unrolled all the recursive calls which probably requires more stack push/pop and tail jumps.

Now if I just put one inline assembly comment in the C++ code like so

Modified C++ Code:

// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    asm("");
    return r;
} // end of fib

The generated assembly code, changes to

Assembly (C++ Modified):

.L7:
    cmpl    $1, %edx
    jle .L17
    leal    -4(%rbx), %r13d
    leal    -5(%rbx), %edx
    cmpl    $1, %r13d
    jle .L19
    leal    -5(%rbx), %r14d
    cmpl    $1, %r14d
    jle .L55
    leal    -6(%rbx), %r13d
    movl    %r13d, %edi
    call    _Z3fibi
    leal    -7(%rbx), %edi
    movl    %eax, %r15d
    call    _Z3fibi
    movl    %r13d, %edi
    addl    %eax, %r15d

You can now see two calls to fib function. Timing them gives me

Timing:

Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 25757 runs, average elapsed time is 19.412 µs

I know the effect of asm with no output and asm volatile is to suppress aggressive compiler optimisations but in this case, gcc thought it was too smart but ended up generating a less efficient code in the first place.

So the question is:

  • Why can gcc not see this "optimisation", when gfortan clearly can?
  • The inline assembly line has to be before the return statement. Put it elsewhere and it will have no effect. Why?
  • Is this behaviour compiler specific? For instance can you mimick the same behaviour with clang/MSVC?
  • Are there safer ways to make recursions faster in C or C++ (without relying on inline assembly or iteration-style coding)? Variadic templates perhaps?

UPDATE:

  • The result shown above are all with gcc 4.8.4. I have also tried compiling it with gcc 4.9.2 and gcc 5.2 and I get identical results.
  • The problem can also be replicated (fixed?) if instead of putting asm I declare the input argument as volatile i.e. (volatile int n) instead of (const int n), although this will result in a tad bit slower run-time on my machine.
  • As Michael Karcher has mentioned, we can instead pass the -fno-optimize-sibling-calls flag to fix this problem. Since this flag is activated at -O2 level and beyond, even compiling with -O1 fixes this problem.
  • I have run the same example with clang 3.5.1 with -O3 -march=native and though the situation is not identical, clang also appears to generate faster code with asm.

Clang Timing:

clang++ w/o asm    :  8846 runs, average elapsed time is 56.4555 µs
clang++ with asm   : 10427 runs, average elapsed time is 47.8991 µs 

解决方案

See the bold print near the end of this answer on how to get a fast program generated by gcc. Read the answer for replies to the four questions.

Your first question assumes that gfortran is able to see an optimization possibility that gcc failed to see. In fact, the opposite is true. gcc identified something that it considered to be an optimization possibility, while gfortran missed it. Alas, gcc was wrong and the optimization it applied turns out to be a 100% speed loss on your system (comparable on mine).

To address your second question: The asm statement prevented an internal transformation that made gcc see the false optimization possiblity. Without the asm statement, your code got (validly) transformed to:

int fib(const int n)
{
    if (n < 2)
        return n;
    else
        return fib(n-1) + fib(n-2);
}

The return statement containing the recursive call triggers the "sibling calls optimization" that pessimizes your code. Including the asm statement prevents moving the return instruction across it.

Currently, I only have gcc at hand, so I can't try the behaviour of other compilers to answer your third question by evidence, but this seems definitely compiler dependent. You ran into a quirk (or bug, however you call it) of gcc that it generates bad code while trying to optimize it. The optimizers of different compilers are highly different, so it is quite likely that some other compiler doesn't mis-optimize your code like gcc does. On the other hand, code transformations for optimization is a well-researched topic, and most compilers are implementing similar approaches to optimization, so it is possible that another compiler steps into the same trap as gcc.

To address you last question: It is not a problem about C/C++ versus Fortan, but a problem about gcc that messes up this example program (and likely similar production programs). So there is no way to make recursion faster in C++, but there is a way to speed up this example in gcc, by disabling the problematic optimization: -fno-optimize-sibling-calls, which results (on my system, in a single test run) in even faster code than just inserting the asm statement.

这篇关于是gcc的ASM挥发性相当于gfortran默认设置递归?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆