On improving Haskell's performance compared to C in a Fibonacci micro-benchmark


Question



I came across this question, which compared the performance of various compilers on computing Fibonacci numbers the naive way.

I tried doing this with Haskell to see how it compares to C.

C code:

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  return fib (n-1) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Result:

> gcc -O3 main.c -o fib
> time ./fib 40
165580141
real    0m0.421s
user    0m0.420s
sys 0m0.000s

Haskell:

module Main where
import System.Environment (getArgs)

fib :: Int -> Int
fib n | n < 2 = 1
      | otherwise = fib (n-1) + fib (n-2)

main = getArgs >>= print . fib . read . head

Result:

> ghc -O3 -fllvm -optlo-O3 Main.hs -o fib
> time ./fib 40
165580141
real    0m1.476s
user    0m1.476s
sys 0m0.000s

Profiling with

> ghc -O3 -fllvm -optlo-O3 -prof -auto-all -caf-all -rtsopts Main.hs -fforce-recomp -o fib
> ./fib 40 +RTS -prof

reveals that fib takes 100% of both time and allocation, no surprise. I also took some heap profiles, but I don't know what they imply:

> ./fib 40 +RTS -hc

> ./fib 40 +RTS -hd

So my question: Is there anything I can do from my side to make this Haskell program's performance closer to C, or is this just the way GHC does things, which happens to make it slower in this micro-benchmark? (I'm not asking for an asymptotically faster algorithm to compute fibs.)

Thank you very much.

[EDIT]

It turned out that ghc -O3 was faster than ghc -O3 -fllvm -optlo-O3 in this case. But optlo-block-placement made an observable difference for the LLVM backend:

> ghc -O3 Main.hs -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.283s
user    0m1.284s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.449s
user    0m1.448s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -optlo-block-placement -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.112s
user    0m1.096s
sys 0m0.016s

The reason I wanted to investigate this is that both C and OCaml were significantly faster than Haskell for this program. I sort of couldn't accept that and wanted to learn more to make sure I had already done everything I could :D

> ocamlopt main.ml -o fib
> time ./fib 40
165580141
real    0m0.668s
user    0m0.660s
sys 0m0.008s

Solution

The heap profile is not very interesting here, as GHC compiles fib into a function that operates solely on the stack. Just look at the profile... Only 800 bytes are ever allocated, the small overhead of your main implementation.
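
The reason so little is allocated is that GHC's strictness analysis and worker/wrapper transformation derive a worker ($wfib, which shows up z-encoded as Main_zdwfib in the assembly below) that recurses directly on unboxed machine integers. Purely as an illustration, here is a hand-written sketch of roughly what that worker corresponds to; it uses GHC.Exts primitives and illustrative names, and it is not GHC's actual output:

{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int#, Int (I#), (+#), (-#), (<#), isTrue#)
import System.Environment (getArgs)

-- Hand-written approximation of the worker GHC derives: it recurses
-- on raw Int# values, so no Int boxes are allocated in the hot loop.
wfib :: Int# -> Int#
wfib n
  | isTrue# (n <# 2#) = 1#
  | otherwise         = wfib (n -# 1#) +# wfib (n -# 2#)

-- The wrapper just unboxes the argument and re-boxes the result.
fib :: Int -> Int
fib (I# n) = I# (wfib n)

main :: IO ()
main = getArgs >>= print . fib . read . head

Compiling this should give you essentially the same Core, the same assembly and the same timings as your original definition, which is exactly the point: there is nothing left to recover by rewriting the Haskell source in this direction.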

As far as GHC's Core level is concerned, this actually gets optimized as far as it can get. The low-level code generation is another matter, though. Let us have a quick dive into the code GHC generates:

_Main_zdwfib_info:
.Lc1RK:
    leal -8(%ebp),%eax
    cmpl 84(%ebx),%eax
    jb .Lc1RM

This is the check for stack space, something C probably doesn't need, as it can let the operating system handle stack allocation. Haskell has user-level threads, so stack space is managed manually.

    cmpl $2,0(%ebp)
    jl .Lc1RO

The comparison against 2 from your code.

    movl 0(%ebp),%eax
    decl %eax

The parameter is reloaded from the stack and decremented to get the parameter for the recursive call. The reload is probably unnecessary - not sure it makes a difference though.

    movl %eax,-8(%ebp)
    movl $_s1Qp_info,-4(%ebp)
    addl $-8,%ebp
    jmp _Main_zdwfib_info

The parameter and the return address are pushed on top of the stack, and we jump directly to the label in order to recurse.

.Lc1RO:
    movl $1,%esi
    addl $4,%ebp
    jmp *0(%ebp)

The code for the case that the parameter is less than 2. The return value is passed in a register.

Bottom line: everything seems to be working as it should, and it's unlikely you will be able to squeeze much more out of this by changing the program. The custom stack check is an obvious source of slowdown, though I'm not sure it can be blamed for the full time difference.
