On improving Haskell's performance compared to C in a Fibonacci micro-benchmark


Question



I came across this question, which compared the performance of various compilers on computing Fibonacci numbers the naive way.

I tried doing this with Haskell to see how it compares to C.

C code:

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  return fib (n-1) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Result:

> gcc -O3 main.c -o fib
> time ./fib 40
165580141
real    0m0.421s
user    0m0.420s
sys 0m0.000s

Haskell:

module Main where
import System.Environment (getArgs)

fib :: Int -> Int
fib n | n < 2 = 1
      | otherwise = fib (n-1) + fib (n-2)

main = getArgs >>= print . fib . read . head

Result:

> ghc -O3 -fllvm -optlo-O3 Main.hs -o fib
> time ./fib 40
165580141
real    0m1.476s
user    0m1.476s
sys 0m0.000s

Profiling with

> ghc -O3 -fllvm -optlo-O3 -prof -auto-all -caf-all -rtsopts Main.hs -fforce-recomp -o fib
> ./fib 40 +RTS -prof

reveals that fib takes 100% of both time and allocation, no surprise. I also took some heap profiles, but I don't know what they imply:

> ./fib 40 +RTS -hc

> ./fib 40 +RTS -hd

So my question: Is there anything I can do from my side to make this Haskell program's performance closer to C, or is this just the way GHC does things, which happens to make it slower in this micro-benchmark? (I'm not asking for an asymptotically faster algorithm to compute fibs.)

Thank you very much.

[EDIT]

It turned out that ghc -O3 was faster than ghc -O3 -fllvm -optlo-O3 in this case. But optlo-block-placement made an observable difference for the LLVM backend:

> ghc -O3 Main.hs -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.283s
user    0m1.284s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.449s
user    0m1.448s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -optlo-block-placement -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.112s
user    0m1.096s
sys 0m0.016s

The reason I wanted to investigate this is that both C and OCaml were significantly faster than Haskell for this program. I sort of couldn't accept that and wanted to learn more to make sure I had already done everything I could :D

> ocamlopt main.ml -o fib
> time ./fib 40
165580141
real    0m0.668s
user    0m0.660s
sys 0m0.008s

Solution

The heap profile is not very interesting here, as GHC compiles fib into a function that operates solely on the stack. Just look at the profile... Only 800 bytes are ever allocated, the small overhead of your main implementation.
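
The reason so little is allocated is that GHC's strictness analysis and worker/wrapper transformation derive a worker ($wfib, which shows up z-encoded as Main_zdwfib in the assembly below) that recurses directly on unboxed machine integers. Purely as an illustration, here is a hand-written sketch of roughly what that worker corresponds to; it uses GHC.Exts primitives and illustrative names, and it is not GHC's actual output:

{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int#, Int (I#), (+#), (-#), (<#), isTrue#)
import System.Environment (getArgs)

-- Hand-written approximation of the worker GHC derives: it recurses
-- on raw Int# values, so no Int boxes are allocated in the hot loop.
wfib :: Int# -> Int#
wfib n
  | isTrue# (n <# 2#) = 1#
  | otherwise         = wfib (n -# 1#) +# wfib (n -# 2#)

-- The wrapper just unboxes the argument and re-boxes the result.
fib :: Int -> Int
fib (I# n) = I# (wfib n)

main :: IO ()
main = getArgs >>= print . fib . read . head

Compiling this should give you essentially the same Core, the same assembly and the same timings as your original definition, which is exactly the point: there is nothing left to recover by rewriting the Haskell source in this direction.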

As far as GHC's Core level is concerned, this actually gets optimized as far as it can get. The low-level code generation is another matter, though. Let us have a quick dive into the code GHC generates:

_Main_zdwfib_info:
.Lc1RK:
    leal -8(%ebp),%eax
    cmpl 84(%ebx),%eax
    jb .Lc1RM

This is the check for stack space, something C probably doesn't need, as it can let the operating system handle stack allocation. Haskell has user-level threads, so stack space is managed manually.

    cmpl $2,0(%ebp)
    jl .Lc1RO

The comparison against 2 from your code.

    movl 0(%ebp),%eax
    decl %eax

The parameter is reloaded from the stack and decremented to get the parameter for the recursive call. The reload is probably unnecessary - not sure it makes a difference though.

    movl %eax,-8(%ebp)
    movl $_s1Qp_info,-4(%ebp)
    addl $-8,%ebp
    jmp _Main_zdwfib_info

The parameter and the return address are pushed on top of the stack, and we jump directly to the label in order to recurse.

.Lc1RO:
    movl $1,%esi
    addl $4,%ebp
    jmp *0(%ebp)

The code for the case that the parameter is less than 2. The return value is passed in a register.

Bottom line: everything seems to be working as it should, and it's unlikely you will be able to squeeze much more out of this by changing the program. The custom stack check is an obvious source of slowdown, though I'm not sure it can be blamed for the full time difference.
