Haskell:为什么Int的性能比Word64差,为什么我的程序比C慢得多? [英] Haskell: Why does Int perform worse than Word64, and why is my program far slower than C?

查看:100
本文介绍了Haskell:为什么Int的性能比Word64差,为什么我的程序比C慢得多?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在阅读一篇> Haskell与Collat​​z玩游戏的速度有多慢的文章猜想,它基本上说明如果您继续乘以3并在奇数上加一,或者将偶数除以2,则最终会得到一个.例如3-> 10-> 5-> 16-> 8-> 4-> 2-> 1.

I was reading an article of how slow Haskell is in playing with Collatz conjecture, which basically states if you keep multiplying by three and adding one to an odd number, or dividing an even one with two, you will eventually get one. For instance, 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1.

本文给出的程序是在给定范围内计算最长的Collat​​z序列. C版本是:

The program given in this article is to calculate the longest Collatz sequence in a given range. The C version is:

#include <stdio.h>

int main(int argc, char **argv) {
   int max_a0 = atoi(argv[1]);
   int longest = 0, max_len = 0;
   int a0, len;
   unsigned long a;

   for (a0 = 1; a0 <= max_a0; a0++) {
      a = a0;
      len = 0;

      while (a != 1) {
         len++;
         a = ((a%2==0)? a : 3*a+1)/2;
      }

      if (len > max_len) {
         max_len = len;
         longest = a0;
      }
   }
   printf("(%d, %d)\n", max_len, longest);
   return 0;
}

使用Clang O2进行编译,它在我的计算机上运行0.2s.

Compiling with Clang O2, it runs on my computer for 0.2s.

该文章中给出的Haskell版本显式地将整个序列生成为列表,然后计算中间列表的长度.它比C版本慢10倍.但是,由于作者使用LLVM作为我尚未安装的后端,因此无法重现此内容.使用GHC 7.8和默认后端,它可以在Mac上运行10秒,比C版本慢50倍.

The Haskell version given in that article generates the whole sequence as a list explicitly, and then calculate the length of the intermediate list. It is 10x slower than the C version. However, since the author used LLVM as a backend, which I haven't installed, I couldn't reproduce this. Using GHC 7.8 and the default backend, it runs 10s on my Mac, which is 50x slower than the C version.

然后,我使用尾部递归编写了一个版本,而不生成中间列表:

Then, I wrote a version using tail recursion and not generating an intermediate list:

collatzNext :: Int -> Int
collatzNext a
  | even a    = a `div` 2
  | otherwise = (3 * a + 1) `div` 2

collatzLen :: Int -> Int
collatzLen n = collatzIter n 0
  where
    collatzIter 1 len = len
    collatzIter n len = collatzIter (collatzNext n) (len + 1)

main = do
  print $ maximum $ [collatzLen x | x <- [1..1000000]]

它与GHC 7.8和O2一起编译,运行2秒钟,比C版本慢10倍.

Compiled with GHC 7.8 and O2, it runs for 2s, 10x slower than the C version.

有趣的是,当我将类型注释中的Int更改为Word时,它只花了1秒,快了2倍!

Interestingly, when I changed Int in the type annotation to Word, it only spent 1s, 2x faster!

我已经尝试使用BangPatterns进行严格的评估,但是并没有发现明显的性能提升-我猜GHC的严格分析足够聪明,可以处理这种简单的情况.

I have tried BangPatterns for explicit strict evaluation, but no significant performance gain could be noticed — I guess GHC's strict analysis is smart enough to handle such a simple scenario.

我的问题是:

  1. 为什么Word版本比Int版本快得多?
  2. 为什么Haskell程序与用C编写的程序相比这么慢?
  1. Why is the Word version so much faster compared with the Int one?
  2. Why is this Haskell program so slow compared with that written in C?

推荐答案

该程序的性能取决于几个因素.如果我们都能正确解决所有问题,那么性能将与C程序相同.经历以下因素:

The performance of this program depends on several factors. If we get all of them right, the performance becomes the same as that of the C program. Going through these factors:

1.使用和比较正确的字号

发布的C代码段不完全正确;它在所有体系结构上都使用32位整数,而Haskell Int -s在64位计算机上为64位.在进行其他操作之前,我们应确保在两个程序中使用相同的字长.

The posted C code snippet is not exactly right; it uses 32-bit integers on all architectures, while the Haskell Int-s are 64 bit on a 64-bit machine. Before anything else, we should make sure to use the same word size in both programs.

此外,我们应该始终在我们的Haskell代码中使用 native-size 整数类型.因此,如果我们使用的是64位系统,则应使用64位数字,并避免使用Int32 -s和Word32 -s,除非特别需要它们.这是因为对非本机整数的运算主要实现为外部呼叫而不是primops ,因此它们的运行速度要慢得多.

Also, we should always use native-sized integral types in our Haskell code. So if we're on a 64-bit system, we should use 64-bit numbers, and steer clear of Int32-s and Word32-s, unless there is a specific need for them. This is because operations on non-native integers are mostly implemented as foreign calls rather than primops, so they're significantly slower.

2.在collatzNext

2. Division in collatzNext

divquot慢,因为div处理负数

div is slower than quot for Int, because div handles negative numbers differently. If we use div and switch to Word, the program gets faster, because div is the same as quot for Word. quot with Int works just as fine. However, this is still not as fast as C. We can divide by two by shifting bits to the right. For some reason not even LLVM does this particular strength reduction in this example, so we're best off doing it by hand, replacing quot n 2 by shiftR n 1.

3.检查均匀度

最快的检查方法是检查最低有效位. LLVM可以为此优化even,而本机代码生成器则不能.因此,如果我们使用本机代码生成器,则可以将even n替换为n .&. 1 == 0,这将大大提高性能.

The fastest way to check this is by checking the least significant bit. LLVM can optimize even to this, while the native codegen cannot. So, if we're on native codegen, even n could be replaced with n .&. 1 == 0, and this gives a nice performance boost.

但是,我发现了GHC 7.10的性能错误.在这里,我们没有为Word插入内联的even,这会破坏性能(在代码的最热部分调用带有堆分配的Word框的函数会执行此操作).因此,这里我们应该使用rem n 2 == 0n .&. 1 == 0而不是even.但是,Inteven可以内联.

However, I found something of a performance bug with GHC 7.10. Here we don't get an inlined even for Word, which wrecks performance (calling a function with a heap-allocated Word box in the hottest part of the code does this). So here we should use rem n 2 == 0 or n .&. 1 == 0 instead of even. The even for Int gets inlined fine though.

4.融合collatzLen

4. Fusing away lists in collatzLen

这是一个关键因素.链接的博客文章对此有些过时了. GHC 7.8在这里不能进行融合,但是7.10可以.这意味着使用GHC 7.10和LLVM,我们可以方便地获得类似C的性能,而无需大量修改原始代码.

This is a crucial factor. The linked blog post is a bit out-of-date with regards to this. GHC 7.8 can't do fusion here, but 7.10 can. This means that with GHC 7.10 and LLVM we can conveniently get C-like performance without significantly modifying the original code.

collatzNext a = (if even a then a else 3*a+1) `quot` 2
collatzLen a0 = length $ takeWhile (/= 1) $ iterate collatzNext a0
maxColLen n   = maximum $ map collatzLen [1..n]

main = do
    [n] <- getArgs
    print $ maxColLen (read n :: Int) 

使用ghc-7.10.1 -O2 -fllvmn = 10000000,以上程序在 2.8 秒内运行,而等效的C程序在 2.4 秒内运行.如果我在没有LLVM的情况下编译相同的代码,那么我将获得 12.4 的第二个运行时.这种减速完全是由于even缺乏优化.如果我们使用a .&. 1 == 0,则速度下降将消失.

With ghc-7.10.1 -O2 -fllvm and n = 10000000, the above program runs in 2.8 seconds, while the equivalent C program runs in 2.4 seconds. If I compile the same code without LLVM, then I instead get 12.4 second runtime. This slowdown is entirely because of the lack of optimization on even. If we use a .&. 1 == 0, then the slowdown disappears.

5.计算最大长度时融合列表

GHC 7.10甚至都无法做到这一点,因此我们不得不求助于手动循环编写.

Not even GHC 7.10 can do this, so we have to resort to manual loop-writing.

collatzNext a = (if a .&. 1 == 0 then a else 3*a+1) `shiftR` 1
collatzLen    = length . takeWhile (/= 1) . iterate collatzNext

maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)

现在,对于ghc-7.10.1 -O2 -fllvmn = 10000000,上述代码在 2.1 秒内运行,而C程序在 2.4 秒内运行.如果我们想在不使用LLVM和GHC 7.10的情况下达到类似的性能,则只需手动应用重要的缺少的优化方法即可:

Now, for ghc-7.10.1 -O2 -fllvm and n = 10000000, the above code runs in 2.1 seconds, while the C program runs in 2.4 seconds. If we want to achieve similar performance without LLVM and GHC 7.10, we just have to manually apply the important missing optimizations:

collatzLen :: Int -> Int
collatzLen = go 0 where
  go l 1 = l
  go l n | n .&. 1 == 0 = go (l + 1) (shiftR n 1)
         | otherwise    = go (l + 1) (shiftR (3 * n + 1) 1)

maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)

现在,使用ghc-7.8.4 -O2n = 10000000,我们的代码将在 2.6 秒内运行.

Now, with ghc-7.8.4 -O2 and n = 10000000, our code runs in 2.6 seconds.

这篇关于Haskell:为什么Int的性能比Word64差,为什么我的程序比C慢得多?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆