Haskell:为什么Int表现比Word64差,为什么我的程序比C慢? [英] Haskell: Why does Int performs worse than Word64, and why my program is far slower than C?

查看:101
本文介绍了Haskell:为什么Int表现比Word64差,为什么我的程序比C慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在阅读一篇 Haskell在与Collat​​z猜想中玩的速度有多慢 ,它基本上说明如果你继续乘以三和加一个奇数,或者将一个二分为一,那么你最终会得到一个。例如,3→10→5→16→8→4→2→1。

本文给出的程序是计算在给定范围内最长的Collat​​z序列。 C版本是:

  #include  

int main(int argc,char ** argv){
int max_a0 = atoi(argv [1]);
int longest = 0,max_len = 0;
int a0,len;
unsigned long a;

(a0 = 1; a0 <= max_a0; a0 ++){
a = a0;
len = 0;

while(a!= 1){
len ++;
a =((a%2 == 0)?a:3 * a + 1)/ 2;
}

if(len> max_len){
max_len = len;
最长= a0;


printf((%d,%d)\\\
,max_len,longest);
返回0;
}

使用Clang O2进行编译,它在我的电脑上运行0.2秒。 p>

该文中给出的Haskell版本显式生成整个序列作为列表,然后计算中间列表的长度。它比C版慢10倍。但是,由于作者使用LLVM作为后端,我没有安装,所以我无法重现这一点。使用GHC 7.8和默认后端,它在我的Mac上运行10s,比C版本慢50倍。

然后,我使用尾递归编写一个版本,而不是生成一个中间列表:

  collat​​zNext :: Int  - > Int 
collat​​zNext a
|即使是a =``div` 2
|否则=(3 * a + 1)`div` 2

collat​​zLen :: Int - > Int
collat​​zLen n = collat​​zIter n 0
where
collat​​zIter 1 len = len
collat​​zIter n len = collat​​zIter(collat​​zNext n)(len + 1)

main = do
print $ maximum $ [collat​​zLen x | x< - [1..1000000]]

用GHC 7.8和O2编译, 2s,比C版本慢10倍。

有趣的是,当我将类型注释中的 Int 更改为 Word ,它只花费1秒,快2倍!

我已经尝试了BangPatterns进行明确的严格评估,但没有显着的性能提升可能会被注意到 - 我想GHC的严格分析足够聪明来处理这样一个简单的情况。



我的问题是:


  1. 为什么 Word 版本比 Int 一个版本快得多?

  2. 为什么Haskell程序与C中的相比要慢很多?


解决方案

这个程序的性能取决于几个因素。如果我们把它们全部弄对,那么性能就和C程序一样。仔细研究这些因素:

1。使用并比较正确的单词大小



发布的C代码片段不完全正确;它在所有体系结构上使用32位整数,而64位机器上的Haskell Int -s是64位。在开始之前,我们应该确保在两个程序中使用相同的字大小。

另外,我们应该在我们的Haskell代码中始终使用原始大小的整型类型。因此,如果我们使用的是64位系统,我们应该使用64位数字,避开 Int32 -s和 Word32 -s,除非他们有特定需求。这是因为对非本地整数的操作大部分实现为外国电话,而不是primops ,所以它们明显变慢。

2。分部在 collat​​zNext



div 是由于 div 处理负数,所以 Int 慢于 quot 不同。如果我们使用 div 并切换到 Word ,程序会变得更快,因为 div <对于 Word ,code>与 quot 相同。 quot Int 一样好。但是,这还不及C那么快。我们可以通过将位移到右边来将它们除以2。由于某些原因,LLVM在这个例子中并没有做这种特殊的强度减少,所以我们最好是手动做,用替换 quot n 2 shiftR n 1

3。检查均匀



检查这个的最快方法是检查最不重要的位。 LLVM可以优化甚至,而本地代码不能。因此,如果我们使用本机代码, even n 可以替换为 n。& ;. 1 == 0 ,这样可以提高性能。



然而,我发现GHC 7.10有一些性能问题。在这里,我们不会为<$​​ c $ c> Word 获得内联甚至,这会破坏性能(使用堆分配调用函数 Word的框在代码最热门的部分是这样做的)。因此,我们应该使用 rem n 2 == 0 n。& ;. 1 == 0 而不是。然而, Int 甚至获得内联。

4。在 collat​​zLen



中融合列表这是一个至关重要的因素。链接的博客帖子与此有点过时。 GHC 7.8不能在这里做融合,但是7.10可以。这意味着使用GHC 7.10和LLVM,我们可以方便地获得类似C的性能,而不会显着修改原始代码。

  collat​​z下一个a =(如果连a a else 3 * a + 1)`2`b $ b collat​​zLen a0 = length $ takeWhile(/ = 1)$ iterate collat​​zNext a0 
maxColLen n =最大$ map collat​​zLen [1..n]

main = do
[n]< - getArgs
print $ maxColLen(read n :: Int)

使用 ghc-7.10.1 -O2 -fllvm n = 10000000 ,上述程序以 2.8 秒,而等效的C程序则在 2.4 秒内运行。如果我在没有LLVM的情况下编译相同的代码,那么我将获得 12.4 第二运行时。这种放缓完全是因为对甚至缺乏优化。如果我们使用 a。& ;. 1 == 0 ,那么减速将消失。



5。在计算最大长度时将列表重新排序



即使GHC 7.10也可以做到这一点,所以我们不得不求助于手动循环写入。

  collat​​zNext a =(如果a。& ;. 1 == 0 then a else 3 * a + 1)`shiftR` 1 
collat​​zLen =长度。 takeWhile(/ = 1)。 iterate collat​​zNext

maxCol :: Int - > Int
maxCol = go 1 1其中
go ml i n |我> n = ml
go ml in = go(max ml(collat​​zLen i))(i + 1)n

main = do
[n]< - getArgs
print $ maxCol(read n :: Int)

现在,对于 ghc-7.10.1 -O2 -fllvm n = 10000000 ,上面的代码在 2.1 秒内运行,while C程序在 2.4 秒内运行。如果我们希望在没有LLVM和GHC 7.10的情况下实现类似的性能,我们只需手动应用重要的缺失优化:

  collat​​zLen: :Int  - > Int 
collat​​zLen = go 0 where
go l 1 = l
go l n | n& ;. 1 == 0 = go(l + 1)(shiftR n 1)
|否则= go(l + 1)(shiftR(3 * n + 1)1)

maxCol :: Int - > Int
maxCol = go 1 1其中
go ml i n |我> n = ml
go ml in = go(max ml(collat​​zLen i))(i + 1)n

main = do
[n]< - getArgs
print $ maxCol(read n :: Int)

现在,使用 ghc-7.8.4 -O2 n = 10000000 ,我们的代码在 2.6 秒内运行。

I was reading an article of how slow Haskell it is in playing with Collatz conjecture, which basically states if you keep multiplying three and plus one to an odd number, or dividing an even one with two, you will eventually get one. For instance, 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1.

The program given in this article is to calculate the longest Collatz sequence in a given range. The C version is:

#include <stdio.h>

int main(int argc, char **argv) {
   int max_a0 = atoi(argv[1]); 
   int longest = 0, max_len = 0;
   int a0, len;
   unsigned long a;

   for (a0 = 1; a0 <= max_a0; a0++) {
      a = a0;
      len = 0;

      while (a != 1) {
         len++;
         a = ((a%2==0)? a : 3*a+1)/2;
      }

      if (len > max_len) {
         max_len = len;
         longest = a0;
      }
   }
   printf("(%d, %d)\n", max_len, longest);
   return 0;
}

Compiling with Clang O2, it runs on my computer for 0.2s.

The Haskell version given in that article generates the whole sequence as a list explicitly, and then calculate the length of the intermediate list. It is 10x slower than C version. However, since the author used LLVM as a backend, which I haven't installed, I couldn't reproduce this. Using GHC 7.8 and the default backend, it runs 10s on my Mac, which is 50x slower than the C version.

Then, I write a version using tail recursion and not generating an intermediate list:

collatzNext :: Int -> Int
collatzNext a
  | even a    = a `div` 2
  | otherwise = (3 * a + 1) `div` 2

collatzLen :: Int -> Int
collatzLen n = collatzIter n 0
  where
    collatzIter 1 len = len
    collatzIter n len = collatzIter (collatzNext n) (len + 1)

main = do
  print $ maximum $ [collatzLen x | x <- [1..1000000]]

Compiled with GHC 7.8 and O2, it runs for 2s, 10x slower than the C version.

Interestingly, when I changed Int in the type annotation to Word, it only spent 1s, 2x faster!

I have tried BangPatterns for explicit strict evaluation, but no significant performance gain could be noticed — I guess GHC's strict analysis is smart enough to handle such a simple scenario.

My questions are:

  1. Why is the Word version so faster compared with the Int one?
  2. Why is this Haskell program so slow compared with that in C?

解决方案

The performance of this program depends on several factors. If we get all of them right, the performance becomes the same as that of the C program. Going through these factors:

1. Using and comparing the right word sizes

The posted C code snippet is not exactly right; it uses 32-bit integers on all architectures, while the Haskell Int-s are 64 bit on a 64-bit machine. Before anything else, we should make sure to use the same word size in both programs.

Also, we should always use native-sized integral types in our Haskell code. So if we're on a 64-bit system, we should use 64-bit numbers, and steer clear of Int32-s and Word32-s, unless there is a specific need for them. This is because operations on non-native integers are mostly implemented as foreign calls rather than primops, so they're significantly slower.

2. Division in collatzNext

div is slower than quot for Int, because div handles negative numbers differently. If we use div and switch to Word, the program gets faster, because div is the same as quot for Word. quot with Int works just as fine. However, this is still not as fast as C. We can divide by two by shifting bits to the right. For some reason not even LLVM does this particular strength reduction in this example, so we're best off doing it by hand, replacing quot n 2 by shiftR n 1.

3. Checking evenness

The fastest way to check this is by checking the least significant bit. LLVM can optimize even to this, while the native codegen cannot. So, if we're on native codegen, even n could be replaced with n .&. 1 == 0, and this gives a nice performance boost.

However, I found something of a performance bug with GHC 7.10. Here we don't get an inlined even for Word, which wrecks performance (calling a function with a heap-allocated Word box in the hottest part of the code does this). So here we should use rem n 2 == 0 or n .&. 1 == 0 instead of even. The even for Int gets inlined fine though.

4. Fusing away lists in collatzLen

This is a crucial factor. The linked blog post is a bit out-of-date with regards to this. GHC 7.8 can't do fusion here, but 7.10 can. This means that with GHC 7.10 and LLVM we can conveniently get C-like performance without significantly modifying the original code.

collatzNext a = (if even a then a else 3*a+1) `quot` 2
collatzLen a0 = length $ takeWhile (/= 1) $ iterate collatzNext a0
maxColLen n   = maximum $ map collatzLen [1..n]

main = do
    [n] <- getArgs
    print $ maxColLen (read n :: Int) 

With ghc-7.10.1 -O2 -fllvm and n = 10000000, the above program runs in 2.8 seconds, while the equivalent C program runs in 2.4 seconds. If I compile the same code without LLVM, then I instead get 12.4 second runtime. This slowdown is entirely because of the lack of optimization on even. If we use a .&. 1 == 0, then the slowdown disappears.

5. Fusing away lists when computing the maximum length

Not even GHC 7.10 can do this, so we have to resort to manual loop-writing.

collatzNext a = (if a .&. 1 == 0 then a else 3*a+1) `shiftR` 1
collatzLen    = length . takeWhile (/= 1) . iterate collatzNext

maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)

Now, for ghc-7.10.1 -O2 -fllvm and n = 10000000, the above code runs in 2.1 seconds, while the C program runs in 2.4 seconds. If we want to achieve similar performance without LLVM and GHC 7.10, we just have to manually apply the important missing optimizations:

collatzLen :: Int -> Int
collatzLen = go 0 where
  go l 1 = l
  go l n | n .&. 1 == 0 = go (l + 1) (shiftR n 1)
         | otherwise    = go (l + 1) (shiftR (3 * n + 1) 1)

maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)

Now, with ghc-7.8.4 -O2 and n = 10000000, our code runs in 2.6 seconds.

这篇关于Haskell:为什么Int表现比Word64差,为什么我的程序比C慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆