Haskell: Why does Int perform worse than Word64, and why is my program far slower than C?
Question
I was reading an article about how slow Haskell is at playing with the Collatz conjecture, which basically states that if you keep multiplying an odd number by three and adding one, or dividing an even number by two, you will eventually reach one. For instance, 3 → 10 → 5 → 16 → 8 → 4 → 2 → 1. The program given in that article computes the longest Collatz sequence in a given range. The C version is:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int max_a0 = atoi(argv[1]);
    int longest = 0, max_len = 0;
    int a0, len;
    unsigned long a;
    for (a0 = 1; a0 <= max_a0; a0++) {
        a = a0;
        len = 0;
        while (a != 1) {
            len++;
            a = ((a % 2 == 0) ? a : 3 * a + 1) / 2;
        }
        if (len > max_len) {
            max_len = len;
            longest = a0;
        }
    }
    printf("(%d, %d)\n", max_len, longest);
    return 0;
}
Compiled with Clang at -O2, it runs on my computer in 0.2s.
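The example chain above can be verified with a few throwaway lines (a sketch only, not part of the benchmarked programs; `step` is a hypothetical name for the unmerged Collatz step):

```haskell
-- One unmerged Collatz step: 3n+1 for odd n, n/2 for even n.
step :: Int -> Int
step n = if even n then n `div` 2 else 3 * n + 1

main :: IO ()
main = print (take 8 (iterate step 3))  -- [3,10,5,16,8,4,2,1]
```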
The Haskell version given in that article generates the whole sequence explicitly as a list, and then computes the length of that intermediate list. It is 10x slower than the C version. However, since the author used LLVM as a backend, which I haven't installed, I couldn't reproduce this. Using GHC 7.8 and the default backend, it runs in 10s on my Mac, which is 50x slower than the C version.
Then I wrote a version using tail recursion, not generating an intermediate list:

collatzNext :: Int -> Int
collatzNext a
  | even a    = a `div` 2
  | otherwise = (3 * a + 1) `div` 2

collatzLen :: Int -> Int
collatzLen n = collatzIter n 0
  where
    collatzIter 1 len = len
    collatzIter n len = collatzIter (collatzNext n) (len + 1)

main = do
  print $ maximum $ [collatzLen x | x <- [1..1000000]]
Compiled with GHC 7.8 at -O2, it runs for 2s, 10x slower than the C version.
Interestingly, when I changed the Int in the type annotations to Word, it only took 1s, 2x faster! I have tried BangPatterns for explicit strict evaluation, but no significant performance gain could be noticed; I guess GHC's strictness analysis is smart enough to handle such a simple case.

My questions are:

- Why is the Word version so much faster than the Int one?
- Why is the Haskell program so much slower than the C one?
The performance of this program depends on several factors. If we get all of them right, it becomes the same as that of the C program. Going through these factors:

1. Using and comparing the right word sizes

The posted C code snippet is not exactly right: it uses 32-bit integers on all architectures, while Haskell's Int is 64 bits wide on a 64-bit machine. Before anything else, we should make sure to use the same word size in both programs.

Also, we should always use native-sized integral types in our Haskell code. So if we're on a 64-bit system, we should use 64-bit numbers and steer clear of Int32 and Word32, unless there is a specific need for them. Operations on non-native integers are mostly implemented as foreign calls rather than primops, so they are significantly slower.
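A quick way to confirm which word size a build actually uses is `finiteBitSize` from Data.Bits (available since base 4.7 / GHC 7.8; this is a sanity-check sketch, not part of the benchmark):

```haskell
import Data.Bits (finiteBitSize)

-- Print the bit widths of the native Int and Word types.
-- On a 64-bit system both should report 64; if you see 32,
-- the Haskell and C programs are not comparing like with like.
main :: IO ()
main = do
  print (finiteBitSize (0 :: Int))
  print (finiteBitSize (0 :: Word))
```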
2. Division in collatzNext

div is slower than quot for Int, because div handles negative numbers differently. If we use div and switch to Word, the program gets faster, because div is the same as quot for Word. Using quot with Int works just as well. However, this is still not as fast as C: we can divide by two by shifting the bits to the right. For some reason not even LLVM does this particular strength reduction in this example, so we're best off doing it by hand, replacing quot n 2 with shiftR n 1.
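The semantic difference between the two division operators only shows up on negative operands, which is a minimal check can make concrete (a sketch; the printed values follow the Haskell report's definitions of `div` and `quot`):

```haskell
-- div rounds toward negative infinity, quot truncates toward zero.
-- For non-negative operands (and hence for all of Word) they agree,
-- which is why switching to Word makes div as cheap as quot.
main :: IO ()
main = do
  print (div  (-7) 2)           -- -4 (rounds down)
  print (quot (-7) 2)           -- -3 (truncates)
  print (div 7 2 == quot 7 2)   -- True for non-negative values
```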
3. Checking evenness

The fastest way to check this is to test the least significant bit. LLVM can optimize even down to this, while the native codegen cannot. So if we're on the native codegen, even n should be replaced with n .&. 1 == 0, which gives a nice performance boost.

However, I found something of a performance bug in GHC 7.10: there we don't get an inlined even for Word, which wrecks performance (calling a function with a heap-allocated Word box in the hottest part of the code does this). So here we should use rem n 2 == 0 or n .&. 1 == 0 instead of even. The even for Int gets inlined fine, though.
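The bit test and `even` agree for both positive and negative fixed-width integers, since GHC uses two's complement; a small equivalence check (a sketch, with `isEvenBit` a hypothetical helper name):

```haskell
import Data.Bits ((.&.))

-- Testing the least significant bit is equivalent to `even`,
-- but compiles down to a single AND instruction.
isEvenBit :: Int -> Bool
isEvenBit n = n .&. 1 == 0

main :: IO ()
main = print (all (\n -> isEvenBit n == even n) [-4 .. 4])
```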
4. Fusing away lists in collatzLen

This is a crucial factor. The linked blog post is a bit out of date in this regard: GHC 7.8 can't do fusion here, but 7.10 can. This means that with GHC 7.10 and LLVM we can conveniently get C-like performance without significantly modifying the original code:

import System.Environment

collatzNext a = (if even a then a else 3 * a + 1) `quot` 2
collatzLen a0 = length $ takeWhile (/= 1) $ iterate collatzNext a0
maxColLen n = maximum $ map collatzLen [1..n]

main = do
  [n] <- getArgs
  print $ maxColLen (read n :: Int)

With ghc-7.10.1 -O2 -fllvm and n = 10000000, the above program runs in 2.8 seconds, while the equivalent C program runs in 2.4 seconds. If I compile the same code without LLVM, I instead get a 12.4-second runtime. This slowdown is entirely due to the missing optimization of even; if we use a .&. 1 == 0 instead, the slowdown disappears.
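The two build configurations being compared can be reproduced roughly as follows (a sketch: `Collatz.hs` is a hypothetical file name, and the compiler binary may be named `ghc` or `ghc-7.10.1` depending on your installation):

```shell
# With the LLVM backend (list fusion plus LLVM's optimizations):
ghc -O2 -fllvm Collatz.hs -o collatz-llvm
# Native codegen only, for comparison:
ghc -O2 Collatz.hs -o collatz-ncg
# Run over the range used in the answer's measurements:
./collatz-llvm 10000000
```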
5. Fusing away lists when computing the maximum length

Not even GHC 7.10 can do this, so we have to resort to manual loop-writing:

import System.Environment
import Data.Bits

collatzNext a = (if a .&. 1 == 0 then a else 3 * a + 1) `shiftR` 1
collatzLen = length . takeWhile (/= 1) . iterate collatzNext

maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)

Now, with ghc-7.10.1 -O2 -fllvm and n = 10000000, the above code runs in 2.1 seconds, while the C program runs in 2.4 seconds. If we want to achieve similar performance without LLVM and GHC 7.10, we just have to apply the important missing optimizations manually:

import System.Environment
import Data.Bits

collatzLen :: Int -> Int
collatzLen = go 0 where
  go l 1 = l
  go l n | n .&. 1 == 0 = go (l + 1) (shiftR n 1)
         | otherwise    = go (l + 1) (shiftR (3 * n + 1) 1)

maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)

Now, for ghc-7.8.4 -O2 and n = 10000000, our code runs in 2.6 seconds.
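As a correctness cross-check, the hand-optimized loop can be compared against the straightforward list-based definition on a small range (a sketch; `collatzLenFast` and `collatzLenRef` are hypothetical names for the two versions above):

```haskell
import Data.Bits (shiftR, (.&.))

-- Hand-optimized step counter (shift and bit-test instead of div/even).
collatzLenFast :: Int -> Int
collatzLenFast = go 0 where
  go l 1 = l
  go l n | n .&. 1 == 0 = go (l + 1) (shiftR n 1)
         | otherwise    = go (l + 1) (shiftR (3 * n + 1) 1)

-- Naive list-based reference.
collatzNext :: Int -> Int
collatzNext a = (if even a then a else 3 * a + 1) `div` 2

collatzLenRef :: Int -> Int
collatzLenRef = length . takeWhile (/= 1) . iterate collatzNext

-- Both should agree on every input; the optimizations only change
-- how each step is computed, not the number of steps.
main :: IO ()
main = print (all (\n -> collatzLenFast n == collatzLenRef n) [1 .. 1000])
```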