为什么Haskell代码与-O运行速度较慢? [英] Why does this Haskell code run slower with -O?

查看:157
本文介绍了为什么Haskell代码与-O运行速度较慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

-O ,但是 -O 这段Haskell代码运行得慢很多 >应该非危险。谁能告诉我发生了什么事?如果它很重要,那就是尝试解决这个问题,并且它使用二进制搜索和持久段树:

  import Control.Monad 
import Data.Array

数据Node =
Leaf Int - 值
| Branch Int节点Node - sum,left child,right child
类型NodeArray = Array Int节点

- 创建一个范围为[l,r)的空节点
create: :Int - > Int - >节点
创建l r
| l + 1 == r = Leaf 0
|否则=分支0(创建l m)(创建m r)
其中m =(l + r)`div` 2

- 获得范围[0,r)的和。节点的范围是[nl,nr)
sumof :: Node - > Int - > Int - > Int - > Int
sumof(Leaf val)r nl nr
| nr <= r = val
|否则= 0
sumof(Branch sum lc rc)r nl nr
| nr <= r = sum
| r> nl =(sumof lc r nl m)+(sumof rc r m nr)
|否则= 0
其中m =(nl + nr)`div` 2

- 将x的值增加1.节点的范围是[nl,nr)
增加::节点 - > Int - > Int - > Int - >节点
增加(Leaf val)x nl nr = Leaf(val + 1)
增加(Branch sum lc rc)x nl nr
| x < m =分支(sum + 1)(增加lc x nl m)rc
|否则=分支(sum + 1)lc(增加rc xm nr)
其中m =(nl + nr)`div` 2

- 签名表示全部
tonodes :: Int - > [Int] - > [节点]
tonodes n =反向。 tonodes。反向
其中
tonodes':: [Int] - > [节点]
tonodes'(h:t)=增加h'h 0 n:s'其中s'@(h':_)= tonodes't
tonodes'_ = [create 0 n ]

- 找到[l,r]中的最小m,使得(predicate m)为True
binarysearch ::(Int→> Bool) - > Int - > Int - > Int
binarysearch谓词l r
| l == r = r
|谓词m =二元搜索谓词l m
|否则= binarysearch谓词(m + 1)r
其中m =(l + r)`div` 2

- main,字面意义
main :: IO()
main = do
[n,m] < - fmap(map read。words)getLine
nodes <-fmap(listArray(0,n)。tonodes n。map(subtract 1 )。map read。words)getLine
replicateM_ m $ query n nodes
where
query :: Int - > NodeArray - > IO()
query n nodes = do
[p,k]< - fmap(map read。words)getLine
print $ binarysearch(ok nodes npk)0 n
其中
ok :: NodeArray - > Int - > Int - > Int - > Int - > Bool
ok nodes npks =(sumof(nodes!min(p + s + 1)n)s 0 n) - (sumof(nodes!max(p-s)0)s 0 n)> = k

(这与

这是我在C ++中的输入生成器:

#include< cstdio>
#include< cstdlib>
使用namespace std;
int main(int argc,char * argv []){
srand(1827);
int n = 100000;
if(argc> 1)
sscanf(argv [1],%d,& n);
printf(%d%d \ n,n,n);
for(int i = 0; i< n; i ++)
printf(%d%c,rand()%n + 1,i == n - 1?'\\\
':'');
for(int i = 0; i< n; i ++){
int p = rand()%n;
int k = rand()%n + 1;
printf(%d%d \\\
,p,k);




$ b

如果你没有C ++编译器可用,这是 ./ gen.exe 1000



这是我电脑上的执行结果:
$ b

  $ ghc --version 
Glorious Glasgow Haskell编译系统,版本7.8.3
$ ghc -fforce-recomp 1827.hs
[编译] Main(1827.hs,1827.o)
链接1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe> / dev / null
real 0m0.088s
user 0m0.015s
sys 0m0.015s
$ ghc -fforce-recomp -O 1827.hs
[1 1]编译Main(1827.hs,1827.o)
链接1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe> / dev / null
real 0m2.969s
user 0m0.000s
sys 0m0.045s

这就是堆概况总结:
$ b

  $ ghc -fforce -recomp -rtsopts ./1827.hs 
[1的1]编译主(1827.hs,1827.o)
链接1827.exe ...
$ ./gen.exe 1000 | ./1827.exe + RTS -s> / dev / null
在堆中分配的70,207,096字节
在GC
中复制的2,112,416字节最大驻留期间的613,368字节(3个样本)
28,816字节的最大值
使用中的3MB总内存(由于碎片造成0MB丢失)
总时间(已用)平均暂停最大暂停
Gen 0 132 colls,0 par 0.00s 0.00s 0.0000s 0.0004s
Gen 1 3 colls,0 par 0.00s 0.00s 0.0006s 0.0010s
INIT时间0.00s(经过0.00s)
MUT时间0.03s(经过0.03s)
GC时间0.00s( 0.01s经过)
退出时间0.00s(已过去0.00s)
总时间0.03s(已过0.04s)
%GC时间0.0%(已过去14.7%)
分配率2,250,213,011字节每MUT秒
生产力总用户的100.0%,占已用总数的83​​.1%
$ ghc -fforce-recomp -O -rtsopts ./1827.hs
[1的1]编译Main(1827.hs,1827.o)
链接1827.exe ...
$ ./gen.exe 1000 | ./1827.exe + RTS -s> / dev / null
在堆中分配的6,009,233,608字节
在GC
中复制622,682,200字节最大居民地址数为443,240字节(505个样本)
48,256字节最大地址
使用中的3 MB总内存(由于碎片而丢失0 MB)
总时间(已用)平均暂停最大暂停
Gen 0 10945 colls,0 par 0.72s 0.63s 0.0001s 0.0004s
Gen 1 505 colls,0 par 0.16s 0.13s 0.0003s 0.0005s
INIT时间0.00s(经过0.00s)
MUT时间2.00s(已过2.13s)
GC时间0.87s(经过0.76s)
退出时间0.00s(已过去0.00s)
总时间2.89s(已过时2.90s)
%GC时间30.3%(已过去26.4%)
分配率每MUT秒3,009,412,603个字节秒
生产力总用户数的69.7%,已用完总数的69.4%


解决方案



你的代码发生了什么, -O

让我放大你的主函数,并稍微改写它:

  main :: IO()
main = do
[n,m]< - fmap(地图读取。字)getLine
行< - getLine
let nodes = listArray(0,n)。 tonodes n。地图(减1)。地图阅读。字$ line
replicateM_ m $查询n个节点

显然,这里的意图是 NodeArray 创建一次,然后用于查询的每个 m code>。



遗憾的是,GHC将此代码有效地转换为

  main = do 
[n,m]< - fmap(map read。words)getLine
line< - getLine
replicateM_ m $ do
let nodes = listArray(0,n)。 tonodes n。地图(减1)。地图阅读。单词$行
查询n个节点

你可以立刻在这里看到问题。 p>

什么是状态黑客,为什么它会破坏我的程序性能?

原因是状态黑客,它(粗略地)说:当某些类型的 IO a 时,假定它只被调用一次。 官方文档并没有更详细的说明:
$ b


-fno-state-hack



关闭state hack,任何带有State#token的lambda都被认为是单条记录,因此将内联内容放在内部是可以的。这可以提高IO和ST monad代码的性能,但它会带来减少共享的风险。

粗略地说,这个想法如下:如果使用 IO 类型和where子句定义函数,例如

  foo x = do 
putStrLn y
putStrLn y
其中y = ... x ...

类型 IO a 类型的东西可以被视为类型 RealWord - > (a,RealWorld)。在上述观点中,上述变成了(大致)

  foo x = 
let y = ... x .. 。in
\world1 - >
let(world2,())= putStrLn y world1
(world3,())= putStrLn y world2 $ b $ in(world3,())

调用 foo 会(通常)看起来像这样 foo参数世界。但是 foo 的定义只有一个参数,而另一个只在本地lambda表达式中被使用!这对 foo 来说是一个非常缓慢的调用。如果代码看起来像这样会更快:

  foo x world1 = 
let y = .. (world3,())= putStrLn y world1 $ b $ let(world3,())= putStrLn y world2 $ b $ in(world3,())

这称为eta-expansion,可以在各种理由下完成(例如,通过分析函数的定义,由检查它是如何被调用的,以及 - 在这种情况下是类型定向启发式检测)。



不幸的是,对于 foo 的调用实际上是 let fooArgument = foo参数的形式,即使用参数没有 world 通过(尚未)。在原始代码中,如果多次使用 fooArgument y 仍然只会计算一次,并且是共享的。在修改后的代码中,每次都会重新计算 y - 准确地说明节点发生了什么。



可以固定吗?



可能。尝试这样做,请参阅#9388 。解决这个问题的一个问题是,即使编译器无法确定地知道这一点,它仍然会在很多情况下都会降低性能。并且有可能出现技术上不正常的情况,即共享丢失,但仍然是有益的,因为加速呼叫的速度超过了重新计算的额外成本。所以目前还不清楚从哪里去。


This piece of Haskell code runs much slower with -O, but -O should be non-dangerous. Can anyone tell me what happened? If it matters, it is an attempt to solve this problem, and it uses binary search and persistent segment tree:

import Control.Monad
import Data.Array

data Node =
      Leaf   Int           -- value
    | Branch Int Node Node -- sum, left child, right child
type NodeArray = Array Int Node

-- create an empty node with range [l, r)
create :: Int -> Int -> Node
create l r
    | l + 1 == r = Leaf 0
    | otherwise  = Branch 0 (create l m) (create m r)
    where m = (l + r) `div` 2

-- Get the sum in range [0, r). The range of the node is [nl, nr)
sumof :: Node -> Int -> Int -> Int -> Int
sumof (Leaf val) r nl nr
    | nr <= r   = val
    | otherwise = 0
sumof (Branch sum lc rc) r nl nr
    | nr <= r   = sum
    | r  > nl   = (sumof lc r nl m) + (sumof rc r m nr)
    | otherwise = 0
    where m = (nl + nr) `div` 2

-- Increase the value at x by 1. The range of the node is [nl, nr)
increase :: Node -> Int -> Int -> Int -> Node
increase (Leaf val) x nl nr = Leaf (val + 1)
increase (Branch sum lc rc) x nl nr
    | x < m     = Branch (sum + 1) (increase lc x nl m) rc
    | otherwise = Branch (sum + 1) lc (increase rc x m nr)
    where m = (nl + nr) `div` 2

-- signature said it all
tonodes :: Int -> [Int] -> [Node]
tonodes n = reverse . tonodes' . reverse
    where
        tonodes' :: [Int] -> [Node]
        tonodes' (h:t) = increase h' h 0 n : s' where s'@(h':_) = tonodes' t
        tonodes' _ = [create 0 n]

-- find the minimum m in [l, r] such that (predicate m) is True
binarysearch :: (Int -> Bool) -> Int -> Int -> Int
binarysearch predicate l r
    | l == r      = r
    | predicate m = binarysearch predicate l m
    | otherwise   = binarysearch predicate (m+1) r
    where m = (l + r) `div` 2

-- main, literally
main :: IO ()
main = do
    [n, m] <- fmap (map read . words) getLine
    nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine
    replicateM_ m $ query n nodes
    where
        query :: Int -> NodeArray -> IO ()
        query n nodes = do
            [p, k] <- fmap (map read . words) getLine
            print $ binarysearch (ok nodes n p k) 0 n
            where
                ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool
                ok nodes n p k s = (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k

(This is exactly the same code with code review but this question addresses another problem.)

This is my input generator in C++:

#include <cstdio>
#include <cstdlib>
using namespace std;
int main (int argc, char * argv[]) {
    srand(1827);
    int n = 100000;
    if(argc > 1)
        sscanf(argv[1], "%d", &n);
    printf("%d %d\n", n, n);
    for(int i = 0; i < n; i++)
        printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' ');
    for(int i = 0; i < n; i++) {
        int p = rand() % n;
        int k = rand() % n + 1;
        printf("%d %d\n", p, k);
    }
}

In case you don't have a C++ compiler available, this is the result of ./gen.exe 1000.

This is the execution result on my computer:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3
$ ghc -fforce-recomp 1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe > /dev/null
real    0m0.088s
user    0m0.015s
sys     0m0.015s
$ ghc -fforce-recomp -O 1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe > /dev/null
real    0m2.969s
user    0m0.000s
sys     0m0.045s

And this is the heap profile summary:

$ ghc -fforce-recomp -rtsopts ./1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
      70,207,096 bytes allocated in the heap
       2,112,416 bytes copied during GC
         613,368 bytes maximum residency (3 sample(s))
          28,816 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       132 colls,     0 par    0.00s    0.00s     0.0000s    0.0004s
  Gen  1         3 colls,     0 par    0.00s    0.00s     0.0006s    0.0010s
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.03s  (  0.03s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.03s  (  0.04s elapsed)
  %GC     time       0.0%  (14.7% elapsed)
  Alloc rate    2,250,213,011 bytes per MUT second
  Productivity 100.0% of total user, 83.1% of total elapsed
$ ghc -fforce-recomp -O -rtsopts ./1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
   6,009,233,608 bytes allocated in the heap
     622,682,200 bytes copied during GC
         443,240 bytes maximum residency (505 sample(s))
          48,256 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10945 colls,     0 par    0.72s    0.63s     0.0001s    0.0004s
  Gen  1       505 colls,     0 par    0.16s    0.13s     0.0003s    0.0005s
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.00s  (  2.13s elapsed)
  GC      time    0.87s  (  0.76s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    2.89s  (  2.90s elapsed)
  %GC     time      30.3%  (26.4% elapsed)
  Alloc rate    3,009,412,603 bytes per MUT second
  Productivity  69.7% of total user, 69.4% of total elapsed

解决方案

I guess it is time this question gets a proper answer.

What happened to your code with -O

Let me zoom in your main function, and rewrite it slightly:

main :: IO ()
main = do
    [n, m] <- fmap (map read . words) getLine
    line <- getLine
    let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
    replicateM_ m $ query n nodes

Clearly, the intention here is that the NodeArray is created once, and then used in every of the m invocations of query.

Unfortunately, GHC transforms this code to, effectively,

main = do
    [n, m] <- fmap (map read . words) getLine
    line <- getLine
    replicateM_ m $ do
        let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
        query n nodes

and you can immediately see the problem here.

What is the state hack, and why does it destroy my programs performance

The reason is the state hack, which says (roughly): "When something is of type IO a, assume it is called only once.". The official documentation is not much more elaborate:

-fno-state-hack

Turn off the "state hack" whereby any lambda with a State# token as argument is considered to be single-entry, hence it is considered OK to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing.

Roughly, the idea is as follows: If you define a function with an IO type and a where clause, e.g.

foo x = do
    putStrLn y
    putStrLn y
  where y = ...x...

Something of type IO a can be viewed as something of type RealWord -> (a, RealWorld). In that view, the above becomes (roughly)

foo x = 
   let y = ...x... in 
   \world1 ->
     let (world2, ()) = putStrLn y world1
     let (world3, ()) = putStrLn y world2
     in  (world3, ())

A call to foo would (typically) look like this foo argument world. But the definition of foo only takes one argument, and the other one is only consumed later by a local lambda expression! That is going to be a very slow call to foo. It would be much faster if the code would look like this:

foo x world1 = 
   let y = ...x... in 
   let (world2, ()) = putStrLn y world1
   let (world3, ()) = putStrLn y world2
   in  (world3, ())

This is called eta-expansion and done on various grounds (e.g. by analyzing the function’s definition, by checking how it is being called, and – in this case – type directed heuristics).

Unfortunately it is unsound if the call to foo is actually of the form let fooArgument = foo argument, i.e. with an argument, but no world passed (yet). In the original code, if fooArgument is then used several times, y will still be calculated only once, and shared. In the modified code, y will be re-calculated every time – precisely what has happened to your nodes.

Can things be fixed?

Possibly. See #9388 for an attempt at doing so. The problem with fixing it is that it will cost performance in a lot of cases where the transformation happens to ok, even though the compiler cannot possibly know that for sure. And there are probably cases where it is technically not ok, i.e. sharing is lost, but it is still beneficial because the speedups from the faster calling outweigh the extra cost of the recalculation. So it is not clear where to go from here.

这篇关于为什么Haskell代码与-O运行速度较慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆