为什么Haskell代码与-O运行速度较慢？ [英] Why does this Haskell code run slower with -O?

查看：157 发布时间：2018/6/4 15:08:08 haskell optimization ghc compiler-bug

本文介绍了为什么Haskell代码与-O运行速度较慢？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

-O ，但是 -O 这段Haskell代码运行得慢很多 >应该非危险。谁能告诉我发生了什么事？如果它很重要，那就是尝试解决这个问题，并且它使用二进制搜索和持久段树：
import Control.Monad import Data.Array 数据Node = Leaf Int - 值 | Branch Int节点Node - sum，left child，right child 类型NodeArray = Array Int节点 - 创建一个范围为[l，r）的空节点 create：：Int - > Int - >节点创建l r | l + 1 == r = Leaf 0 |否则=分支0（创建l m）（创建m r）其中m =（l + r）`div` 2 - 获得范围[0，r）的和。节点的范围是[nl，nr） sumof :: Node - > Int - > Int - > Int - > Int sumof（Leaf val）r nl nr | nr <= r = val |否则= 0 sumof（Branch sum lc rc）r nl nr | nr <= r = sum | r> nl =（sumof lc r nl m）+（sumof rc r m nr） |否则= 0 其中m =（nl + nr）`div` 2 - 将x的值增加1.节点的范围是[nl，nr）增加::节点 - > Int - > Int - > Int - >节点增加（Leaf val）x nl nr = Leaf（val + 1）增加（Branch sum lc rc）x nl nr | x < m =分支（sum + 1）（增加lc x nl m）rc |否则=分支（sum + 1）lc（增加rc xm nr）其中m =（nl + nr）`div` 2 - 签名表示全部 tonodes :: Int - > [Int] - > [节点] tonodes n =反向。 tonodes。反向其中 tonodes':: [Int] - > [节点] tonodes'（h：t）=增加h'h 0 n：s'其中s'@（h'：_）= tonodes't tonodes'_ = [create 0 n ] - 找到[l，r]中的最小m，使得（predicate m）为True binarysearch ::（Int→> Bool） - > Int - > Int - > Int binarysearch谓词l r | l == r = r |谓词m =二元搜索谓词l m |否则= binarysearch谓词（m + 1）r 其中m =（l + r）`div` 2 - main，字面意义 main :: IO（） main = do [n，m] < - fmap（map read。words）getLine nodes <-fmap（listArray（0，n）。tonodes n。map（subtract 1 ）。map read。words）getLine replicateM_ m $ query n nodes where query :: Int - > NodeArray - > IO（） query n nodes = do [p，k]< - fmap（map read。words）getLine print $ binarysearch（ok nodes npk）0 n 其中 ok :: NodeArray - > Int - > Int - > Int - > Int - > Bool ok nodes npks =（sumof（nodes！min（p + s + 1）n）s 0 n） - （sumof（nodes！max（p-s）0）s 0 n）> = k
（这与

这是我在C ++中的输入生成器：
#include< cstdio> #include< cstdlib> 使用namespace std; int main（int argc，char * argv []）{ srand（1827）; int n = 100000; if（argc> 1） sscanf（argv [1]，％d，& n）; printf（％d％d \ n，n，n）; for（int i = 0; i< n; i ++） printf（％d％c，rand（）％n + 1，i == n - 1？'\\\ '：''）; for（int i = 0; i< n; i ++）{ int p = rand（）％n; int k = rand（）％n + 1; printf（％d％d \\\ ，p，k）; $ b如果你没有C ++编译器可用，这是 ./ gen.exe 1000 这是我电脑上的执行结果： $ b $ ghc --version Glorious Glasgow Haskell编译系统，版本7.8.3 $ ghc -fforce-recomp 1827.hs [编译] Main（1827.hs，1827.o）链接1827.exe ... $ time ./gen.exe 1000 | ./1827.exe> / dev / null real 0m0.088s user 0m0.015s sys 0m0.015s $ ghc -fforce-recomp -O 1827.hs [1 1]编译Main（1827.hs，1827.o）链接1827.exe ... $ time ./gen.exe 1000 | ./1827.exe> / dev / null real 0m2.969s user 0m0.000s sys 0m0.045s 这就是堆概况总结： $ b $ ghc -fforce -recomp -rtsopts ./1827.hs [1的1]编译主（1827.hs，1827.o）链接1827.exe ... $ ./gen.exe 1000 | ./1827.exe + RTS -s> / dev / null 在堆中分配的70,207,096字节在GC 中复制的2,112,416字节最大驻留期间的613,368字节（3个样本） 28,816字节的最大值使用中的3MB总内存（由于碎片造成0MB丢失）总时间（已用）平均暂停最大暂停 Gen 0 132 colls，0 par 0.00s 0.00s 0.0000s 0.0004s Gen 1 3 colls，0 par 0.00s 0.00s 0.0006s 0.0010s INIT时间0.00s（经过0.00s） MUT时间0.03s（经过0.03s） GC时间0.00s（ 0.01s经过）退出时间0.00s（已过去0.00s）总时间0.03s（已过0.04s）％GC时间0.0％（已过去14.7％）分配率2,250,213,011字节每MUT秒生产力总用户的100.0％，占已用总数的83.1％ $ ghc -fforce-recomp -O -rtsopts ./1827.hs [1的1]编译Main（1827.hs，1827.o）链接1827.exe ... $ ./gen.exe 1000 | ./1827.exe + RTS -s> / dev / null 在堆中分配的6,009,233,608字节在GC 中复制622,682,200字节最大居民地址数为443,240字节（505个样本） 48,256字节最大地址使用中的3 MB总内存（由于碎片而丢失0 MB）总时间（已用）平均暂停最大暂停 Gen 0 10945 colls，0 par 0.72s 0.63s 0.0001s 0.0004s Gen 1 505 colls，0 par 0.16s 0.13s 0.0003s 0.0005s INIT时间0.00s（经过0.00s） MUT时间2.00s（已过2.13s） GC时间0.87s（经过0.76s）退出时间0.00s（已过去0.00s）总时间2.89s（已过时2.90s）％GC时间30.3％（已过去26.4％）分配率每MUT秒3,009,412,603个字节秒生产力总用户数的69.7％，已用完总数的69.4％解决方案你的代码发生了什么， -O 让我放大你的主函数，并稍微改写它： main :: IO（） main = do [n，m]< - fmap（地图读取。字）getLine 行< - getLine let nodes = listArray（0，n）。 tonodes n。地图（减1）。地图阅读。字$ line replicateM_ m $查询n个节点显然，这里的意图是 NodeArray 创建一次，然后用于查询的每个 m code>。遗憾的是，GHC将此代码有效地转换为 main = do [n，m]< - fmap（map read。words）getLine line< - getLine replicateM_ m $ do let nodes = listArray（0，n）。 tonodes n。地图（减1）。地图阅读。单词$行查询n个节点你可以立刻在这里看到问题。 p> 什么是状态黑客，为什么它会破坏我的程序性能？原因是状态黑客，它（粗略地）说：当某些类型的 IO a 时，假定它只被调用一次。官方文档并没有更详细的说明： $ b -fno-state-hack 关闭state hack，任何带有State＃token的lambda都被认为是单条记录，因此将内联内容放在内部是可以的。这可以提高IO和ST monad代码的性能，但它会带来减少共享的风险。粗略地说，这个想法如下：如果使用 IO 类型和where子句定义函数，例如 foo x = do putStrLn y putStrLn y 其中y = ... x ... 类型 IO a 类型的东西可以被视为类型 RealWord - > （a，RealWorld）。在上述观点中，上述变成了（大致） foo x = let y = ... x .. 。in \world1 - > let（world2，（））= putStrLn y world1 （world3，（））= putStrLn y world2 $ b $ in（world3，（））调用 foo 会（通常）看起来像这样 foo参数世界。但是 foo 的定义只有一个参数，而另一个只在本地lambda表达式中被使用！这对 foo 来说是一个非常缓慢的调用。如果代码看起来像这样会更快： foo x world1 = let y = .. （world3，（））= putStrLn y world1 $ b $ let（world3，（））= putStrLn y world2 $ b $ in（world3，（））这称为eta-expansion，可以在各种理由下完成（例如，通过分析函数的定义，由检查它是如何被调用的，以及 - 在这种情况下是类型定向启发式检测）。不幸的是，对于 foo 的调用实际上是 let fooArgument = foo参数的形式，即使用参数没有 world 通过（尚未）。在原始代码中，如果多次使用 fooArgument ， y 仍然只会计算一次，并且是共享的。在修改后的代码中，每次都会重新计算 y - 准确地说明节点发生了什么。可以固定吗？可能。尝试这样做，请参阅＃9388 。解决这个问题的一个问题是，即使编译器无法确定地知道这一点，它仍然会在很多情况下都会降低性能。并且有可能出现技术上不正常的情况，即共享丢失，但仍然是有益的，因为加速呼叫的速度超过了重新计算的额外成本。所以目前还不清楚从哪里去。 This piece of Haskell code runs much slower with -O, but -O should be non-dangerous. Can anyone tell me what happened? If it matters, it is an attempt to solve this problem, and it uses binary search and persistent segment tree: import Control.Monad import Data.Array data Node = Leaf Int -- value | Branch Int Node Node -- sum, left child, right child type NodeArray = Array Int Node -- create an empty node with range [l, r) create :: Int -> Int -> Node create l r | l + 1 == r = Leaf 0 | otherwise = Branch 0 (create l m) (create m r) where m = (l + r) `div` 2 -- Get the sum in range [0, r). The range of the node is [nl, nr) sumof :: Node -> Int -> Int -> Int -> Int sumof (Leaf val) r nl nr | nr <= r = val | otherwise = 0 sumof (Branch sum lc rc) r nl nr | nr <= r = sum | r > nl = (sumof lc r nl m) + (sumof rc r m nr) | otherwise = 0 where m = (nl + nr) `div` 2 -- Increase the value at x by 1. The range of the node is [nl, nr) increase :: Node -> Int -> Int -> Int -> Node increase (Leaf val) x nl nr = Leaf (val + 1) increase (Branch sum lc rc) x nl nr | x < m = Branch (sum + 1) (increase lc x nl m) rc | otherwise = Branch (sum + 1) lc (increase rc x m nr) where m = (nl + nr) `div` 2 -- signature said it all tonodes :: Int -> [Int] -> [Node] tonodes n = reverse . tonodes' . reverse where tonodes' :: [Int] -> [Node] tonodes' (h:t) = increase h' h 0 n : s' where s'@(h':_) = tonodes' t tonodes' _ = [create 0 n] -- find the minimum m in [l, r] such that (predicate m) is True binarysearch :: (Int -> Bool) -> Int -> Int -> Int binarysearch predicate l r | l == r = r | predicate m = binarysearch predicate l m | otherwise = binarysearch predicate (m+1) r where m = (l + r) `div` 2 -- main, literally main :: IO () main = do [n, m] <- fmap (map read . words) getLine nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine replicateM_ m $ query n nodes where query :: Int -> NodeArray -> IO () query n nodes = do [p, k] <- fmap (map read . words) getLine print $ binarysearch (ok nodes n p k) 0 n where ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool ok nodes n p k s = (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k (This is exactly the same code with code review but this question addresses another problem.) This is my input generator in C++: #include <cstdio> #include <cstdlib> using namespace std; int main (int argc, char * argv[]) { srand(1827); int n = 100000; if(argc > 1) sscanf(argv[1], "%d", &n); printf("%d %d\n", n, n); for(int i = 0; i < n; i++) printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' '); for(int i = 0; i < n; i++) { int p = rand() % n; int k = rand() % n + 1; printf("%d %d\n", p, k); } } In case you don't have a C++ compiler available, this is the result of ./gen.exe 1000. This is the execution result on my computer: $ ghc --version The Glorious Glasgow Haskell Compilation System, version 7.8.3 $ ghc -fforce-recomp 1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ time ./gen.exe 1000 | ./1827.exe > /dev/null real 0m0.088s user 0m0.015s sys 0m0.015s $ ghc -fforce-recomp -O 1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ time ./gen.exe 1000 | ./1827.exe > /dev/null real 0m2.969s user 0m0.000s sys 0m0.045s And this is the heap profile summary: $ ghc -fforce-recomp -rtsopts ./1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null 70,207,096 bytes allocated in the heap 2,112,416 bytes copied during GC 613,368 bytes maximum residency (3 sample(s)) 28,816 bytes maximum slop 3 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 132 colls, 0 par 0.00s 0.00s 0.0000s 0.0004s Gen 1 3 colls, 0 par 0.00s 0.00s 0.0006s 0.0010s INIT time 0.00s ( 0.00s elapsed) MUT time 0.03s ( 0.03s elapsed) GC time 0.00s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.03s ( 0.04s elapsed) %GC time 0.0% (14.7% elapsed) Alloc rate 2,250,213,011 bytes per MUT second Productivity 100.0% of total user, 83.1% of total elapsed $ ghc -fforce-recomp -O -rtsopts ./1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null 6,009,233,608 bytes allocated in the heap 622,682,200 bytes copied during GC 443,240 bytes maximum residency (505 sample(s)) 48,256 bytes maximum slop 3 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 10945 colls, 0 par 0.72s 0.63s 0.0001s 0.0004s Gen 1 505 colls, 0 par 0.16s 0.13s 0.0003s 0.0005s INIT time 0.00s ( 0.00s elapsed) MUT time 2.00s ( 2.13s elapsed) GC time 0.87s ( 0.76s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.89s ( 2.90s elapsed) %GC time 30.3% (26.4% elapsed) Alloc rate 3,009,412,603 bytes per MUT second Productivity 69.7% of total user, 69.4% of total elapsed 解决方案 I guess it is time this question gets a proper answer. What happened to your code with -O Let me zoom in your main function, and rewrite it slightly: main :: IO () main = do [n, m] <- fmap (map read . words) getLine line <- getLine let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line replicateM_ m $ query n nodes Clearly, the intention here is that the NodeArray is created once, and then used in every of the m invocations of query. Unfortunately, GHC transforms this code to, effectively, main = do [n, m] <- fmap (map read . words) getLine line <- getLine replicateM_ m $ do let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line query n nodes and you can immediately see the problem here. What is the state hack, and why does it destroy my programs performance The reason is the state hack, which says (roughly): "When something is of type IO a, assume it is called only once.". The official documentation is not much more elaborate: -fno-state-hack Turn off the "state hack" whereby any lambda with a State# token as argument is considered to be single-entry, hence it is considered OK to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing. Roughly, the idea is as follows: If you define a function with an IO type and a where clause, e.g. foo x = do putStrLn y putStrLn y where y = ...x... Something of type IO a can be viewed as something of type RealWord -> (a, RealWorld). In that view, the above becomes (roughly) foo x = let y = ...x... in \world1 -> let (world2, ()) = putStrLn y world1 let (world3, ()) = putStrLn y world2 in (world3, ()) A call to foo would (typically) look like this foo argument world. But the definition of foo only takes one argument, and the other one is only consumed later by a local lambda expression! That is going to be a very slow call to foo. It would be much faster if the code would look like this: foo x world1 = let y = ...x... in let (world2, ()) = putStrLn y world1 let (world3, ()) = putStrLn y world2 in (world3, ()) This is called eta-expansion and done on various grounds (e.g. by analyzing the function’s definition, by checking how it is being called, and – in this case – type directed heuristics). Unfortunately it is unsound if the call to foo is actually of the form let fooArgument = foo argument, i.e. with an argument, but no world passed (yet). In the original code, if fooArgument is then used several times, y will still be calculated only once, and shared. In the modified code, y will be re-calculated every time – precisely what has happened to your nodes. Can things be fixed? Possibly. See #9388 for an attempt at doing so. The problem with fixing it is that it will cost performance in a lot of cases where the transformation happens to ok, even though the compiler cannot possibly know that for sure. And there are probably cases where it is technically not ok, i.e. sharing is lost, but it is still beneficial because the speedups from the faster calling outweigh the extra cost of the recalculation. So it is not clear where to go from here. 这篇关于为什么Haskell代码与-O运行速度较慢？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！查看全文

为什么Haskell代码与-O运行速度较慢？ [英] Why does this Haskell code run slower with -O?

问题描述

可以固定吗？

What happened to your code with `-O`

What is the state hack, and why does it destroy my programs performance

Can things be fixed?

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

为什么Haskell代码与-O运行速度较慢？ [英] Why does this Haskell code run slower with -O?

问题描述

可以固定吗？

What happened to your code with -O

What is the state hack, and why does it destroy my programs performance

Can things be fixed?

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

What happened to your code with `-O`

登录关闭