改进抓捕实施 [英] Improving treap implementation
问题描述
这是我实现一种treap(隐式键和节点中存储的一些其他信息): http:/ /hpaste.org/42839/treap_with_implicit_keys
Here's my implementation of a sort of treap (with implicit keys and some additional information stored in nodes): http://hpaste.org/42839/treap_with_implicit_keys
根据剖析数据,GC需要80%的时间用于此程序。据我所知,这是由于每次修改节点时,重新创建到根的路径上的每个节点。
According to profiling data GC takes 80% of time for this program. As far as I understand, it's caused by the fact that every time a node is 'modified', each node on the path to the root is recreated.
有没有什么可以在这里提高性能,或者我必须下降到ST monad的领域?
Is there something I can do here to improve performance or I have to descend into the realm of ST monad?
推荐答案
使用GHC 7.0.3,我可以重现您的重GC流程:
Using GHC 7.0.3, I can reproduce your heavy GC behavior:
$ time ./A +RTS -s
%GC time 92.9% (92.9% elapsed)
./A +RTS -s 7.24s user 0.04s system 99% cpu 7.301 total
我花了10分钟的时间来完成程序。这是我做的,按顺序:
I spent 10 minutes going through the program. Here's what I did, in order:
- 设置GHC的-H标志,增加GC的限制
- 检查解包
- 改进内联
- 调整第一代分配区域
- Set GHC's -H flag, increasing limits in the GC
- Check unpacking
- Improve inlining
- Adjust the first generation allocation area
导致10倍的加速,GC约为45%的时间。
Resulting in a 10 fold speedup, and GC around 45% of time.
按顺序,使用GHC的魔法 -H
标志,我们可以减少这个运行时:
In order, using GHC's magic -H
flag, we can reduce that runtime quite a bit:
$ time ./A +RTS -s -H
%GC time 74.3% (75.3% elapsed)
./A +RTS -s -H 2.34s user 0.04s system 99% cpu 2.392 total
不错!
Tree
节点上的UNPACK编译指示将不会执行任何操作,因此请删除它们。
The UNPACK pragmas on the Tree
nodes won't do anything, so remove those.
内联更新
更多的运行时:
./A +RTS -s -H 1.84s user 0.04s system 99% cpu 1.883 total
与内嵌 height
./A +RTS -s -H 1.74s user 0.03s system 99% cpu 1.777 total
所以while它是快速的,GC仍然是主导的 - 因为我们测试分配。
我们可以做的一件事是增加第一个gen大小:
So while it is fast, GC is still dominating -- since we're testing allocation, after all. One thing we can do is increase the first gen size:
$ time ./A +RTS -s -A200M
%GC time 45.1% (40.5% elapsed)
./A +RTS -s -A200M 0.71s user 0.16s system 99% cpu 0.872 total
如JohnL所建议的那样,增加展开门槛有所帮助,
And increasing the unfolding threshold, as JohnL suggested, helps a little,
./A +RTS -s -A100M 0.74s user 0.09s system 99% cpu 0.826 total
这是什么,比我们开始快10倍?不错。
which is what, 10x faster than we started? Not bad.
使用 ghc-gc-tune ,您可以将运行时视为 -A
和 -H
,
Using ghc-gc-tune, you can see runtime as a function of -A
and -H
,
有趣的是,最佳运行时间使用非常大的 -A
值,例如
Interestingly, the best running times use very large -A
values, e.g.
$ time ./A +RTS -A500M
./A +RTS -A500M 0.49s user 0.28s system 99% cpu 0.776s
这篇关于改进抓捕实施的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!