Code becomes slower as more boxed arrays are allocated
Edit: It turns out that things in general (not just array/ref operations) slow down the more arrays have been created, so I guess this might just be measuring increased GC times and might not be as strange as I thought. But I'd still like to know (and learn how to find out) what's happening here, and whether there's some way to mitigate this effect in code that creates lots of smallish arrays. Original question follows.
In investigating some weird benchmarking results in a library, I stumbled upon some behavior I don't understand, though it might be really obvious. It seems that the time taken for many operations (creating a new MutableArray, reading or modifying an IORef) increases in proportion to the number of arrays in memory.
Here's the first example:
module Main
where
import Control.Monad
import qualified Data.Primitive as P
import Control.Concurrent
import Data.IORef
import Criterion.Main
import Control.Monad.Primitive(PrimState)
main = do
  let n = 100000
  allTheArrays <- newIORef []
  defaultMain $
    [ bench "array creation" $ do
        newArr <- P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ())
        atomicModifyIORef' allTheArrays (\l -> (newArr:l, ()))
    ]
We're creating a new array and adding it to a stack. As criterion does more samples and the stack grows, array creation takes more time, and this seems to grow linearly and regularly:
Even more odd, IORef reads and writes are affected, and we can see the atomicModifyIORef' getting faster, presumably as more arrays are GC'd.
main = do
  let n = 1000000
  arrs <- replicateM n (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  -- print $ length arrs -- THIS WORKS TO MAKE THINGS FASTER
  arrsRef <- newIORef arrs
  defaultMain $
    [ bench "atomic-mods of IORef" $
      -- nfIO $ -- OR THIS ALSO WORKS
        replicateM 1000 $
          atomicModifyIORef' arrsRef (\(a:as) -> (as, ()))
    ]
Either of the two commented-out lines gets rid of this behavior, but I'm not sure why (maybe after we force the spine of the list, the elements can actually be collected).
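As a hedged illustration of that guess (this is speculation, not a confirmed explanation): `print $ length arrs` forces only the spine of the list, not its elements, so once elements are popped off the IORef nothing is left referencing them. The helper `forceSpine` below is my own name for what `length` effectively does, not a library function:

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import Data.IORef

-- Walk the spine of a list strictly without touching the elements.
-- This is the side effect that `print $ length arrs` has.
forceSpine :: [a] -> IO Int
forceSpine = go 0
  where
    go !acc []     = return acc
    go !acc (_:xs) = go (acc + 1) xs

main :: IO ()
main = do
  ref <- newIORef [ replicate 64 () | _ <- [1 :: Int .. 1000] ]
  n   <- readIORef ref >>= forceSpine
  print n  -- 1000: the spine is now fully evaluated
```

With the spine evaluated, each `atomicModifyIORef' arrsRef (\(a:as) -> (as, ()))` leaves the popped head unreachable, so (if this guess is right) the GC can reclaim it.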
Questions
- What's happening here?
- Is it expected behavior?
- Is there a way I can avoid this slowdown?
Edit: I assume this has something to do with GC taking longer, but I'd like to understand more precisely what's happening, especially in the first benchmark.
Bonus example
Finally, here's a simple test program that can be used to pre-allocate some number of arrays and time a bunch of atomicModifyIORef's. This seems to exhibit the slow IORef behavior.
import Control.Monad
import System.Environment
import qualified Data.Primitive as P
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Data.IORef
import Criterion.Main
import Control.Exception (evaluate)
import Control.Monad.Primitive (PrimState)
import qualified Data.Array.IO as IO
import qualified Data.Vector.Mutable as V
import System.CPUTime
import System.Mem (performGC)

main :: IO ()
main = do
  [n] <- fmap (map read) getArgs
  arrs <- replicateM n (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  arrsRef <- newIORef arrs
  t0 <- getCPUTimeDouble
  cnt <- newIORef (0 :: Int)
  replicateM_ 1000000 $
    (atomicModifyIORef' cnt (\n -> (n+1, ())) >>= evaluate)
  t1 <- getCPUTimeDouble
  -- make sure these stick around
  readIORef cnt >>= print
  readIORef arrsRef >>= (flip P.readArray 0 . head) >>= print
  putStrLn "The time:"
  print (t1 - t0)
A heap profile with -hy shows mostly MUT_ARR_PTRS_CLEAN, which I don't completely understand.
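On the point of measuring GC time when hand-timing as above: one option (a sketch; `timeSection` is my own helper, not a library function) is to force a major GC just before starting the clock, so garbage accumulated earlier isn't charged to the timed section:

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)
import System.Mem (performGC)

-- Time an IO action in seconds, running a major GC first so the
-- timed section doesn't pay for garbage produced before it started.
timeSection :: IO a -> IO (a, Double)
timeSection act = do
  performGC
  t0 <- getCPUTime              -- picoseconds
  x  <- act
  t1 <- getCPUTime
  return (x, fromIntegral (t1 - t0) / 1e12)

main :: IO ()
main = do
  (s, dt) <- timeSection (evaluate (sum [1 :: Int .. 100000]))
  print s   -- 5000050000
  print (dt >= 0)
```

This doesn't remove the per-minor-GC cost of live arrays during the timed section itself, of course; it only separates out the backlog.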
If you want to reproduce, here is the cabal file I've been using:

name:                small-concurrency-benchmarks
version:             0.1.0.0
build-type:          Simple
cabal-version:       >=1.10

executable small-concurrency-benchmarks
  main-is:             Main.hs
  build-depends:       base >=4.6
                     , criterion
                     , primitive
  default-language:    Haskell2010
  ghc-options:         -O2 -rtsopts
Edit: Here's another test program that can be used to compare slowdown with heaps of the same size of arrays vs [Integer]. It takes some trial and error adjusting n and observing profiling to get comparable runs.
-- needs: import Data.List (tails)
main4 :: IO ()
main4 = do
  [n] <- fmap (map read) getArgs
  let ns = [(1::Integer).. n]
  arrsRef <- newIORef ns
  print $ length ns
  t0 <- getCPUTimeDouble
  mapM_ (evaluate . sum) (tails [1.. 10000])
  t1 <- getCPUTimeDouble
  readIORef arrsRef >>= (print . sum)
  print (t1 - t0)
Interestingly, when I test this I find that the same heap-size worth of arrays affects performance to a greater degree than [Integer]. E.g.

             Baseline   20M    200M
    Lists:   0.7        1.0    4.4
    Arrays:  0.7        2.6    20.4
Conclusions (WIP)

- This is most likely due to GC behavior.
- But mutable boxed arrays seem to lead to more severe slowdowns (see above). Setting +RTS -A200M brings performance of the array-garbage version in line with the list version, supporting that this has to do with GC.
- The slowdown is proportional to the number of arrays allocated, not to the total number of cells in the arrays. Here is a set of runs showing, for a test similar to main4, the effect of the number of arrays allocated on both the time taken to allocate and a completely unrelated "payload". This is for 16777216 total cells (divided amongst however many arrays):

    Array size    Array create time    Time for "payload":
    8             3.164                14.264
    16            1.532                9.008
    32            1.208                6.668
    64            0.644                3.78
    128           0.528                2.052
    256           0.444                3.08
    512           0.336                4.648
    1024          0.356                0.652

  Running this same test on 16777216*4 cells shows basically identical payload times as above, only shifted down two places.
- From what I understand about how GHC works, and looking at (3), I think this overhead might simply come from having pointers to all these arrays sticking around in the remembered set (see also: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/GC/RememberedSets#Arrays%3aMUT_ARR_PTRS), and from whatever overhead that causes for the GC.
You are paying a linear overhead on every minor GC for each mutable array that remains live and has been promoted to the old generation. This is because GHC unconditionally places all mutable arrays on the mutable list and traverses the entire list on every minor GC. See https://ghc.haskell.org/trac/ghc/ticket/7662 for more information, as well as my mailing list response to your question: http://www.haskell.org/pipermail/glasgow-haskell-users/2014-May/024976.html
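Given that answer, one possible workaround (my own sketch, not something prescribed by the ticket): since the cost scales with the number of arrays on the mutable list, replace many small MutableArrays with one large one and address it by (chunk, offset). The names `chunkSize`, `readCell`, and `writeCell` are invented for illustration:

```haskell
import qualified Data.Primitive as P
import Control.Monad.Primitive (PrimState)

chunkSize :: Int
chunkSize = 64

-- Treat one big MutableArray as many logical 64-cell chunks, so the
-- generation's mutable list carries one entry instead of thousands.
readCell :: P.MutableArray (PrimState IO) a -> Int -> Int -> IO a
readCell big c i = P.readArray big (c * chunkSize + i)

writeCell :: P.MutableArray (PrimState IO) a -> Int -> Int -> a -> IO ()
writeCell big c i x = P.writeArray big (c * chunkSize + i) x

main :: IO ()
main = do
  big <- P.newArray (1000 * chunkSize) (0 :: Int)  -- one array, 1000 chunks
  writeCell big 3 5 42
  v <- readCell big 3 5
  print v  -- 42
```

Whether this actually wins depends on the workload (a single large array has its own card-marking behavior, and chunks can no longer be freed individually); bumping the nursery with +RTS -A, as in the conclusions above, is the other lever.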