Code becomes slower as more boxed arrays are allocated


Problem description


Edit: It turns out that things generally (not just array/ref operations) slow down the more arrays have been created, so I guess this might just be measuring increased GC times and might not be as strange as I thought. But I'd really like to know (and learn how to find out) what's actually happening here, and whether there's some way to mitigate this effect in code that creates lots of smallish arrays. The original question follows.


In investigating some weird benchmarking results in a library, I stumbled upon some behavior I don't understand, though it might be really obvious. It seems that the time taken for many operations (creating a new MutableArray, reading or modifying an IORef) increases in proportion to the number of arrays in memory.

Here's the first example:

module Main
    where 

import Control.Monad
import qualified Data.Primitive as P
import Control.Concurrent
import Data.IORef
import Criterion.Main
import Control.Monad.Primitive(PrimState)

main = do 
  let n = 100000
  allTheArrays <- newIORef []
  defaultMain $
    [ bench "array creation" $ do
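         -- note: this appears to rely on the older criterion API, where an
         -- IO action was itself Benchmarkable; with criterion >= 1.0 this
         -- would need to be wrapped in whnfIO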
         newArr <- P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ())
         atomicModifyIORef'  allTheArrays (\l-> (newArr:l,()))
    ]

We're creating a new array and pushing it onto a stack. As criterion takes more samples and the stack grows, array creation takes more time, and the growth appears linear and regular.

Even odder, IORef reads and writes are affected, and we can see atomicModifyIORef' getting faster, presumably as more arrays are GC'd.

main = do 
  let n = 1000000
  arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  -- print $ length arrs -- THIS WORKS TO MAKE THINGS FASTER
  arrsRef <- newIORef arrs
  defaultMain $
    [ bench "atomic-mods of IORef" $
    -- nfIO $           -- OR THIS ALSO WORKS
        replicateM 1000 $
          atomicModifyIORef' arrsRef (\(a:as)-> (as,()))
    ]

Either of the two commented-out lines gets rid of this behavior, but I'm not sure why (maybe after we force the spine of the list, the elements can actually be collected).
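
If that spine hypothesis is right, forcing the spine explicitly should have the same effect as the print $ length arrs trick. Here's a minimal sketch (my own helper, not from the code above) that forces just the spine, leaving the elements unevaluated:

import Control.Exception (evaluate)

-- walk the whole spine of the list without touching the elements
forceSpine :: [a] -> IO ()
forceSpine xs = evaluate (go xs)
  where
    go []     = ()
    go (_:ys) = go ys

Calling forceSpine arrs before newIORef arrs should then behave like the length version, which would support (or refute) that explanation.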

Questions

  • What's happening here?
  • Is it expected behavior?
  • Is there a way I can avoid this slowdown?

Edit: I assume this has something to do with GC taking longer, but I'd like to understand more precisely what's happening, especially in the first benchmark.


Bonus example

Finally, here's a simple test program that can be used to pre-allocate some number of arrays and then time a bunch of atomicModifyIORef' calls. This seems to exhibit the slow IORef behavior.

import Control.Monad
import System.Environment

import qualified Data.Primitive as P
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Data.IORef
import Criterion.Main
import Control.Exception(evaluate)
import Control.Monad.Primitive(PrimState)

import qualified Data.Array.IO as IO
import qualified Data.Vector.Mutable as V

import System.CPUTime
import System.Mem(performGC)
main :: IO ()
main = do
  [n] <- fmap (map read) getArgs
  arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  arrsRef <- newIORef arrs

  t0 <- getCPUTimeDouble

  cnt <- newIORef (0::Int)
  replicateM_ 1000000 $
    (atomicModifyIORef' cnt (\n-> (n+1,())) >>= evaluate)

  t1 <- getCPUTimeDouble

  -- make sure these stick around
  readIORef cnt >>= print
  readIORef arrsRef >>= (flip P.readArray 0 . head) >>= print

  putStrLn "The time:"
  print (t1 - t0)

A heap profile with -hy shows mostly MUT_ARR_PTRS_CLEAN, which I don't completely understand.

If you want to reproduce, here is the cabal file I've been using

name:                small-concurrency-benchmarks
version:             0.1.0.0       
build-type:          Simple
cabal-version:       >=1.10

executable small-concurrency-benchmarks
  main-is:             Main.hs 
  build-depends:       base >=4.6
                     , criterion
                     , primitive

  default-language:    Haskell2010
  ghc-options: -O2  -rtsopts

Edit: Here's another test program that can be used to compare the slowdown for heaps of the same size made up of arrays vs. of [Integer]. It takes some trial and error adjusting n and observing the profiling output to get comparable runs.

-- (assumes the same imports as the program above, plus Data.List for tails)
main4 :: IO ()
main4 = do
  [n] <- fmap (map read) getArgs
  let ns = [(1::Integer).. n]
  arrsRef <- newIORef ns
  print $ length ns

  t0 <- getCPUTimeDouble
  mapM (evaluate . sum) (tails [1.. 10000])
  t1 <- getCPUTimeDouble

  readIORef arrsRef >>= (print . sum)

  print (t1 - t0)

Interestingly, when I test this I find that a heap full of arrays affects performance to a greater degree than an equally sized heap of [Integer]. E.g.

         Baseline  20M   200M
Lists:   0.7       1.0    4.4
Arrays:  0.7       2.6   20.4

Conclusions (WIP)

  1. This is most likely due to GC behavior.

  2. But mutable boxed arrays seem to lead to more severe slowdowns (see above). Setting +RTS -A200M brings the performance of the array version in line with the list version, supporting the idea that this has to do with GC.

  3. The slowdown is proportional to the number of arrays allocated, not the total number of cells across them. Here is a set of runs showing, for a test similar to main4, the effect of the number of arrays allocated both on the time taken to allocate and on a completely unrelated "payload". This is for 16777216 total cells (divided amongst however many arrays); a rough reconstruction of this test appears after the list:

       Array size   Array create time  Time for "payload"
       8            3.164           14.264
       16           1.532           9.008
       32           1.208           6.668
       64           0.644           3.78
       128          0.528           2.052
       256          0.444           3.08
       512          0.336           4.648
       1024         0.356           0.652
    

    And running this same test on 16777216*4 cells shows basically the same payload times as above, only shifted down two rows.

  4. From what I understand about how GHC works, and looking at (3), I think this overhead might simply come from having pointers to all these arrays sticking around in the remembered set (see also: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/GC/RememberedSets#Arrays%3aMUT_ARR_PTRS), and from whatever overhead that causes for the GC.
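
For reference, here is a rough reconstruction (mine, not the exact program behind the table above) of the test described in (3): the total cell count stays fixed while the array size varies, and both allocation and an unrelated pure "payload" are timed while the arrays are kept live. The payload size is a placeholder:

import Control.Exception (evaluate)
import Control.Monad (forM_, replicateM)
import Control.Monad.Primitive (PrimState)
import qualified Data.Primitive as P
import System.CPUTime (getCPUTime)

-- CPU time in seconds (getCPUTime reports picoseconds)
seconds :: IO Double
seconds = fmap ((/ 1e12) . fromIntegral) getCPUTime

main :: IO ()
main = do
  let totalCells = 16777216 :: Int
  forM_ [8, 16, 32, 64, 128, 256, 512, 1024] $ \sz -> do
    t0 <- seconds
    arrs <- replicateM (totalCells `div` sz)
              (P.newArray sz () :: IO (P.MutableArray (PrimState IO) ()))
    t1 <- seconds
    -- completely unrelated pure work, timed while the arrays are live
    _ <- evaluate (sum [1 .. 1000000 :: Integer])
    t2 <- seconds
    -- touch the arrays afterwards so they stay live through the payload
    _ <- evaluate (length arrs)
    putStrLn $ unwords [show sz, show (t1 - t0), show (t2 - t1)]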

Solution

You are paying linear overhead every minor GC per mutable array that remains live and gets promoted to the old generation. This is because GHC unconditionally places all mutable arrays on the mutable list and traverses the entire list every minor GC. See https://ghc.haskell.org/trac/ghc/ticket/7662 for more information, as well as my mailing list response to your question: http://www.haskell.org/pipermail/glasgow-haskell-users/2014-May/024976.html
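
One possible mitigation (my suggestion, not part of the answer above): when the per-cell data can live unboxed, a MutableByteArray contains no pointers at all, so it should never end up on the mutable list, and allocating many of them should add no per-array work to minor GCs. A minimal sketch using the primitive package:

import Control.Monad (replicateM)
import Control.Monad.Primitive (PrimState)
import qualified Data.Primitive.ByteArray as BA
import Data.IORef

main :: IO ()
main = do
  -- 64 slots of 8 bytes each, standing in for the 64-element boxed arrays
  arrs <- replicateM 100000
            (BA.newByteArray (64 * 8)
               :: IO (BA.MutableByteArray (PrimState IO)))
  arrsRef <- newIORef arrs
  -- ... run the atomicModifyIORef' benchmarks against arrsRef as before ...
  xs <- readIORef arrsRef
  print (length xs)  -- keep the byte arrays live

Failing that, the +RTS -A200M observation in conclusion (2) is the cheaper knob: a larger nursery means fewer minor GCs, and therefore fewer traversals of the mutable list.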
