Code becomes slower as more boxed arrays are allocated

Problem description

It turns out that things in general (not just array/ref operations) slow down the more arrays have been created, so I guess this might just be measuring increased GC time and may not be as strange as I thought. But I'd really like to know (and learn how to find out) what's going on here, and whether there's some way to mitigate this effect in code that creates lots of small arrays. The original question follows.

In investigating some weird benchmarking results in a library, I stumbled upon some behavior I don't understand, though it might be really obvious. It seems that the time taken for many operations (creating a new MutableArray, reading or modifying an IORef) increases in proportion to the number of arrays in memory.

Here's the first example:

module Main
    where 

import Control.Monad
import qualified Data.Primitive as P
import Control.Concurrent
import Data.IORef
import Criterion.Main
import Control.Monad.Primitive(PrimState)

main = do 
  let n = 100000
  allTheArrays <- newIORef []
  defaultMain $
    [ bench "array creation" $ do
         newArr <- P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ())
         atomicModifyIORef' allTheArrays (\l -> (newArr:l,()))
    ]

We're creating a new array and adding it to a stack. As criterion does more samples and the stack grows, array creation takes more time, and this seems to grow linearly and regularly.

Even more odd, IORef reads and writes are affected, and we can see the atomicModifyIORef' getting faster presumably as more arrays are GC'd.

main = do 
  let n = 1000000
  arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  -- print $ length arrs -- THIS WORKS TO MAKE THINGS FASTER
  arrsRef <- newIORef arrs
  defaultMain $
    [ bench "atomic-mods of IORef" $
    -- nfIO $           -- OR THIS ALSO WORKS
        replicateM 1000 $
          atomicModifyIORef' arrsRef (\(a:as) -> (as,()))
    ]

Either of the two commented-out lines gets rid of this behavior, but I'm not sure why (maybe after we force the spine of the list, the elements can actually be collected).
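
One way to probe that hypothesis (my sketch; forceSpine is a hypothetical helper, not from the original post) is to force just the spine before storing the list, without computing its length:

import Control.Exception (evaluate)

-- Walks every (:) cell, forcing the spine to normal form while
-- leaving the elements themselves untouched.
forceSpine :: [a] -> ()
forceSpine []     = ()
forceSpine (_:xs) = forceSpine xs

-- Used in place of `print $ length arrs`:
--   _ <- evaluate (forceSpine arrs)
--   arrsRef <- newIORef arrs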

  • What is happening here?
  • Is this expected behavior?
  • Is there some way I can avoid this slowdown?

Edit: I assume this has something to do with GC taking longer, but I'd like to understand more precisely what's happening, especially in the first benchmark.
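
One way to check the GC hypothesis directly (my addition, not from the post) is to sample the RTS statistics around the section in question. This sketch assumes GHC 8.2 or later for GHC.Stats.getRTSStats, and the program must be run with +RTS -T:

import Control.Monad (unless)
import Data.Int (Int64)
import GHC.Stats (RTSStats(..), getRTSStats, getRTSStatsEnabled)

-- Run an action and report the CPU time the GC used while it ran.
gcTimeOf :: IO a -> IO (a, Int64)
gcTimeOf act = do
  enabled <- getRTSStatsEnabled
  unless enabled $ error "re-run with +RTS -T to enable RTS stats"
  s0 <- getRTSStats
  x  <- act
  s1 <- getRTSStats
  return (x, gc_cpu_ns s1 - gc_cpu_ns s0)   -- nanoseconds of GC CPU time

If array creation really is paying a per-array GC tax, the GC time reported for a fixed batch of work should grow with the number of live arrays.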

Finally, here's a simple test program that can be used to pre-allocate some number of arrays and time a bunch of atomicModifyIORefs. This seems to exhibit the slow IORef behavior.

import Control.Monad
import System.Environment

import qualified Data.Primitive as P
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Data.IORef
import Criterion.Main
import Control.Exception(evaluate)
import Control.Monad.Primitive(PrimState)

import qualified Data.Array.IO as IO
import qualified Data.Vector.Mutable as V

import System.CPUTime
import System.Mem(performGC)

-- getCPUTimeDouble isn't exported by base's System.CPUTime; define it
-- here so the program compiles (getCPUTime returns picoseconds).
getCPUTimeDouble :: IO Double
getCPUTimeDouble = fmap (\t -> fromIntegral t / 1e12) getCPUTime

main :: IO ()
main = do
  [n] <- fmap (map read) getArgs
  arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  arrsRef <- newIORef arrs

  t0 <- getCPUTimeDouble

  cnt <- newIORef (0::Int)
  replicateM_ 1000000 $
    (atomicModifyIORef' cnt (\n -> (n+1,())) >>= evaluate)

  t1 <- getCPUTimeDouble

  -- make sure these stick around
  readIORef cnt >>= print
  readIORef arrsRef >>= (flip P.readArray 0 . head) >>= print

  putStrLn "The time:"
  print (t1 - t0)
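
Incidentally, the program imports System.Mem (performGC) but never calls it. A variant I find useful (my addition, not from the original post) brackets the timed section with major collections, so the interval is charged its own GC work and nothing inherited from setup:

-- Force a major GC on both sides of the clock: the measured section
-- pays for its own collection work, not for garbage left by setup.
timedSection :: IO () -> IO Double
timedSection act = do
  performGC
  t0 <- getCPUTimeDouble
  act
  performGC
  t1 <- getCPUTimeDouble
  return (t1 - t0)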

A heap profile with -hy shows mostly MUT_ARR_PTRS_CLEAN, which I don't completely understand.

If you want to reproduce, here is the cabal file I've been using:

name:                small-concurrency-benchmarks
version:             0.1.0.0       
build-type:          Simple
cabal-version:       >=1.10

executable small-concurrency-benchmarks
  main-is:             Main.hs 
  build-depends:       base >=4.6
                     , criterion
                     , primitive

  default-language:    Haskell2010
  ghc-options: -O2  -rtsopts

Edit: Here's another test program, which can be used to compare the slowdown for heaps of the same size made of arrays vs [Integer]. It takes some trial and error adjusting n and observing profiling to get comparable runs.

-- (main4 additionally needs: import Data.List (tails))
main4 :: IO ()
main4 = do
  [n] <- fmap (map read) getArgs
  let ns = [(1::Integer).. n]
  arrsRef <- newIORef ns
  print $ length ns

  t0 <- getCPUTimeDouble
  mapM_ (evaluate . sum) (tails [1.. 10000])
  t1 <- getCPUTimeDouble

  readIORef arrsRef >>= (print . sum)

  print (t1 - t0)

Interestingly, when I test this I find that the same heap size-worth of arrays affects performance to a greater degree than [Integer]. E.g.

         Baseline  20M   200M
Lists:   0.7       1.0    4.4
Arrays:  0.7       2.6   20.4

Conclusions (WIP)

  1. This is most likely due to GC behavior

  2. But mutable unboxed arrays seem to lead to more severe slowdowns (see above). Setting +RTS -A200M brings performance of the array garbage version in line with the list version, supporting that this has to do with GC.

  3. The slowdown is proportional to the number of arrays allocated, not the number of total cells. Here is a set of runs showing, for a test similar to main4, the effect of the number of arrays allocated both on the time taken to allocate and on a completely unrelated "payload". This is for 16777216 total cells (divided amongst however many arrays):

   Array size   Array create time  Time for "payload":
   8            3.164           14.264
   16           1.532           9.008
   32           1.208           6.668
   64           0.644           3.78
   128          0.528           2.052
   256          0.444           3.08
   512          0.336           4.648
   1024         0.356           0.652

And running this same test on 16777216*4 cells shows basically identical payload times as above, only shifted down two places.

  4. From what I understand about how GHC works, and looking at (3), I think this overhead might simply be from having pointers to all these arrays sticking around in the remembered set (see also: here), and whatever overhead that causes for the GC.

Answer

You are paying linear overhead every minor GC per mutable array that remains live and gets promoted to the old generation. This is because GHC unconditionally places all mutable arrays on the mutable list and traverses the entire list every minor GC. See https://ghc.haskell.org/trac/ghc/ticket/7662 for more information, as well as my mailing list response to your question: http://www.haskell.org/pipermail/glasgow-haskell-users/2014-May/024976.html
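
A possible mitigation, where the element type allows it (my sketch, not part of the accepted answer): a MutableByteArray from the same primitive package stores raw bytes rather than pointers, so, as I understand it, it is never placed on the mutable list and minor GCs don't have to walk it.

import qualified Data.Primitive.ByteArray as BA
import Control.Monad.Primitive (PrimState)

-- Unboxed stand-in for P.newArray 64 (): 64 slots of 8 bytes each.
-- No pointers inside, hence no mutable-list traffic at minor GCs.
newUnboxed64 :: IO (BA.MutableByteArray (PrimState IO))
newUnboxed64 = BA.newByteArray (64 * 8)

Otherwise, a larger nursery (+RTS -A200M, as in the conclusions above) reduces how often minor GCs run, and with them the list traversals.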
