Code becomes slower as more boxed arrays are allocated

Problem description

It turns out that things in general (not just array/ref operations) slow down the more arrays have been created, so I guess this might just be measuring increased GC time and may not be as strange as I thought. But I'd really like to know (and learn how to find out) what's going on here, and whether there's some way to mitigate this effect in code that creates lots of small arrays. The original question follows.

In investigating some weird benchmarking results in a library, I stumbled upon some behavior I don't understand, though it might be really obvious. It seems that the time taken for many operations (creating a new MutableArray, reading or modifying an IORef) increases in proportion to the number of arrays in memory.

Here's the first example:

module Main
    where 

import Control.Monad
import qualified Data.Primitive as P
import Control.Concurrent
import Data.IORef
import Criterion.Main
import Control.Monad.Primitive(PrimState)

main = do 
  let n = 100000
  allTheArrays <- newIORef []
  defaultMain $
    [ bench "array creation" $ do
         newArr <- P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ())
         atomicModifyIORef' allTheArrays (\l -> (newArr:l,()))
    ]

We're creating a new array and adding it to a stack. As criterion does more samples and the stack grows, array creation takes more time, and this seems to grow linearly and regularly.

Even more odd, IORef reads and writes are affected, and we can see the atomicModifyIORef' getting faster presumably as more arrays are GC'd.

main = do 
  let n = 1000000
  arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  -- print $ length arrs -- THIS WORKS TO MAKE THINGS FASTER
  arrsRef <- newIORef arrs
  defaultMain $
    [ bench "atomic-mods of IORef" $
    -- nfIO $           -- OR THIS ALSO WORKS
        replicateM 1000 $
          atomicModifyIORef' arrsRef (\(a:as) -> (as,()))
    ]

Either of the two commented-out lines gets rid of this behavior, but I'm not sure why (maybe after we force the spine of the list, the elements can actually be collected).
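
One way to probe that hypothesis (my sketch; forceSpine is a hypothetical helper, not from the original post) is to force just the spine before storing the list, without computing its length:

import Control.Exception (evaluate)

-- Walks every (:) cell, forcing the spine to normal form while
-- leaving the elements themselves untouched.
forceSpine :: [a] -> ()
forceSpine []     = ()
forceSpine (_:xs) = forceSpine xs

-- Used in place of `print $ length arrs`:
--   _ <- evaluate (forceSpine arrs)
--   arrsRef <- newIORef arrs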

  • What is happening here?
  • Is this expected behavior?
  • Is there some way I can avoid this slowdown?

Edit: I assume this has something to do with GC taking longer, but I'd like to understand more precisely what's happening, especially in the first benchmark.
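
One way to check the GC hypothesis directly (my addition, not from the post) is to sample the RTS statistics around the section in question. This sketch assumes GHC 8.2 or later for GHC.Stats.getRTSStats, and the program must be run with +RTS -T:

import Control.Monad (unless)
import Data.Int (Int64)
import GHC.Stats (RTSStats(..), getRTSStats, getRTSStatsEnabled)

-- Run an action and report the CPU time the GC used while it ran.
gcTimeOf :: IO a -> IO (a, Int64)
gcTimeOf act = do
  enabled <- getRTSStatsEnabled
  unless enabled $ error "re-run with +RTS -T to enable RTS stats"
  s0 <- getRTSStats
  x  <- act
  s1 <- getRTSStats
  return (x, gc_cpu_ns s1 - gc_cpu_ns s0)   -- nanoseconds of GC CPU time

If array creation really is paying a per-array GC tax, the GC time reported for a fixed batch of work should grow with the number of live arrays.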

Finally, here's a simple test program that can be used to pre-allocate some number of arrays and time a bunch of atomicModifyIORefs. This seems to exhibit the slow IORef behavior.

import Control.Monad
import System.Environment

import qualified Data.Primitive as P
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Data.IORef
import Criterion.Main
import Control.Exception(evaluate)
import Control.Monad.Primitive(PrimState)

import qualified Data.Array.IO as IO
import qualified Data.Vector.Mutable as V

import System.CPUTime
import System.Mem(performGC)

-- getCPUTimeDouble isn't exported by base's System.CPUTime; define it
-- here so the program compiles (getCPUTime returns picoseconds).
getCPUTimeDouble :: IO Double
getCPUTimeDouble = fmap (\t -> fromIntegral t / 1e12) getCPUTime

main :: IO ()
main = do
  [n] <- fmap (map read) getArgs
  arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  arrsRef <- newIORef arrs

  t0 <- getCPUTimeDouble

  cnt <- newIORef (0::Int)
  replicateM_ 1000000 $
    (atomicModifyIORef' cnt (\n -> (n+1,())) >>= evaluate)

  t1 <- getCPUTimeDouble

  -- make sure these stick around
  readIORef cnt >>= print
  readIORef arrsRef >>= (flip P.readArray 0 . head) >>= print

  putStrLn "The time:"
  print (t1 - t0)
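
Incidentally, the program imports System.Mem (performGC) but never calls it. A variant I find useful (my addition, not from the original post) brackets the timed section with major collections, so the interval is charged its own GC work and nothing inherited from setup:

-- Force a major GC on both sides of the clock: the measured section
-- pays for its own collection work, not for garbage left by setup.
timedSection :: IO () -> IO Double
timedSection act = do
  performGC
  t0 <- getCPUTimeDouble
  act
  performGC
  t1 <- getCPUTimeDouble
  return (t1 - t0)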

A heap profile with -hy shows mostly MUT_ARR_PTRS_CLEAN, which I don't completely understand.

If you want to reproduce, here is the cabal file I've been using:

name:                small-concurrency-benchmarks
version:             0.1.0.0       
build-type:          Simple
cabal-version:       >=1.10

executable small-concurrency-benchmarks
  main-is:             Main.hs 
  build-depends:       base >=4.6
                     , criterion
                     , primitive

  default-language:    Haskell2010
  ghc-options: -O2  -rtsopts

Edit: Here's another test program, which can be used to compare the slowdown for heaps of the same size made of arrays vs [Integer]. It takes some trial and error adjusting n and observing profiling to get comparable runs.

-- (main4 additionally needs: import Data.List (tails))
main4 :: IO ()
main4 = do
  [n] <- fmap (map read) getArgs
  let ns = [(1::Integer).. n]
  arrsRef <- newIORef ns
  print $ length ns

  t0 <- getCPUTimeDouble
  mapM_ (evaluate . sum) (tails [1.. 10000])
  t1 <- getCPUTimeDouble

  readIORef arrsRef >>= (print . sum)

  print (t1 - t0)

Interestingly, when I test this I find that the same heap size-worth of arrays affects performance to a greater degree than [Integer]. E.g.

         Baseline  20M   200M
Lists:   0.7       1.0    4.4
Arrays:  0.7       2.6   20.4

Conclusions (WIP)

  1. This is most likely due to GC behavior

  2. But mutable unboxed arrays seem to lead to more severe slowdowns (see above). Setting +RTS -A200M brings performance of the array garbage version in line with the list version, supporting that this has to do with GC.

  3. The slowdown is proportional to the number of arrays allocated, not the number of total cells. Here is a set of runs showing, for a test similar to main4, the effect of the number of arrays allocated both on the time taken to allocate and on a completely unrelated "payload". This is for 16777216 total cells (divided amongst however many arrays):

   Array size   Array create time  Time for "payload":
   8            3.164           14.264
   16           1.532           9.008
   32           1.208           6.668
   64           0.644           3.78
   128          0.528           2.052
   256          0.444           3.08
   512          0.336           4.648
   1024         0.356           0.652

And running this same test on 16777216*4 cells shows basically identical payload times as above, only shifted down two places.

  4. From what I understand about how GHC works, and looking at (3), I think this overhead might simply be from having pointers to all these arrays sticking around in the remembered set (see also: here), and whatever overhead that causes for the GC.

Answer

You are paying linear overhead every minor GC per mutable array that remains live and gets promoted to the old generation. This is because GHC unconditionally places all mutable arrays on the mutable list and traverses the entire list every minor GC. See https://ghc.haskell.org/trac/ghc/ticket/7662 for more information, as well as my mailing list response to your question: http://www.haskell.org/pipermail/glasgow-haskell-users/2014-May/024976.html
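
A possible mitigation, where the element type allows it (my sketch, not part of the accepted answer): a MutableByteArray from the same primitive package stores raw bytes rather than pointers, so, as I understand it, it is never placed on the mutable list and minor GCs don't have to walk it.

import qualified Data.Primitive.ByteArray as BA
import Control.Monad.Primitive (PrimState)

-- Unboxed stand-in for P.newArray 64 (): 64 slots of 8 bytes each.
-- No pointers inside, hence no mutable-list traffic at minor GCs.
newUnboxed64 :: IO (BA.MutableByteArray (PrimState IO))
newUnboxed64 = BA.newByteArray (64 * 8)

Otherwise, a larger nursery (+RTS -A200M, as in the conclusions above) reduces how often minor GCs run, and with them the list traversals.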
