加速Haskell并发 [英] Speed up Haskell concurrency

查看:148
本文介绍了加速Haskell并发的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

MVar,TVar,IORef,...我不能加速thunk问题(我想)。



(我的原始问题是一个线程代码,I doforkIOn次调用addMany;但我认为我的问题是shW函数)



让下面的代码:

  { - #LANGUAGE BangPatterns# - } 
import Control.Concurrent
import Control.Monad
import System.Environment getArgs)
import Data.Int
import Data.IORef

- i次,为每个IORef添加n(在a中)
addMany :: [IORef Int64] - > Int64 - > Int64 - > IO()
addMany!a!n!i =
forM_ [1..i](\_ - >
for M_ a(shW n))

- MVar,TVar,IORef,... read / write(x'= x + k)
shR = readIORef
shW!k!r = atomicModifyIORef r(\!x' >(x'+ k,()))

main = do
(n':i':_)< - getArgs
let =(read n',read i')
v < - forM [1..n](\_ - > newIORef 0)
addMany v 1 i
mapM shR v ;> =(putStrLn.show.sum)



  MUT时间3.12秒(已过去3.12秒)
GC时间6.96秒(已过去6.96秒)
...

成本中心模块条目%time%alloc%time%alloc

MAIN MAIN 47 0 0.0 0.0 100.0 100.0
main Main 95 0 0.0 0.0 100.0 100.0
main.i Main 100 1 0.0 0.0 0.0 0.0
addMany Main 99 1 0.4 0.5 100.0 100.0
addMany.\ Main 101 15000 6.6 0.0 99.6 99.5
shW Main 102 2250000 92.7 99.5 93.0 99.5
shW.\ Main 104 2250000 0.3 0.0 0.3 0.0



我不能在shW调用中清除thunk巨大)。错误?



类似的C#代码运行速度更快:

  class Node {
private object m;
private int n;

public Node(){n = 0; m = new object();}
public void Inc(){lock(m)n ++;}
public int Read(){return n;}
}

class MainClass {

public static void Main(string [] args){

var nitems = 1000;
var nthreads = 6;
var niters = 100000;

var g = Enumerable.Range(0,nitems).Select(i => new Node())。ToArray();
Task.WaitAll(Enumerable.Range(0,nthreads).Select(q => Task.Factory.StartNew(())=> {
var z = niters;
while z-> 0)
foreach(var i in g)
i.Inc();
}))ToArray());

Console.WriteLine(Sum node values:{0},g.Sum(i => i.Read()));

}
}

>

UPDATE



解决原始问题:,但是,至少在2007年,当我写了 strict-concurrency 包,以避免MVars的这个问题。在2009年讨论,我们现在在2012年有严格的版本)。



通过首先要求值,您可以删除该问题。例如。

  shW!k!r = --atomicModifyIORef r(\x  - >(x + k, 
do x< - shR r
writeIORef r $! (x + 1)

请注意,此问题现在 a>,您可以使用 atomicModifyIORef'来避免它。



我们得到:

  USDGHTVVH1 $ time ./A 1000 10000 + RTS -s 
10000000
堆中分配的120,738,836字节
在GC期间复制3,758,476字节
73,020字节最大驻留(1个样本)
16,348字节最大坡度
1 MB使用总内存(由于碎片而丢失0 MB)

总时间(已过)平均暂停最大暂停
Gen 0 230 colls,0 par 0.00s 0.01s 0.0000s 0.0001s
Gen 1 1 colls,0 par 0.00s 0.00s 0.0002s 0.0002 s

INIT时间0.00s(经过0.00s)
MUT时间0.17s(经过0.17s)
GC时间0.00s(经过0.01s)
退出时间0.00s(0.00s已过)
总时间0.19s(已过去0.17s)

%GC时间0.0%(已过3.9%)

分配率643,940,458字节每MUT秒

生产率总用户的100.0%,已用时间的108.3%


实数0m0.218s
用户0m0.000s
sys 0m0.015s

也就是说,22倍速加速,内存使用变得恒定。对于笑,这里是新的堆配置文件:




MVar, TVar, IORef, ... I can't speedup a thunk problem (I think).

(My original problem is a threaded code, I do "forkIO" n-times calling "addMany"; but I think my problem is on "shW" function)

Let next code:

{-# LANGUAGE BangPatterns #-}
import Control.Concurrent
import Control.Monad
import System.Environment(getArgs)
import Data.Int
import Data.IORef

-- "i" times, add "n" for each IORef (in "a")
addMany :: [IORef Int64] -> Int64 -> Int64 -> IO ()
addMany !a !n !i =
  forM_ [1..i] (\_ ->
    forM_ a (shW n))

-- MVar, TVar, IORef, ... read/write (x' = x + k)
shR = readIORef
shW !k !r = atomicModifyIORef r (\ !x' -> (x' + k, ()))

main = do
  (n':i':_) <- getArgs
  let (n, i) = (read n', read i')
  v <- forM [1..n] (\_ -> newIORef 0)
  addMany v 1 i
  mapM shR v >>= (putStrLn.show.sum)

then, profile result show:

MUT     time    3.12s  (  3.12s elapsed)
GC      time    6.96s  (  6.96s elapsed)
...

COST CENTRE  MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN         MAIN                     47           0    0.0    0.0   100.0  100.0
 main        Main                     95           0    0.0    0.0   100.0  100.0
  main.i     Main                    100           1    0.0    0.0     0.0    0.0
  addMany    Main                     99           1    0.4    0.5   100.0  100.0
   addMany.\ Main                    101       15000    6.6    0.0    99.6   99.5
    shW      Main                    102     2250000   92.7   99.5    93.0   99.5
     shW.\   Main                    104     2250000    0.3    0.0     0.3    0.0

I can't remove thunk on "shW" calls (and memory usage is huge). What wrong?

A similar C# code run much (much) faster:

class Node { 
    private object m; 
    private int n; 

    public Node() {n = 0; m = new object();} 
    public void Inc() {lock(m) n++;} 
    public int Read() {return n;} 
} 

class MainClass { 

    public static void Main(string[] args) { 

        var nitems = 1000; 
        var nthreads = 6; 
        var niters = 100000; 

        var g = Enumerable.Range(0, nitems).Select(i => new Node()).ToArray(); 
        Task.WaitAll(Enumerable.Range(0, nthreads).Select(q => Task.Factory.StartNew(() => { 
            var z = niters; 
            while(z-- > 0) 
                foreach(var i in g) 
                    i.Inc(); 
        })).ToArray()); 

        Console.WriteLine("Sum node values: {0}", g.Sum(i => i.Read())); 

    } 
} 

Thanks a lot!

UPDATE

Solved original problem at: https://gist.github.com/3742897

Thank you very much Don Stewart!

解决方案

It is immediately obvious there is a leak when you look at the heap and the GC time:

USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
   1,208,298,840 bytes allocated in the heap
   1,352,861,868 bytes copied during GC
     280,181,684 bytes maximum residency (9 sample(s))
       4,688,680 bytes maximum slop
             545 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1677 colls,     0 par    2.27s    2.22s     0.0013s    0.0063s
  Gen  1         9 colls,     0 par    1.66s    1.77s     0.1969s    1.0273s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.70s  (  0.77s elapsed)
  GC      time    3.92s  (  4.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    4.62s  (  4.77s elapsed)

  %GC     time      84.8%  (83.8% elapsed)

  Alloc rate    1,718,469,461 bytes per MUT second

  Productivity  15.2% of total user, 14.7% of total elapsed

  real    0m4.752s
  user    0m0.015s
  sys     0m0.046s

280M residency and 89% GC. A lot of thunks are being allocated and thrown away.

A heap profile makes this obivous.

The clue is that these are "stg_app*" thingies (i.e. STG machine apply thunks).

A subtle, but remarkable, issue with the modify family of functions is the issue here -- when you have a lazy atomicModify, there simply is no way to strictly update the field, without demanding the value.

So all your careful use of atomicModifyIORef r (\ !x' -> (x' + k, ())) does is build a chain of applications of the (+) function, such that the result of the chain (i.e. the value in the cell) is observed, each addition will be strict in its argument. Not what you want! None of your strictness annotations on the argument to modifyIORef will have any effect on the cell itself. Now, normally a lazy modify is what you want -- it is just a pointer swap , so you can have very short atomic sections.

But sometimes that's not what you want.

(For the background to this issue see GHC ticket #5926, however, the issue was known at least in 2007, when I wrote the strict-concurrency package to avoid this issue with MVars. It was discussed in 2009, and we now have strict versions in 2012).

By demanding the value first, you can remove the issue. E.g.

shW !k !r = --atomicModifyIORef r (\x -> (x + k, ()))
    do x <- shR r
       writeIORef r $! (x+1)

Note that this issue is now documented in the libraries, and you can use atomicModifyIORef' to avoid it.

And we get:

USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
     120,738,836 bytes allocated in the heap
       3,758,476 bytes copied during GC
          73,020 bytes maximum residency (1 sample(s))
          16,348 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       230 colls,     0 par    0.00s    0.01s     0.0000s    0.0001s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.17s  (  0.17s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.19s  (  0.17s elapsed)

  %GC     time       0.0%  (3.9% elapsed)

  Alloc rate    643,940,458 bytes per MUT second

  Productivity 100.0% of total user, 108.3% of total elapsed


real    0m0.218s
user    0m0.000s
sys     0m0.015s

That is, a 22x speedup, and memory use becomes constant. For laughs, here's the new heap profile:

这篇关于加速Haskell并发的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆