Profiling multithreading performance in a Haskell program: no speedups using parallel strategies


Problem description

After attempting to add multithreading functionality in a Haskell program, I noticed that performance didn't improve at all. Chasing it down, I got the following data from threadscope:

Green indicates running, and orange is garbage collection. Here vertical green bars indicate spark creation, blue bars are parallel GC requests, and light blue bars indicate thread creation. The labels are: spark created, requesting parallel GC, creating thread n, and stealing spark from cap 2.

On average, I'm only getting about 25% activity over 4 cores, which is no improvement at all over the single-threaded program.

Of course, the question would be void without a description of the actual program. Essentially, I create a traversable data structure (e.g. a tree), and then fmap a function over it, before then feeding it into an image writing routine (explaining the unambiguously single-threaded segment at the end of the program run, past 15s). Both the construction and the fmapping of the function take a significant amount of time to run, although the second slightly more so.

The above graphs were made by adding a parTraversable strategy for that data structure before it is consumed by the image writing. I have also tried using toList on the data structure and then using various parallel list strategies (parList, parListChunk, parBuffer), but the results were similar each time for a wide range of parameters (even using large chunks).

I also tried to fully evaluate the traversable data structure before fmapping the function over it, but the exact same problem occurred.
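
For readers unfamiliar with the list strategies named above, here is a minimal sketch of how each one is typically applied. It is not the code from the program in question: `expensive` is a hypothetical stand-in for the real per-element function, the input is assumed to have already been flattened with toList, and the chunk size (64) and buffer size (32) are arbitrary illustrative values.

```haskell
import Control.Parallel.Strategies
  (parBuffer, parList, parListChunk, rdeepseq, using)

-- Hypothetical stand-in for the expensive per-element function.
expensive :: Int -> Int
expensive n = sum [ (n * k) `mod` 7919 | k <- [1 .. 20000] ]

-- parList: one spark per list element.
viaParList :: [Int] -> [Int]
viaParList xs = map expensive xs `using` parList rdeepseq

-- parListChunk: one spark per chunk of 64 elements (size chosen arbitrarily).
viaParListChunk :: [Int] -> [Int]
viaParListChunk xs = map expensive xs `using` parListChunk 64 rdeepseq

-- parBuffer: a rolling window of at most 32 sparks (size chosen arbitrarily).
viaParBuffer :: [Int] -> [Int]
viaParBuffer xs = map expensive xs `using` parBuffer 32 rdeepseq

main :: IO ()
main = do
  let input    = [1 .. 200]
      expected = map expensive input
  -- A strategy only changes how the list is evaluated, never its value,
  -- so all three variants must agree with the sequential map.
  print (all (== expected)
             [viaParList input, viaParListChunk input, viaParBuffer input])
```

To see any parallelism, a program like this has to be built with `-threaded` and run with several capabilities, for example `+RTS -N4 -s`.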

Here are some additional statistics (for a different run of the same program):

   5,702,829,756 bytes allocated in the heap
     385,998,024 bytes copied during GC
      55,819,120 bytes maximum residency (8 sample(s))
       1,392,044 bytes maximum slop
             133 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause 
  Gen  0     10379 colls, 10378 par    5.20s    1.40s     0.0001s    0.0327s
  Gen  1         8 colls,     8 par    1.01s    0.25s     0.0319s    0.0509s

  Parallel GC work balance: 1.24 (96361163 / 77659897, ideal 4)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    ( 15.92s)       0.02s    (  0.02s)
  Task  1 (worker) :    0.27s    ( 14.00s)       1.86s    (  1.94s)
  Task  2 (bound)  :   14.24s    ( 14.30s)       1.61s    (  1.64s)
  Task  3 (worker) :    0.00s    ( 15.94s)       0.00s    (  0.00s)
  Task  4 (worker) :    0.25s    ( 14.00s)       1.66s    (  1.93s)
  Task  5 (worker) :    0.27s    ( 14.09s)       1.69s    (  1.84s)

  SPARKS: 595854 (595854 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   15.67s  ( 14.28s elapsed)
  GC      time    6.22s  (  1.66s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   21.89s  ( 15.94s elapsed)

  Alloc rate    363,769,460 bytes per MUT second

  Productivity  71.6% of total user, 98.4% of total elapsed
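
One way to read these figures is to divide CPU time by elapsed time, which gives the average number of cores kept busy. The sketch below only does arithmetic on numbers copied from the output above; the names are illustrative. It gives roughly 1.1 cores for the mutator, roughly 3.7 cores for the GC, about 91% of all mutator CPU time on the single bound task, and an average of roughly 26 microseconds of mutator CPU time per spark.

```haskell
-- Figures copied from the "+RTS -s" output above (times in seconds).
mutCpu, mutElapsed, gcCpu, gcElapsed, boundTaskMut, sparks :: Double
mutCpu       = 15.67
mutElapsed   = 14.28
gcCpu        = 6.22
gcElapsed    = 1.66
boundTaskMut = 14.24   -- MUT time of "Task 2 (bound)"
sparks       = 595854

-- Average number of cores busy = CPU seconds / wall-clock seconds.
avgCores :: Double -> Double -> Double
avgCores cpu elapsed = cpu / elapsed

-- Share of the mutator CPU time spent in the bound (main) task.
boundShare :: Double
boundShare = boundTaskMut / mutCpu

-- Average mutator CPU time available per spark, in microseconds.
microsPerSpark :: Double
microsPerSpark = mutCpu / sparks * 1e6

main :: IO ()
main = do
  putStrLn ("mutator cores busy:       " ++ show (avgCores mutCpu mutElapsed))
  putStrLn ("GC cores busy:            " ++ show (avgCores gcCpu gcElapsed))
  putStrLn ("bound task share of MUT:  " ++ show boundShare)
  putStrLn ("mutator us per spark:     " ++ show microsPerSpark)
```

Read this way, the garbage collector is running in parallel but the mutator is not: almost all of the real work is done by one task, even though every spark was converted.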

I'm not sure what other useful information I can give to assist answering. Profiling doesn't reveal anything interesting: it's the same as the single core statistics, except with an added IDLE taking up 75% of the time, as expected from the above.

What's happening that's preventing useful parallelisation?

Solution

Sorry that I couldn't provide code in a timely manner to assist respondents. It took me a while to untangle the exact location of the issue.

The problem was as follows: I was fmapping a function

f :: a -> S b

over the traversable data structure

structure :: T a

where S and T are two traversable functors.

Then, when using parTraversable, I was mistakenly writing

Compose (fmap f structure) `using` parTraversable rdeepseq

instead of

Compose (fmap f structure `using` parTraversable rdeepseq)

so I was wrongly using the Traversable instance for Compose T S to do the multithreading (using Data.Functor.Compose).
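
The difference between the two groupings can be reproduced in a small, self-contained sketch. Everything concrete here is an assumption made for illustration: T is taken to be the list functor, S is taken to be Maybe, and `f` and `structure` are hypothetical stand-ins. The intended grouping is written with explicit parentheses because, to my knowledge, `using` is declared `infixl 0` in Control.Parallel.Strategies while `$` is `infixr 0`, and GHC rejects an unparenthesised mix of the two.

```haskell
import Control.Parallel.Strategies (parTraversable, rdeepseq, using)
import Data.Functor.Compose (Compose (..))

-- Hypothetical stand-in for f :: a -> S b, with S = Maybe.
-- The `seq` makes the constructor itself depend on the expensive work,
-- so forcing `f n` to weak head normal form already pays the full cost.
f :: Int -> Maybe Int
f n =
  let s = sum [ (n * k) `mod` 7919 | k <- [1 .. 20000] ]
  in  s `seq` Just s

-- Hypothetical stand-in for structure :: T a, with T = [].
structure :: [Int]
structure = [1 .. 200]

-- Mistaken grouping: `using` applies to the Compose value, so
-- parTraversable walks the Traversable instance of Compose [] Maybe
-- and creates one spark per innermost Int.
sparkInner :: Compose [] Maybe Int
sparkInner = Compose (fmap f structure) `using` parTraversable rdeepseq

-- Intended grouping: `using` applies to the outer structure only, so each
-- spark evaluates one whole `f x` result (a Maybe Int) to normal form.
sparkOuter :: Compose [] Maybe Int
sparkOuter = Compose (fmap f structure `using` parTraversable rdeepseq)

main :: IO ()
main =
  -- Both groupings compute the same value; only the evaluation order
  -- and the granularity of the sparks differ.
  print (getCompose sparkInner == getCompose sparkOuter)
```

A plausible reading of the symptoms, under these assumptions: with the mistaken grouping the traversal has to inspect each `f x` result on the calling thread before it can reach the elements inside it, so the expensive work runs sequentially and the sparks are left with values that are already evaluated. That would be consistent with every spark being converted while one task accounts for nearly all of the mutator time.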

(This looks like it should've been easy to catch, but it took me a while to extract the above mistake from the code!)

This now looks much better!
