No speedup with naive merge sort parallelization in Haskell


Note: This post was completely rewritten 2011-06-10; thanks to Peter for helping me out. Also, please don't be offended if I don't accept one answer, since this question seems to be rather open-ended. (But, if you solve it, you get the check mark, of course).

Another user had posted a question about parallelizing a merge sort. I thought I'd write a simple solution, but alas, it is not much faster than the sequential version.

Problem statement

Merge sort is a divide-and-conquer algorithm, where the leaves of computation can be parallelized.

The code works as follows: the list is converted into a tree, representing computation nodes. Then, the merging step returns a list for each node. Theoretically, we should see some significant performance gains, since we're going from an O(n log n) algorithm to an O(n) algorithm with infinite processors.

The first steps of the computation are parallelized, when parameter l (level) is greater than zero below. This is done by [via variable strat] selecting the rpar strategy, which will make sub-computation mergeSort' x occur in parallel with mergeSort' y. Then, we merge the results, and force its evaluation with rdeepseq.

import Control.Arrow ((***))
import Control.DeepSeq (NFData(..), deepseq)
import Control.Parallel.Strategies (Eval, runEval, rpar, rseq, rdeepseq)

data Tree a = Leaf a | Node (Tree a) (Tree a) deriving (Show)

instance NFData a => NFData (Tree a) where
    rnf (Leaf v) = deepseq v ()
    rnf (Node x y) = deepseq (x, y) ()

listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = uncurry Node $ listToTree *** listToTree $
    splitAt (length xs `div` 2) xs

-- mergeSort' :: (Ord a, NFData a) => Int -> Tree a -> Eval [a]
mergeSort' l (Leaf v) = return [v]
mergeSort' l (Node x y) = do
    xr <- strat $ runEval $ mergeSort' (l - 1) x
    yr <- rseq $ runEval $ mergeSort' (l - 1) y
    rdeepseq (merge xr yr)
    where
        merge [] y = y
        merge x [] = x
        merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
                            | otherwise = y : merge (x:xs) ys
        strat | l > 0 = rpar
              | otherwise = rseq

mergeSort = runEval . mergeSort' 10

By only sparking the first few levels of the computation, we should have decent parallel communication complexity as well -- on the order of n with some constant factor.

Results

Obtain the 4th version source code here [ http://pastebin.com/DxYneAaC ], and run it with the following to inspect thread usage, or subsequent command lines for benchmarking,

rm -f ParallelMergeSort; ghc -O2 -O3 -optc-O3 -optc-ffast-math -eventlog --make -rtsopts -threaded ParallelMergeSort.hs
./ParallelMergeSort +RTS -H512m -K512m -ls -N
threadscope ParallelMergeSort.eventlog

Results on a 24-core X5680 @ 3.33GHz show little improvement

> ./ParallelMergeSort 
initialization: 10.461204s sec.
sorting: 6.383197s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -N
initialization: 27.94877s sec.
sorting: 5.228463s sec.

and on my own machine, a quad-core Phenom II,

> ./ParallelMergeSort 
initialization: 18.943919s sec.
sorting: 10.465077s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -ls -N
initialization: 22.92075s sec.
sorting: 7.431716s sec.

Inspecting the result in threadscope shows good utilization for small amounts of data (though, sadly, no perceptible speedup). However, when I try to run it on larger lists, like the above, it only uses about 2 CPUs half the time. It seems like a lot of sparks are getting pruned. It's also sensitive to the memory parameters, where 256mb is the sweet spot: 128mb gives 9 seconds, 512 gives 8.4, and 1024 gives 12.3!

Solutions I'm looking for

Finally, if anyone knows some high-power tools to throw at this, I'd appreciate it. (Eden?). My primary interest in Haskell parallelism is to be able to write small supportive tools for research projects, which I can throw on a 24 or 80 core server in our lab's cluster. Since they're not the main point of our group's research, I don't want to spend much time on the parallelization efficiency. So, for me, simpler is better, even if I only end up getting 20% usage.

Further discussion

  • I notice that the second bar in threadscope is sometimes green (c.f. its homepage, where the second bar seems to always be garbage collection). What does this mean?
  • Is there any way to sidestep garbage collection? It seems to be taking a lot of time. For example, why can't a subcomputation be forked, return the result in shared memory, and then die?
  • Is there a better way (arrows, applicative) to express parallelism?

Solution

The answer is pretty easy: Because you have at no point introduced parallelism. Eval is just a monad to order computations, you have to ask for things to be executed in parallel manually. What you probably want is:

do xr <- rpar $ runEval $ mergeSort' x
   yr <- rseq $ runEval $ mergeSort' y
   rseq (merge xr yr)

This will make Haskell actually create a spark for the first computation, instead of trying to evaluate it on the spot.

Standard tips also kind-of apply:

  1. The result should be evaluated deeply (e.g. using evalTraversable rseq). Otherwise you will only force the head of the tree, and the bulk of the data will just be returned unevaluated.
  2. Just sparking everything will most likely eat up any gains. It would be a good idea to introduce a parameter that stops sparking at lower recursion levels.
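The two tips above can be sketched with nothing but base's `par`/`pseq` from GHC.Conc. This is a minimal illustration, not the answer's code: `psort` and `depth` are hypothetical names, the depth parameter stops sparking at lower recursion levels (tip 2), and forcing each sorted sublist's spine with `length` makes a spark do real work instead of stopping at the first cons cell (the spirit of tip 1).

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical depth-limited parallel merge sort sketch. Below depth 0
-- it falls back to purely sequential sorting, so only the top few levels
-- of the divide-and-conquer tree create sparks.
psort :: Ord a => Int -> [a] -> [a]
psort _ []  = []
psort _ [x] = [x]
psort depth xs
    | depth <= 0 = merge ls rs
    -- Spark the left half; forcing the spines via `length` drives all the
    -- recursive merges (and their comparisons) before the final merge.
    | otherwise  = (length ls `par` length rs) `pseq` merge ls rs
  where
    (l, r) = splitAt (length xs `div` 2) xs
    ls = psort (depth - 1) l
    rs = psort (depth - 1) r
    merge as [] = as
    merge [] bs = bs
    merge (a:as) (b:bs)
        | a <= b    = a : merge as (b:bs)
        | otherwise = b : merge (a:as) bs
```

Compiled with `-threaded` and run with `+RTS -N`, the sparks from the top levels can be picked up by idle capabilities; without `-threaded` the code still runs correctly, just sequentially.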

Edit: The following actually doesn't apply anymore after the question edit

But the worst part last: Your algorithm as you state it is very flawed. Your top-level seq only forces the first cons-cell of the list, which allows GHC to use laziness to great effect. It will never actually construct the result list, just plow through all the elements in a search for the minimum one (that's not even strictly needed, but GHC only produces the cell after the minimum is known).

So don't be surprised when performance actually drops sharply when you start introducing parallelism under the assumptions that you need the whole list at some point in the program...

Edit 2: Some more answers to the edits

The biggest problem with your program is probably that it is using lists. If you want to make more than a toy example, consider at least using (unboxed) arrays. If you want to go into serious number-crunching, maybe consider a specialised library like repa.
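To make the array suggestion concrete, here is a hedged sketch (not from the answer) of just the merge step over unboxed arrays, using the `array` package that ships with GHC. `mergeU` is a hypothetical name; the point is that the output is filled into a single unboxed buffer in ST instead of being built as a lazy cons-list, which produces far less garbage.

```haskell
import Data.Array.Unboxed (UArray, listArray, elems, bounds, (!))
import Data.Array.ST (runSTUArray, newArray_, writeArray)
import Data.Ix (rangeSize)

-- Merge two sorted unboxed arrays into a new unboxed array. All writes
-- happen into one mutable STUArray, frozen once at the end by runSTUArray.
mergeU :: UArray Int Int -> UArray Int Int -> UArray Int Int
mergeU a b = runSTUArray $ do
    let na = rangeSize (bounds a)
        nb = rangeSize (bounds b)
    out <- newArray_ (0, na + nb - 1)
    let go i j k
          | i == na && j == nb = pure out
          -- Take from `a` when `b` is exhausted or `a`'s head is smaller.
          | j == nb || (i < na && a ! i <= b ! j) =
              writeArray out k (a ! i) >> go (i + 1) j (k + 1)
          | otherwise =
              writeArray out k (b ! j) >> go i (j + 1) (k + 1)
    go 0 0 0
```

A full array-based sort would recurse on index ranges the same way the list version recurses on sublists; the merge above is the part where avoiding list allocation pays off most.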

On "Further Discussion":

  • The colors stand for different GC states, I can't remember which. Try to look at the event log for the associated event.

  • The way to "sidestep" garbage collection is to not produce so much garbage in the first place, e.g. by using better data structures.

  • Well, if you are looking for an inspiration on robust parallelization it might be worthwhile to have a look at monad-par, which is relatively new but (I feel) less "surprising" in its parallel behaviour.

With monad-par, your example might become something like:

  do xr <- spawn $ mergeSort' x
     yr <- spawn $ mergeSort' y
     merge <$> get xr <*> get yr

So here the get actually forces you to specify the join points - and the library does the required deepseq automatically behind the scenes.
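Filling in the fragment above, a complete depth-limited version might look like the following. This is a sketch assuming the monad-par package (Control.Monad.Par) is installed; `parMergeSort` and `depth` are illustrative names, not from the answer.

```haskell
import Control.Monad.Par (Par, runPar, spawn, get)

-- Hypothetical depth-limited merge sort in the Par monad. `spawn` forks
-- a sub-sort and `get` is the explicit join point; monad-par deepseqs
-- spawned results behind the scenes, so no manual rdeepseq is needed.
parMergeSort :: Int -> [Int] -> [Int]
parMergeSort depth xs = runPar (go depth xs)
  where
    go _ []  = pure []
    go _ [x] = pure [x]
    go d ys
        -- Below the depth cutoff, sort sequentially inside Par.
        | d <= 0    = merge <$> go 0 l <*> go 0 r
        | otherwise = do
            xr <- spawn (go (d - 1) l)
            yr <- spawn (go (d - 1) r)
            merge <$> get xr <*> get yr
      where
        (l, r) = splitAt (length ys `div` 2) ys
    merge as [] = as
    merge [] bs = bs
    merge (a:as) (b:bs)
        | a <= b    = a : merge as (b:bs)
        | otherwise = b : merge (a:as) bs
```

As in the Eval version, the depth parameter keeps the number of spawned tasks bounded; unlike sparks, work spawned in monad-par is never silently pruned.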
