Scala ParArray Sorting


Problem Description


How do you sort a ParArray collection in ascending order, such as

ParArray(1,3,2)

Or else, which parallel collections may be more suitable for this purpose?

Update

How to implement a parallel algorithm on ParArray that may prove more efficient than casting to a non-parallel collection for sequential sorting?

Solution

How to implement a parallel algorithm on ParArray that may prove more efficient than casting to a non-parallel collection for sequential sorting?

My first observation would be that there doesn't seem to be much of a performance penalty for "converting" parallel arrays to sequential and back:

import scala.util.Random

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block    // call-by-name
  val t1 = System.nanoTime()
  val diff: Long = t1 - t0
  println(s"Elapsed time: ${diff * 1.0 / 1E9}s")
  result
}

def main(args: Array[String]): Unit = {
  val size: Int = args.headOption.map(_.toInt).getOrElse(1000000)
  val input = Array.fill(size)(Random.nextInt())
  val arrayCopy: Array[Int] = Array.ofDim(size)
  input.copyToArray(arrayCopy)
  time { input.sorted }                                  // sequential baseline
  val parArray = arrayCopy.par
  val result = time { parArray.seq.sorted.toArray.par }  // par -> seq -> sort -> par
}

gives

> run 1000000
[info] Running Runner 1000000
Elapsed time: 0.344659236s
Elapsed time: 0.321363896s

For all Array sizes I tested, the results are very similar and usually slightly in favor of the second expression. So in case you were worried that converting to a sequential collection and back would kill the performance gains achieved on other operations - I don't think you should be.

When it comes to utilizing Scala's parallel collections to achieve parallel sorting that in some cases would perform better than the default - I don't think there's an obvious good way of doing that, but it wouldn't hurt to try:

What I thought should work is splitting the input array into as many subarrays as there are cores in your machine (preferably without any unnecessary copying) and sorting the parts concurrently. Afterwards the sorted parts can be merged together, as in merge sort. Here's how the code might look:

val maxThreads = 8 // for simplicity we're not configuring the thread pool explicitly
val groupSize: Int = size / maxThreads + 1
val ranges: IndexedSeq[(Int, Int)] =
  (0 until maxThreads).map(i => (i * groupSize, math.min((i + 1) * groupSize, size)))
time {
  // sort each range in parallel; note that sortWith on a view returns a
  // sorted copy rather than sorting `input` in place
  ranges.par.foreach { case (from, to) =>
    input.view(from, to).sortWith(_ < _)
  }
  //TODO merge the parts together
}

Unfortunately there's this old bug that prevents us from doing anything fun with views. There doesn't seem to be any Scala built-in mechanism (other than views) for sorting just a part of a collection. This is why I tried coding my own merge sort algorithm with the signature def mergeSort(a: Array[Int], r: Range): Unit to use it as described above. Unfortunately it turned out to be more than 4 times slower than Scala's Array.sorted method, so I don't think it can be used to gain efficiency over the standard sequential approach.
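One workaround worth noting: for primitive arrays, java.util.Arrays.sort(a, from, to) sorts a slice in place, which sidesteps the view limitation, and java.util.Arrays.parallelSort (Java 8+) already implements a parallel merge sort on the common fork-join pool. Here's a minimal sketch of the split-sort-merge idea (the two slice sorts run sequentially here for clarity, but they're independent and could be submitted to separate threads; the object name is illustrative):

```scala
import java.util.Arrays
import scala.util.Random

object ParallelSortSketch {
  // Merge the two adjacent sorted slices a[0, mid) and a[mid, a.length)
  // into a new array.
  def merge(a: Array[Int], mid: Int): Array[Int] = {
    val out = new Array[Int](a.length)
    var i = 0; var j = mid; var k = 0
    while (i < mid && j < a.length) {
      if (a(i) <= a(j)) { out(k) = a(i); i += 1 } else { out(k) = a(j); j += 1 }
      k += 1
    }
    while (i < mid)      { out(k) = a(i); i += 1; k += 1 }
    while (j < a.length) { out(k) = a(j); j += 1; k += 1 }
    out
  }

  def main(args: Array[String]): Unit = {
    val input = Array.fill(1000000)(Random.nextInt())

    // Split-sort-merge: Arrays.sort(a, from, to) sorts each half in place.
    val a = input.clone()
    val mid = a.length / 2
    Arrays.sort(a, 0, mid)
    Arrays.sort(a, mid, a.length)
    val merged = merge(a, mid)

    // Or just let the JDK do it: parallelSort is a parallel merge sort (Java 8+).
    val b = input.clone()
    Arrays.parallelSort(b)

    assert(merged.sameElements(b))
  }
}
```

Whether this actually beats Array.sorted depends on array size and core count, so it's worth measuring with the time helper above.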

If I understand your situation correctly, your dataset fits in memory, so using something like Hadoop and MapReduce would be premature. What you might try though is Apache Spark - other than adding a dependency, you wouldn't need to set up any cluster or install anything for Spark to use all cores of your machine in a basic configuration. Its RDDs are conceptually similar to Scala's parallel collections, but with additional functionality. And they (in a way) support parallel sorting.
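For completeness, a minimal local-mode sketch of that (assuming a Spark dependency on the classpath; the object and app names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkSortSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process using all available cores - no cluster needed.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("sort"))
    try {
      val rdd = sc.parallelize(Seq(1, 3, 2))
      // sortBy shuffles the data into range partitions and sorts each
      // partition in parallel.
      val sorted = rdd.sortBy(identity).collect()
      println(sorted.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```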
