Scala中平行集合的效率/可伸缩性(图形) [英] Efficiency/scalability of parallel collections in Scala (graphs)

查看:197
本文介绍了Scala中平行集合的效率/可伸缩性(图形)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

因此,我一直在使用Scala中的并行集合来处理我正在处理的图形项目,我已经定义了图类的基础知识,它目前使用 scala.collection .mutable.HashMap 其中键 Int ,值为 ListBuffer [Int] (邻接表)。 (编辑:这已改为 ArrayBuffer [Int]



几个月后我做了类似的事情在C ++中,使用 std :: vector< int,std :: vector< int>>



我现在要做的是在图形中所有顶点对之间运行一个度量,所以在C ++中,我做了这样的事情:

 < (std :: vector< int> :: iterator iter = myVec.begin(); iter!= myVec.end()的code> // myVec = std :: vector< int> ; ++ iter){
for(std :: vector< int> :: iterator iter2 = myVec.begin();
iter2!= myVec.end(); ++ iter2){
/ *在* iter和* iter2 * /
}
}



之间运行算法


$ b

  // vertexList是在Scala中并行化(或试图):一个List [Int](NOW CHANGED TO Array [Int]  - 见下面)
vertexList.par.foreach(u =>
vertexList.foreach(v =>
/ * Run算法在u和v之间* /


C ++版本是显然是单线程的,Scala版本有 .par ,所以它使用并行集合,并且在8个内核(同一台机器)上是多线程的。然而,C ++版本在大约3天的时间内处理了305,570对,而到目前为止Scala版本在17个小时内只处理了23,573对。



假设我做了< a href =http://www.wolframalpha.com/input/?i=%28%2850%20%2a%206115%29%20/%20%283%2a24%29%29%20/%20% 2823573/17%29rel =nofollow> math 正确,单线程C ++版本比Scala版本快大约3倍。 Scala真的比C ++慢得多,还是我完全错误地使用Scala(我最近才开始 - 我在Scala中编程约300页)?



谢谢!
-kstruct


编辑要使用while循环,我要做些什么。

  // //其中vertexList是一个Array [Int] 
vertexList.par.foreach(u =>
while(i< 0直到vertexList.length){
/ *在u和vertexList之间运行算法(i)* /
}
}

如果你们的意思是在整个事情中使用while循环,那么对于whiles,是否有相当于 .par.foreach 的值?



EDIT2 等一下,代码甚至不对 - 我的不好。有一些 var i 跟踪迭代,那么并不是所有的线程都共享那个 i ?从你的评论中,我看到你更新了一个共享的可变 HashMap


解决方案每个算法的结束都会运行,如果你随机化你的散步,共享的 Random 也是一个争用点。 / p>

我推荐两项更改:


  1. 使用 .map .flatMap 返回一个不可变的集合,而不是修改共享集合。

  2. 使用 ThreadLocalRandom (来自 Akka Java 7 a>)以减少随机数生成器上的争用。
  3. 检查算法的其余部分以获取更多可能的争用点。
  4. 您可以尝试运行内环也是并行的。但是如果不知道你的算法,很难知道这是否会对你有帮助或伤害。幸运的是,运行并行和顺序集合的所有组合非常简单;只需在下面的代码中将 pVertexList vertexList 切换出去。

类似这样的:

  val pVertexList = vertexList.par 
val allResult = for {
u <-p​​VertexList
v <-pVertexList
} yield {
/ *在u和v * /
之间运行算法((u-> v) - >结果)
}

allResult 将是一个 ParVector [((Int,Int),Int)] 。你可以调用 .toMap 来将它转换为 Map


So I've been working with parallel collections in Scala for a graph project I'm working on, I've got the basics of the graph class defined, it is currently using a scala.collection.mutable.HashMap where the key is Int and the value is ListBuffer[Int] (adjacency list). (EDIT: This has since been change to ArrayBuffer[Int]

I had done a similar thing a few months ago in C++, with a std::vector<int, std::vector<int> >.

What I'm trying to do now is run a metric between all pairs of vertices in the graph, so in C++ I did something like this:

// myVec = std::vector<int> of vertices
for (std::vector<int>::iterator iter = myVec.begin(); iter != myVec.end(); ++iter) {
    for (std::vector<int>::iterator iter2 = myVec.begin(); 
        iter2 != myVec.end(); ++iter2) {
        /* Run algorithm between *iter and *iter2 */
    }
}

I did the same thing in Scala, parallelized, (or tried to) by doing this:

// vertexList is a List[Int] (NOW CHANGED TO Array[Int] - see below)
vertexList.par.foreach(u =>
  vertexList.foreach(v =>
    /* Run algorithm between u and v */
  )
)

The C++ version is clearly single-threaded, the Scala version has .par so it's using parallel collections and is multi-threaded on 8 cores (same machine). However, the C++ version processed 305,570 pairs in a span of roughly 3 days, whereas the Scala version so far has only processed 23,573 in 17 hours.

Assuming I did my math correctly, the single-threaded C++ version is roughly 3x faster than the Scala version. Is Scala really that much slower than C++, or am I completely mis-using Scala (I only recently started - I'm about 300 pages into Programming in Scala)?

Thanks! -kstruct

EDIT To use a while loop, do I do something like..

// Where vertexList is an Array[Int]
vertexList.par.foreach(u =>
  while (i <- 0 until vertexList.length) {
    /* Run algorithm between u and vertexList(i) */
  }
}

If you guys mean use a while loop for the entire thing, is there an equivalent of .par.foreach for whiles?

EDIT2 Wait a second, that code isn't even right - my bad. How would I parallelize this using while loops? If I have some var i that keeps track of the iteration, then wouldn't all threads be sharing that i?

解决方案

From your comments, I see that your updating a shared mutable HashMap at the end of each algorithm run. And if you're randomizing your walks, a shared Random is also a contention point.

I recommend two changes:

  1. Use .map and .flatMap to return an immutable collection instead of modifying a shared collection.
  2. Use a ThreadLocalRandom (from either Akka or Java 7) to reduce contention on the random number generator
  3. Check the rest of your algorithm for further possible contention points.
  4. You may try running the inner loop in parallel, too. But without knowing your algorithm, it's hard to know if that will help or hurt. Fortunately, running all combinations of parallel and sequential collections is very simple; just switch out pVertexList and vertexList in the code below.

Something like this:

val pVertexList = vertexList.par
val allResult = for {
  u <- pVertexList
  v <- pVertexList
} yield {
  /* Run algorithm between u and v */
  ((u -> v) -> result)
}

The value allResult will be a ParVector[((Int, Int), Int)]. You may call .toMap on it to convert that into a Map.

这篇关于Scala中平行集合的效率/可伸缩性(图形)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆