Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?
Question
Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?
Answer
No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala (https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala):
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_ <: Product2[K, _]] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](rdd, part, serializer)
    }
  }
}
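The whole decision hinges on the equality check rdd.partitioner == Some(part). Spark's HashPartitioner considers two instances equal when they have the same number of partitions, which is why two independently constructed RDDs can still count as co-partitioned. A minimal stand-alone sketch of that logic (SimpleHashPartitioner is a hypothetical stand-in, not Spark's real class, so this runs without a Spark installation):

```scala
// Sketch of the equality semantics Spark's HashPartitioner provides.
// A case class gives us structural equality: two partitioners with the
// same partition count compare equal, so the == Some(part) check passes.
case class SimpleHashPartitioner(numPartitions: Int) {
  // Non-negative modulo placement, mirroring how hash partitioning
  // assigns a key to a partition.
  def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

val p1 = SimpleHashPartitioner(8)
val p2 = SimpleHashPartitioner(8)

// Equal partition counts => equal partitioners => the OneToOneDependency
// branch above is taken (no shuffle).
println(Some(p1) == Some(p2)) // true

// Different partition counts => not equal => ShuffleDependency branch.
println(Some(p1) == Some(SimpleHashPartitioner(16))) // false

// Under equal partitioners every key lands in the same partition in
// both RDDs, which is what makes a local, shuffle-free join possible.
println((0 until 100).forall(k => p1.getPartition(k) == p2.getPartition(k))) // true
```

In other words, co-partitioning is purely a property of the partitioner objects comparing equal; it says nothing about where those partitions physically live, which is the caveat discussed next.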
Note, however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.