Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?


Question


Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?

Answer

No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala (https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala):

override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_ <: Product2[K, _]] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](rdd, part, serializer)
    }
  }
}
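To see the dependency check above in action, here is a minimal sketch of joining two pre-partitioned RDDs. It assumes a live SparkContext named `sc` (e.g. in `spark-shell`); the sample data and partition count are illustrative:

```scala
import org.apache.spark.HashPartitioner

// Use one shared partitioner instance for both RDDs.
val part = new HashPartitioner(4)

// partitionBy shuffles each RDD once up front; caching preserves
// the partitioned layout for later use.
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()

// Both parents now report rdd.partitioner == Some(part), so
// getDependencies takes the OneToOneDependency branch for each:
// the join itself adds no ShuffleDependency.
val joined = rdd1.join(rdd2)

// The joined RDD keeps the same partitioner.
println(joined.partitioner == Some(part))
```

Note the trade-off: `partitionBy` itself triggers a shuffle, so this pattern pays off when the partitioned RDDs are cached and joined more than once.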


Note however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).


This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.
