In Apache Spark, why does RDD.union not preserve the partitioner?


Question


As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operation, so they are usually customized. I was experimenting with the following code:

import org.apache.spark.HashPartitioner

// rdd1 gets an explicit HashPartitioner over 10 partitions.
val rdd1 =
  sc.parallelize(1 to 50).keyBy(_ % 10)
    .partitionBy(new HashPartitioner(10))
// rdd2 is keyed but has no explicit partitioner.
val rdd2 =
  sc.parallelize(200 to 230).keyBy(_ % 13)

val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)  // Some(HashPartitioner)

val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)        // None


I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't; it always reverts to the default. This is counterintuitive, as we usually assume that a PairRDD should use its first element as the partition key. Is there a way to "force" Spark to merge two PairRDDs so that they use the same partition key?

Answer


union is a very efficient operation because it doesn't move any data around. If rdd1 has 10 partitions and rdd2 has 20 partitions, then rdd1.union(rdd2) will have 30 partitions: the partitions of the two RDDs placed one after the other. This is just a bookkeeping change; there is no shuffle.
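
To make the bookkeeping concrete, here is a minimal sketch (assuming a live SparkContext sc, and two hypothetical pair RDDs a and b with different HashPartitioners) showing that the partition counts simply add up and the partitioner is dropped:

import org.apache.spark.HashPartitioner

// Two pair RDDs with different partitioners: 10 and 20 partitions.
val a = sc.parallelize(1 to 100).keyBy(_ % 10).partitionBy(new HashPartitioner(10))
val b = sc.parallelize(1 to 100).keyBy(_ % 10).partitionBy(new HashPartitioner(20))

val u = a.union(b)
println(u.partitions.length)  // 30: the 10 partitions of a followed by the 20 of b
println(u.partitioner)        // None: no single partitioner describes the combined layout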


But it necessarily discards the partitioner. A partitioner is constructed for a given number of partitions, and the resulting RDD has a number of partitions that is different from both rdd1 and rdd2.
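
A rough illustration of why no partitioner can survive, using the hypothetical a and b from the sketch above: the same key ends up in two different physical partitions of the union, so no HashPartitioner over 30 partitions (nor over 10 or 20) describes where a given key lives.

import org.apache.spark.HashPartitioner

val p10 = new HashPartitioner(10)
val p20 = new HashPartitioner(20)
// Key 7 sits in partition 7 of a (union partition 7)
// and in partition 7 of b (union partition 10 + 7 = 17).
println(p10.getPartition(7))  // 7
println(p20.getPartition(7))  // 7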


After taking the union, you can run repartition to shuffle the data and organize it by key.
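
For example, on a pair RDD you can re-establish a partitioner explicitly after the union. A sketch (using partitionBy rather than plain repartition, so that a partitioner is actually set on the result; the count of 10 is arbitrary):

import org.apache.spark.HashPartitioner

// Shuffle the unioned data back into 10 hash partitions keyed by the pair's key.
val reorganized = unioned.partitionBy(new HashPartitioner(10))
println(reorganized.partitioner)  // Some(HashPartitioner) -- restored, at the cost of a shuffle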

There is one exception to the above. If rdd1 and rdd2 have the same partitioner (with the same number of partitions), union behaves differently. It will join the partitions of the two RDDs pairwise, giving the result the same number of partitions as each of the inputs had. This may involve moving data around (if the partitions were not co-located) but will not involve a shuffle. In this case the partitioner is retained. (The code for this is in PartitionerAwareUnionRDD.scala: https://github.com/apache/spark/blob/v1.3.1/core/src/main/scala/org/apache/spark/rdd/PartitionerAwareUnionRDD.scala)
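
A sketch of that exception, with two hypothetical RDDs built with the same HashPartitioner (again assuming a live SparkContext sc):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(10)
val left  = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(p)
val right = sc.parallelize(51 to 100).keyBy(_ % 10).partitionBy(p)

val u = left.union(right)
println(u.partitions.length)  // 10: partitions are merged pairwise, not concatenated
println(u.partitioner)        // Some(HashPartitioner) -- the shared partitioner is kept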

