在Apache中星火,为什么不RDD.union preserve的分区? [英] In Apache Spark, why does RDD.union not preserve the partitioner?
问题描述
殊不知在星火partitioners有任何广的操作巨大的性能影响,因此它通常在业务定制。我用下面的code试验:
As everyone knows partitioners in Spark have a huge performance impact on any "wide" operations, so it's usually customized in operations. I was experimenting with the following code:
val rdd1 =
sc.parallelize(1 to 50).keyBy(_ % 10)
.partitionBy(new HashPartitioner(10))
val rdd2 =
sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)
val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)
我看到默认协同组()
总是产生一个RDD使用自定义分区,但工会()
不,它总是会恢复到默认值。这是违反直觉的,因为我们通常认为一个PairRDD应该使用它的第一个元素作为分区键。是否有力星火合并2 PairRDDs使用相同的分区键的方式?
I see that by default cogroup()
always yields an RDD with the customized partitioner, but union()
doesn't, it will always revert back to default. This is counterintuitive as we usually assume that a PairRDD should use its first element as partition key. Is there a way to "force" Spark to merge 2 PairRDDs to use the same partition key?
推荐答案
联盟
是一种非常有效的操作,因为它不走动的任何数据。如果 RDD1集
有10个分区和 RDD2
有20个分区,那么 rdd1.union(RDD2)
将有30个分区:二RDDS后对方放的分区。这仅仅是一个记账的变化,没有洗牌。
union
is a very efficient operation, because it doesn't move any data around. If rdd1
has 10 partitions and rdd2
has 20 partitions then rdd1.union(rdd2)
will have 30 partitions: the partitions of the two RDDs put after each other. This is just a bookkeeping change, there is no shuffle.
但不一定它丢弃分区。一分割器被构造为分区中的给定数目。由此产生的RDD有多个分区是从不同的两个 RDD1集
和 RDD2
。
But necessarily it discards the partitioner. A partitioner is constructed for a given number of partitions. The resulting RDD has a number of partitions that is different from both rdd1
and rdd2
.
服用可以运行再分配
来洗牌的数据,并通过关键组织的。合并之后,
After taking the union you can run repartition
to shuffle the data and organize it by key.
有一个例外以上。如果 RDD1集
和 RDD2
具有相同的分区(具有相同的分区数),工会
行为有所不同。它将加入两个RDDS成对的分区,赋予其相同的分区数为每一个输入了。这可能涉及中移动数据(如果分区不位于同一位置),但不涉及洗牌。在这种情况下,分区被保留。 (在code因为这是在<一个href=\"https://github.com/apache/spark/blob/v1.3.1/core/src/main/scala/org/apache/spark/rdd/PartitionerAwareUnionRDD.scala\">PartitionerAwareUnionRDD.scala.)
There is one exception to the above. If rdd1
and rdd2
have the same partitioner (with the same number of partitions), union
behaves differently. It will join the partitions of the two RDDs pairwise, giving it the same number of partitions as each of the inputs had. This may involve moving data around (if the partitions were not co-located) but will not involve a shuffle. In this case the partitioner is retained. (The code for this is in PartitionerAwareUnionRDD.scala.)
这篇关于在Apache中星火,为什么不RDD.union preserve的分区?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!