不能与数目不等的分区拉链RDDS [英] Can't zip RDDs with unequal numbers of partitions
问题描述
现在我有3个RDDS是这样的:
Now I have 3 RDDs like this:
RDD1集:
1 2
3 4
5 6
7 8
9 10
RDD2:
11 12
13 14
rdd3:
15 16
17 18
19 20
和我想这样做:
rdd1.zip(rdd2.union(rdd3))
和我想要的结果是这样的:
and I want the result is like this:
1 2 11 12
3 4 13 14
5 6 15 16
7 8 17 18
9 10 19 20
但我有一个这样的例外:
but I have an exception like this:
在异常线程主java.lang.IllegalArgumentException异常:不能与数目不等的分区拉链RDDS
Exception in thread "main" java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
有人告诉我,我可以做到这一点没有例外:
someone tell me I can do this without exception:
rdd1.zip(rdd2.union(rdd3).repartition(1))
但好像它是一个小的成本。所以我想知道是否有其他的方法来解决这个问题。
But it seems like it is a little cost. So I want to know if there is other ways to solve this problem.
推荐答案
我不知道你所说的成本的意思,但你说得对怀疑重新分区(1)
不是合适的解决方案。它会重新分区RDD到单个分区。
I'm not sure what you mean by "cost", but you're right to suspect that repartition(1)
is not the right solution. It will repartition the RDD to a single partition.
- 如果您的数据不适合在一台机器上,这将失败。
- 它只能当
RDD1集
有一个分区。当你有更多的数据,这将很可能不再成立。 -
再分配
执行的随机的,所以你的数据最终会下令不同。
- If your data does not fit on a single machine, this will fail.
- It only works if
rdd1
has a single partition. When you have more data this will probably no longer hold. repartition
performs a shuffle, so your data can end up ordered differently.
我认为正确的解决办法是放弃使用拉链
,因为你有可能不能确保分区将匹配。创建密钥,并使用加入
而不是:
I think the right solution is to give up on using zip
, because you likely cannot ensure that the partitioning will match up. Create a key and use join
instead:
val indexedRDD1 = rdd1.zipWithIndex.map { case (v, i) => i -> v }
val indexedRDD2 = rdd2.zipWithIndex.map { case (v, i) => i -> v }
val offset = rdd2.count
val indexedRDD3 = rdd3.zipWithIndex.map { case (v, i) => (i + offset) -> v }
val combined =
indexedRDD1.leftOuterJoin(indexedRDD2).leftOuterJoin(indexedRDD3).map {
case (i, ((v1, v2Opt), v3Opt)) => i -> (v1, v2Opt.getOrElse(v3Opt.get))
}
这会工作,不管分区。如果你喜欢,你可以排序的结果,并在年底删除索引:
This will work no matter the partitioning. If you like, you can sort the result and remove the index at the end:
val unindexed = combined.sortByKey().values
这篇关于不能与数目不等的分区拉链RDDS的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!