Apache Spark RDD filter into two RDDs
Question
I need to split an RDD into 2 parts:
1 part which satisfies a condition; another part which does not. I can call filter
twice on the original RDD, but that seems inefficient. Is there a way to do what I'm after? I can't find anything in the API or in the literature.
Recommended answer
Spark doesn't support this by default. Filtering on the same data twice isn't that bad if you cache it beforehand, and the filtering itself is quick.
If it's really just two different types, you can use a helper method:
import org.apache.spark.rdd.RDD

implicit class RDDOps[T](rdd: RDD[T]) {
  // Splits an RDD into (elements that pass f, elements that fail f).
  // Both results are lazy, so cache the input RDD beforehand to avoid
  // recomputing it for each of the two filters.
  def partitionBy(f: T => Boolean): (RDD[T], RDD[T]) = {
    val passes = rdd.filter(f)
    val fails = rdd.filter(e => !f(e)) // Spark doesn't have filterNot
    (passes, fails)
  }
}

val (matches, matchesNot) = sc.parallelize(1 to 100).cache().partitionBy(_ % 2 == 0)
But as soon as you have multiple types of data, just assign each filtered RDD to its own val.