Apache Spark RDD filter into two RDDs


Question


I need to split an RDD into 2 parts:


one part which satisfies a condition, and another part which does not. I can call filter twice on the original RDD, but that seems inefficient. Is there a way to do what I'm after? I can't find anything in the API or in the literature.

Answer


Spark doesn't support this by default. Filtering on the same data twice isn't that bad if you cache it beforehand, and the filtering itself is quick.
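A minimal sketch of that approach, assuming an existing SparkContext named sc (the variable name is an assumption, as in a Spark shell session):

// A minimal sketch of the cache-then-filter-twice approach.
// Assumes an existing SparkContext named `sc`, as in the Spark shell.
val numbers = sc.parallelize(1 to 100).cache() // cache so both passes reuse the same data

val evens = numbers.filter(_ % 2 == 0) // first pass
val odds  = numbers.filter(_ % 2 != 0) // second pass over the cached data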


If it's really just two different types, you can use a helper method:

import org.apache.spark.rdd.RDD

implicit class RDDOps[T](rdd: RDD[T]) {
  // Split an RDD into the elements that satisfy f and those that don't.
  // Each half is a separate filter pass, so cache the input beforehand.
  def partitionBy(f: T => Boolean): (RDD[T], RDD[T]) = {
    val passes = rdd.filter(f)
    val fails = rdd.filter(e => !f(e)) // Spark doesn't have filterNot
    (passes, fails)
  }
}

val (matches, matchesNot) = sc.parallelize(1 to 100).cache().partitionBy(_ % 2 == 0)


But as soon as you have multiple types of data, just assign each filtered result to its own val.
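For example, a sketch under the assumption that the mixed data arrives as an RDD[Either[Int, String]] (a hypothetical type chosen for illustration); RDD.collect with a partial function yields one typed RDD per branch:

import org.apache.spark.rdd.RDD

// Hypothetical illustration: the input type Either[Int, String] is assumed.
// Each branch is assigned to its own val via RDD.collect(PartialFunction).
val mixed: RDD[Either[Int, String]] =
  sc.parallelize(Seq(Left(1), Right("a"), Left(2), Right("b"))).cache()

val ints: RDD[Int]       = mixed.collect { case Left(i)  => i }
val strings: RDD[String] = mixed.collect { case Right(s) => s }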
