Apache Spark difference between two RDDs


Problem Description

Say I have this example job (in Groovy w/ the Java API):

  def set1 = []
  def set2 = []
  0.upto(10) { set1 << it }
  8.upto(20) { set2 << it }
  def rdd1 = context.parallelize(set1)
  def rdd2 = context.parallelize(set2)

  // What next?

How do I get a set that is the delta between the two? I know that union can create an RDD that has all of the data in those RDDs, but how do I do the opposite of that?

Solution

If you just want a set subtraction, subtract would be the answer. If you want the "outer" collection (elements that appear in exactly one of the two RDDs, i.e. the symmetric difference), try:

  rdd1.subtract(rdd2).union(rdd2.subtract(rdd1))
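As a minimal sketch of what that expression computes, here is the same pattern with plain Python lists standing in for RDDs (an analogy only, so no Spark installation is needed; the local subtract helper mimics the semantics of RDD.subtract):

```python
# Plain-Python analogy of the Spark answer above.
set1 = list(range(0, 11))   # 0..10, like 0.upto(10) { set1 << it }
set2 = list(range(8, 21))   # 8..20, like 8.upto(20) { set2 << it }

def subtract(a, b):
    """Mimic RDD.subtract: keep elements of a that do not appear in b."""
    b_lookup = set(b)
    return [x for x in a if x not in b_lookup]

# rdd1.subtract(rdd2).union(rdd2.subtract(rdd1))
delta = subtract(set1, set2) + subtract(set2, set1)
print(sorted(delta))  # elements in exactly one input: 0..7 and 11..20
```

The overlap 8..10 drops out of both subtractions, so only the elements unique to each side survive, which is exactly the delta the question asks for.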


