Apache Spark difference between two RDDs
Problem description
Say I have this example job (in Groovy w/ Java API):
def set1 = []
def set2 = []
0.upto(10) { set1 << it }
8.upto(20) { set2 << it }
def rdd1 = context.parallelize(set1)
def rdd2 = context.parallelize(set2)
//What next?
How do I get a set that is the delta between the two? I know that union
can create an RDD that has all of the data in those RDDs, but how do I do the opposite of that?
Solution
If you just want a set subtraction, subtract would be an answer. If you want the "outer" collection (the symmetric difference), try:
rdd1.subtract(rdd2).union(rdd2.subtract(rdd1))
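To see what that expression computes, here is a minimal sketch of the same semantics using plain Python sets in place of RDDs (no Spark cluster needed; the variable names mirror the answer above, but these are ordinary sets, not RDDs):

```python
# The two datasets from the question: 0.upto(10) and 8.upto(20)
set1 = set(range(0, 11))
set2 = set(range(8, 21))

# rdd1.subtract(rdd2)  -> elements only in set1
only_in_1 = set1 - set2
# rdd2.subtract(rdd1)  -> elements only in set2
only_in_2 = set2 - set1
# .union(...)          -> combine both one-sided differences
delta = only_in_1 | only_in_2

print(sorted(delta))  # 0 through 7, then 11 through 20; 8-10 overlap and drop out
```

Note that Spark's RDD also offers intersection, so the same delta can be expressed as the union of both RDDs minus their intersection, though the two-subtract form above avoids building the full union first.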