Comparing two RDDs
Question
I have two RDD[Array[String]], let's call them rdd1 and rdd2. I want to create a new RDD containing just the entries of rdd2 whose key is not in rdd1. I use Spark with Scala via IntelliJ.
I grouped rdd1 and rdd2 by a key (I will compare just the keys of the two RDDs):
val rdd1Grouped = rdd1.groupBy(line => line(0))
val rdd2Grouped = rdd2.groupBy(line => line(0))
Then I used a leftOuterJoin:
val output = rdd1Grouped.leftOuterJoin(rdd2Grouped).collect {
case (k, (v, None)) => (k, v)
}
but this doesn't seem to give the correct result.
What's wrong with it? Any suggestions?
Example of the RDDs (every line is an Array[String], of course):
rdd1           rdd2           output (in some form)
1,18/6/2016    2,9/6/2016     2,9/6/2016
1,18/6/2016    2,9/6/2016
1,18/6/2016    2,9/6/2016
1,18/6/2016    2,9/6/2016
1,18/6/2016    1,20/6/2016
3,18/6/2016    1,20/6/2016
3,18/6/2016    1,20/6/2016
3,18/6/2016
3,18/6/2016
3,18/6/2016
In this case I want to keep just the entry "2,9/6/2016", because the key "2" is not in rdd1.
Recommended answer
new RDD containing just the entries of rdd2 not in rdd1
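A left outer join retains every key of the RDD on its left side (here rdd1) and attaches the values from rdd2 with matching keys, or None where there is no match. Filtering rdd1Grouped.leftOuterJoin(rdd2Grouped) on None therefore yields the keys that exist only in rdd1, which is the opposite of what you want, so this join is not the solution.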
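rdd2Grouped.subtractByKey(rdd1Grouped)
should work in your case. Note the order: subtractByKey keeps the pairs of the RDD it is called on whose keys do not appear in the argument, so rdd2Grouped comes first.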
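As a minimal sketch of the subtractByKey approach on data shaped like the question's sample: the object name, the helper onlyInSecond and the local SparkSession setup are assumptions made for illustration, and keyBy is used instead of groupBy so that each original row is kept as-is.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SubtractByKeySketch {
  // Hypothetical helper: rows of `second` whose key (column 0) never occurs in `first`.
  def onlyInSecond(first: RDD[Array[String]], second: RDD[Array[String]]): RDD[Array[String]] = {
    val firstByKey = first.keyBy(line => line(0))
    val secondByKey = second.keyBy(line => line(0))
    // subtractByKey keeps pairs of the receiver whose key is absent from the argument
    secondByKey.subtractByKey(firstByKey).values
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("subtractByKey-sketch").getOrCreate()
    val sc = spark.sparkContext

    val rdd1 = sc.parallelize(Seq(Array("1", "18/6/2016"), Array("3", "18/6/2016")))
    val rdd2 = sc.parallelize(Seq(Array("2", "9/6/2016"), Array("1", "20/6/2016")))

    // prints 2,9/6/2016
    onlyInSecond(rdd1, rdd2).collect().foreach(line => println(line.mkString(",")))

    spark.stop()
  }
}
```

If rdd2 contains several rows with key "2", all of them are returned, because only the keys are compared.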
P.S.: Also note that if rdd1 is small enough to fit in memory, it may be better to broadcast it (or just its keys). That way, only the second RDD is streamed at the time of the subtract.
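One possible reading of the broadcast suggestion is sketched below. It does not use subtractByKey at all: it collects the distinct keys of the smaller RDD, broadcasts them as a Set, and filters the larger RDD against that set. The object and helper names are hypothetical, and the sketch assumes the distinct keys of the first RDD fit comfortably in driver and executor memory.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object BroadcastKeyFilterSketch {
  // Hypothetical helper: rows of `second` whose key (column 0) is not among the keys of `first`.
  def onlyInSecond(first: RDD[Array[String]], second: RDD[Array[String]]): RDD[Array[String]] = {
    // Assumption: the distinct keys of `first` are few enough to collect on the driver
    val firstKeys: Set[String] = first.map(line => line(0)).distinct().collect().toSet
    val keysBroadcast = second.sparkContext.broadcast(firstKeys)
    // Filter on the executors, so `second` is not shuffled
    second.filter(line => !keysBroadcast.value.contains(line(0)))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("broadcast-filter-sketch").getOrCreate()
    val sc = spark.sparkContext

    val rdd1 = sc.parallelize(Seq(Array("1", "18/6/2016"), Array("3", "18/6/2016")))
    val rdd2 = sc.parallelize(Seq(Array("2", "9/6/2016"), Array("1", "20/6/2016")))

    // prints 2,9/6/2016
    onlyInSecond(rdd1, rdd2).collect().foreach(line => println(line.mkString(",")))

    spark.stop()
  }
}
```

The trade-off is that the key set has to pass through the driver, so this only makes sense when rdd1 has a modest number of distinct keys.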