Comparing two RDDs

Question

I have two RDD[Array[String]]; let's call them rdd1 and rdd2. I would like to create a new RDD containing just the entries of rdd2 that are not in rdd1 (based on a key). I use Spark with Scala via IntelliJ.

I grouped rdd1 and rdd2 by a key (I will compare just the keys of the two RDDs):

val rdd1Grouped = rdd1.groupBy(line => line(0))
val rdd2Grouped = rdd2.groupBy(line => line(0))

Then I used leftOuterJoin:

val output = rdd1Grouped.leftOuterJoin(rdd2Grouped).collect {
  case (k, (v, None)) => (k, v)
}

but this doesn't seem to give the correct result.

What's wrong with it? Any suggestions?

Example of the RDDs (each line is an Array[String], of course):

rdd1                        rdd2                  output (in some form)

1,18/6/2016               2,9/6/2016                  2,9/6/2016
1,18/6/2016               2,9/6/2016 
1,18/6/2016               2,9/6/2016
1,18/6/2016               2,9/6/2016
1,18/6/2016               1,20/6/2016
3,18/6/2016               1,20/6/2016 
3,18/6/2016               1,20/6/2016
3,18/6/2016
3,18/6/2016
3,18/6/2016

In this case I want to add just the entry "2,9/6/2016", because the key "2" is not in rdd1.

Recommended answer

"new RDD containing just the entries of rdd2 not in rdd1"

A left outer join retains all keys of rdd1 and appends the values of rdd2 with matching keys (None where rdd2 has no match). So a left/outer join is clearly not the solution here.
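To make the join behaviour concrete, here is a minimal sketch on a cut-down version of the sample data. The object name `LeftOuterJoinSketch`, the helper `joinSample`, and the local master setting are illustrative assumptions, not part of the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LeftOuterJoinSketch {
  // Hypothetical helper: joins a tiny keyed sample and collects the result.
  def joinSample(sc: SparkContext): Array[(String, (String, Option[String]))] = {
    val left  = sc.parallelize(Seq(("1", "18/6/2016"), ("3", "18/6/2016")))
    val right = sc.parallelize(Seq(("2", "9/6/2016"), ("1", "20/6/2016")))
    // Every key of `left` survives; key "2", which exists only in `right`, is dropped.
    left.leftOuterJoin(right).collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for experimentation.
    val conf = new SparkConf().setAppName("left-outer-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try joinSample(sc).foreach(println)
    finally sc.stop()
  }
}
```

The collected pairs carry keys "1" (with `Some("20/6/2016")`) and "3" (with `None`); the order is not guaranteed. Key "2" never appears, which is why filtering this join on `None` cannot yield the rdd2-only entries.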

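rdd2Grouped.subtractByKey(rdd1Grouped) fits your case: subtractByKey keeps the pairs of the RDD it is called on whose keys do not occur in the argument, so calling it on rdd2Grouped returns the entries of rdd2 whose key is not in rdd1.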
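A minimal sketch of the subtractByKey approach, assuming the key is the first column as in the question. The names `SubtractByKeySketch` and `rowsWithNewKeys` are hypothetical, and keying with `keyBy` instead of `groupBy` is a choice made here so that each row stays separate. Note that the receiver of `subtractByKey` is the RDD whose entries are kept.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SubtractByKeySketch {
  // Hypothetical helper: keeps the rows of `right` whose key (column 0)
  // never occurs as a key in `left`.
  def rowsWithNewKeys(left: RDD[Array[String]],
                      right: RDD[Array[String]]): RDD[Array[String]] = {
    val leftKeyed  = left.keyBy(row => row(0))
    val rightKeyed = right.keyBy(row => row(0))
    rightKeyed.subtractByKey(leftKeyed).values
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for experimentation.
    val conf = new SparkConf().setAppName("subtract-by-key-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq(Array("1", "18/6/2016"), Array("3", "18/6/2016")))
      val rdd2 = sc.parallelize(Seq(Array("2", "9/6/2016"), Array("1", "20/6/2016")))
      // Only the row with key "2" remains.
      rowsWithNewKeys(rdd1, rdd2).collect().foreach(row => println(row.mkString(",")))
    } finally sc.stop()
  }
}
```

If duplicate rows in rdd2 should collapse to one entry per key, as in the "output" column of the example, grouping first with `groupBy` as in the question would work with the same `subtractByKey` call.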

P.S.: Also note that if rdd1 is the smaller one, it is better to broadcast it. That way, only the second RDD is streamed at the time of the subtract.
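One way to read the broadcast suggestion is to collect the distinct keys of the smaller RDD on the driver, broadcast them as a set, and filter the larger RDD against it. The sketch below assumes that key set fits comfortably in memory; `BroadcastKeysSketch` and `rowsWithNewKeys` are hypothetical names, and this is a substitute for `subtractByKey`, not something the answer spells out.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastKeysSketch {
  // Hypothetical helper: keeps the rows of `large` whose key (column 0)
  // is not among the keys of `small`. Assumes the keys of `small` fit in memory.
  def rowsWithNewKeys(sc: SparkContext,
                      small: RDD[Array[String]],
                      large: RDD[Array[String]]): RDD[Array[String]] = {
    val smallKeys: Set[String] = small.map(row => row(0)).distinct().collect().toSet
    val keysBc = sc.broadcast(smallKeys)
    // A plain filter: `large` is scanned in place and is not shuffled.
    large.filter(row => !keysBc.value.contains(row(0)))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for experimentation.
    val conf = new SparkConf().setAppName("broadcast-keys-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq(Array("1", "18/6/2016"), Array("3", "18/6/2016")))
      val rdd2 = sc.parallelize(Seq(Array("2", "9/6/2016"), Array("1", "20/6/2016")))
      rowsWithNewKeys(sc, rdd1, rdd2).collect().foreach(row => println(row.mkString(",")))
    } finally sc.stop()
  }
}
```

Unlike the `subtractByKey` version, this keeps every matching row of the larger RDD without a shuffle, at the cost of holding the smaller RDD's key set on the driver and on each executor.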
