Join two RDDs in Spark


Problem Description

I have two RDDs: one has just a single column, the other has two. To join the two RDDs on their keys I added a dummy value of 0. Is there a more efficient way of doing this with join?

val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("ml-100k/u.item")

// u.data is tab-separated: user id, movie id, rating, timestamp.
// Pair each movie id (column 1) with a dummy value of 0.
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1), 0))
val test = moviesid.map(x => x._1)
// u.item is pipe-separated: movie id first, then the title.
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0), x(1)))
val joined = movienames.join(moviesid).distinct()

Edit:

Let me restate this question in SQL. Say, for example, I have table1 (movieid) and table2 (movieid, moviename). In SQL we would write something like:

select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ...

Here in SQL, table1 has only one column while table2 has two, and the join still works. Can Spark join on keys from the two RDDs in the same way?

Answer

The join operation is defined only on PairwiseRDDs, which are quite different from a relation/table in SQL. Each element of a PairwiseRDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects, as long as the key provides a meaningful hashCode.
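For illustration, here is a minimal sketch of how join pairs up values that share a key (the RDD names and contents below are made up):

// Two PairwiseRDDs; join keeps only the keys present on both sides.
val left = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, "x"), (3, "y")))

left.join(right).collect()  // Array((1, ("a", "x")))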

If you want to think about this in a SQL-ish way, you can consider the key to be everything that goes into the ON clause and the value to be the selected columns:

SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key

While these approaches look similar at first glance and you can express one using the other, there is one fundamental difference. When you look at a SQL table and ignore constraints, all columns belong to the same class of objects, whereas the key and the value in a PairwiseRDD each have a clear meaning.

Going back to your problem: to use join you need both a key and a value. Arguably, using a null singleton would be much cleaner than using 0 as a placeholder, but there is really no way around it.
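As a sketch of that placeholder idea, reusing lines and movienames from the question (moviesidNull is a hypothetical name; null works because the dummy value is never read):

// Same shape as the original code, but with null instead of 0 as the dummy value.
val moviesidNull = lines.map(x => x.split("\t")).map(x => (x(1), null))

// After the join, drop the placeholder and keep only the id and the name.
movienames.join(moviesidNull).map { case (id, (name, _)) => (id, name) }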

For small data you can use filter in a similar way to a broadcast join:

// Collect the movie ids (column 1 of u.data) into a Set on the driver
// and broadcast it to the executors.
val moviesidBD = sc.broadcast(
  lines.map(x => x.split("\t")).map(x => x(1)).collect.toSet)

// Keep only the movies whose id appears in the broadcast set.
movienames.filter { case (id, _) => moviesidBD.value contains id }
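Note that collect.toSet pulls all of the ids back to the driver, so this variant is only appropriate when the id set comfortably fits in memory; the payoff is that the filter runs map-side, with no shuffle.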

But if you really want SQL-ish joins then you should simply use Spark SQL:

// Spark 1.x style: toDF requires the SQLContext implicits in scope.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val movieIdsDf = lines
  .map(x => x.split("\t"))
  .map(a => Tuple1(a(1)))  // column 1 of u.data holds the movie id
  .toDF("id")

val movienamesDf = movienames.toDF("id", "name")

// Add an optional join type qualifier as a third argument if needed.
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
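To mirror the count(1) / group by from the SQL sketch in the question, one possible follow-up on the joined DataFrame looks like this (a sketch; the grouping columns are assumptions):

// Group the joined rows and count occurrences per movie.
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
  .groupBy(movienamesDf("id"), movienamesDf("name"))
  .count()
  .show()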
