What's the difference between join and cogroup in Apache Spark?

Question

What's the difference between join and cogroup in Apache Spark? What's the use case for each method?

Recommended answer

Let me help clarify them. Both are commonly used and important!

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

This is the prototype of join; look at it carefully. For example:

val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
 
scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))

Every key that appears in the final result is common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
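The example above has one value per key. As a small sketch of what happens when a key repeats, the snippet below assumes a spark-shell session (where `sc` is the predefined SparkContext); the names `users` and `orders` and their contents are made up for illustration. For a key present on both sides, join emits one output element for every combination of a left value and a right value.

```scala
// Assumes spark-shell, where `sc` is already defined.
// `users` and `orders` are hypothetical RDDs used only for this illustration.
val users  = sc.makeRDD(Seq(("A", "alice"), ("B", "bob")))
val orders = sc.makeRDD(Seq(("A", "book"), ("A", "pen"), ("C", "lamp")))

// Key "A" has one value on the left and two on the right, so join emits two pairs for it.
// "B" and "C" each appear on one side only, so they are dropped.
// Converted to a Set because the order of collected elements is not guaranteed.
val joined = users.join(orders).collect().toSet
// joined == Set(("A", ("alice", "book")), ("A", ("alice", "pen")))
```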

But cogroup is different:

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

As long as a key appears in at least one of the two RDDs, it will appear in the final result. Let me clarify:

val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)

scala> rdd1.cogroup(rdd2).collect
res0: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())), 
(D,(CompactBuffer(),CompactBuffer(d))), 
(A,(CompactBuffer(1),CompactBuffer(a))), 
(C,(CompactBuffer(3),CompactBuffer(c)))
)

This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result into one row per matching record, it gives you an Iterable of values for each side of every key. What you do with them next is up to you.
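To make the relationship concrete, here is a minimal sketch, again assuming spark-shell with `sc` predefined, of how both an inner join and a full outer join can be rebuilt from the cogroup result with flatMapValues. The names `grouped`, `innerViaCogroup` and `outerViaCogroup` are introduced here for illustration; the built-in join and fullOuterJoin are used only to compare against.

```scala
// Assumes spark-shell, where `sc` is already defined. Same data as the example above.
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)

val grouped = rdd1.cogroup(rdd2)

// Inner join: pair every left value with every right value.
// A key with an empty side yields no pairs, so it disappears.
val innerViaCogroup = grouped.flatMapValues { case (vs, ws) =>
  for (v <- vs; w <- ws) yield (v, w)
}

// Full outer join: same idea, but an empty side is represented by a single None.
val outerViaCogroup = grouped.flatMapValues { case (vs, ws) =>
  val left: Iterable[Option[String]]  = if (vs.isEmpty) Seq(None) else vs.map(Some(_))
  val right: Iterable[Option[String]] = if (ws.isEmpty) Seq(None) else ws.map(Some(_))
  for (v <- left; w <- right) yield (v, w)
}

// Compare as Sets, since the order of collected elements is not guaranteed.
val sameInner = innerViaCogroup.collect().toSet == rdd1.join(rdd2).collect().toSet
val sameOuter = outerViaCogroup.collect().toSet == rdd1.fullOuterJoin(rdd2).collect().toSet
```

If all you need is the flattened rows, the built-in join, leftOuterJoin, rightOuterJoin or fullOuterJoin is the simpler choice. cogroup is worth reaching for when you want to process all values of a key from both sides together, for example to count them or to detect keys missing on one side.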

Good luck!

The Spark docs are at: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
