What's the difference between join and cogroup in Apache Spark?
Question
What's the difference between join and cogroup in Apache Spark? What's the use case for each method?
Answer
Let me help you sort them out; both are commonly used and important!
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
This is the prototype of join; look at it carefully. For example,
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
All keys that appear in the final result are common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
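As a side note, the inner-join semantics above can be modeled with plain Scala collections, so you can see exactly which pairs survive. This is just an illustrative sketch (ordinary Seqs standing in for the RDDs, no SparkContext needed), not how Spark implements join internally:

```scala
// Plain-Scala model of join's inner-join semantics (no Spark needed).
object JoinSketch extends App {
  val rdd1 = Seq(("A", "1"), ("B", "2"), ("C", "3"))
  val rdd2 = Seq(("A", "a"), ("C", "c"), ("D", "d"))

  // Keep only keys present on both sides, pairing every matching value combination.
  val joined = for {
    (k1, v) <- rdd1
    (k2, w) <- rdd2
    if k1 == k2
  } yield (k1, (v, w))

  println(joined) // List((A,(1,a)), (C,(3,c)))
}
```

Keys "B" and "D" are dropped because they exist on only one side, matching the Spark output above.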
But cogroup is different:
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
As long as a key appears in at least one of the two RDDs, it will appear in the final result. Let me clarify:
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.cogroup(rdd2).collect
res0: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(C,(CompactBuffer(3),CompactBuffer(c)))
)
This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result into one row per matched record, it gives you an iterable interface per key; what you do with it next is up to you!
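If you do want flattened FULL OUTER JOIN rows out of a cogroup-shaped result, you can expand the two iterables yourself, treating an empty side as a single None. The sketch below models the cogrouped structure with ordinary Scala collections (no Spark); the flattening logic is the part that carries over to real RDDs:

```scala
// Plain-Scala model: build a cogroup-like structure, then flatten it
// into FULL OUTER JOIN rows with Option values.
object CogroupSketch extends App {
  val rdd1 = Seq(("A", "1"), ("B", "2"), ("C", "3"))
  val rdd2 = Seq(("A", "a"), ("C", "c"), ("D", "d"))

  // Model cogroup: every key from either side, with grouped values from each.
  val left  = rdd1.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  val right = rdd2.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  val cogrouped = (left.keySet ++ right.keySet).map { k =>
    k -> (left.getOrElse(k, Seq.empty), right.getOrElse(k, Seq.empty))
  }.toMap

  // Flatten: an empty side contributes a single None, a non-empty side
  // contributes one Some(...) per value, and we take the cross product.
  val fullOuter = cogrouped.toSeq.flatMap { case (k, (vs, ws)) =>
    val vOpts = if (vs.isEmpty) Seq(None) else vs.map(Some(_))
    val wOpts = if (ws.isEmpty) Seq(None) else ws.map(Some(_))
    for (v <- vOpts; w <- wOpts) yield (k, (v, w))
  }

  println(fullOuter.sortBy(_._1))
  // (A,(Some(1),Some(a))), (B,(Some(2),None)),
  // (C,(Some(3),Some(c))), (D,(None,Some(d)))
}
```

On real RDDs the same flattening would be a flatMapValues over the cogroup result; Spark's built-in fullOuterJoin does exactly this for you.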
Good luck!
Spark docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions