Spark: way to join each row of a dataframe with all rows of another dataframe
Question
Assume I have the following dataframes:
val df1 = sc.parallelize(Seq("a1" -> "a2", "b1" -> "b2", "c1" -> "c2")).toDF("a", "b")
val df2 = sc.parallelize(Seq("aa1" -> "aa2", "bb1" -> "bb2")).toDF("aa", "bb")
I want the following result:
| a | b | aa | bb |
----------------------
| a1 | a2 | aa1 | aa2 |
| a1 | a2 | bb1 | bb2 |
| b1 | b2 | aa1 | aa2 |
| b1 | b2 | bb1 | bb2 |
| c1 | c2 | aa1 | aa2 |
| c1 | c2 | bb1 | bb2 |
So each row of df1 should be mapped to all of the rows of df2. The way I am doing it is the following:
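// lit comes from the functions object (spark-shell imports it automatically)
import org.apache.spark.sql.functions.lit

// Add a constant column to both sides so that every pair of rows matches
val df1_dummy = df1.withColumn("dummy_df1", lit("dummy"))
val df2_dummy = df2.withColumn("dummy_df2", lit("dummy"))
val desired_result = df1_dummy
  .join(df2_dummy, $"dummy_df1" === $"dummy_df2", "left")
  .drop("dummy_df1")
  .drop("dummy_df2")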
It gives the desired result, but it seems like a poor way of doing it. Is there a more efficient way to do this? Any recommendations?
Recommended answer
This is what crossJoin is for:
val result = df1.crossJoin(df2)
result.show()
// +---+---+---+---+
// |a |b |aa |bb |
// +---+---+---+---+
// |a1 |a2 |aa1|aa2|
// |a1 |a2 |bb1|bb2|
// |b1 |b2 |aa1|aa2|
// |b1 |b2 |bb1|bb2|
// |c1 |c2 |aa1|aa2|
// |c1 |c2 |bb1|bb2|
// +---+---+---+---+
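For anyone who wants to try this outside spark-shell, here is a minimal self-contained sketch. The local session settings and the app name are assumptions made for illustration only, and the broadcast hint at the end is an optional extra that is not part of the original answer. It may help when one side is small, but whether it actually improves performance depends on the data and the cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cross-join-sketch")
  .getOrCreate()
import spark.implicits._

val df1 = Seq("a1" -> "a2", "b1" -> "b2", "c1" -> "c2").toDF("a", "b")
val df2 = Seq("aa1" -> "aa2", "bb1" -> "bb2").toDF("aa", "bb")

// Cartesian product: every row of df1 is paired with every row of df2.
val result = df1.crossJoin(df2)

// The result has (rows in df1) * (rows in df2) rows.
assert(result.count() == df1.count() * df2.count())

// Optional: hint that the smaller side should be broadcast to the executors.
val hinted = df1.crossJoin(broadcast(df2))
```

Keep in mind that the row count of a cross join is the product of the two inputs, so the output grows very quickly when both dataframes are large.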