Spark: specify multiple column conditions for a DataFrame join
Question

How can I specify multiple column conditions when joining two DataFrames? For example, I want to run the following:
val Lead_all = Leads.join(Utm_Master,
Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")
I want to join only when these columns match. But the syntax above is not valid, since columns only takes a single string. So how do I get what I want?
Answer
There is a Spark column/expression API for joins that covers exactly this case:
Leaddetails.join(
Utm_Master,
Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
&& Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
&& Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
&& Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
"left"
)
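Since the join keys have the same names in both DataFrames here, a shorter form is also worth knowing (a sketch, assuming a Spark version where join accepts a sequence of column names plus a join type). Note that this form uses ordinary, not null-safe, equality, and it keeps only one copy of each join column in the result:

```scala
// Alternative sketch: join on same-named columns by listing them.
// Uses plain equality; rows where a key column is null will not match.
val Lead_all = Leaddetails.join(
  Utm_Master,
  Seq("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"),
  "left"
)
```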
The <=> operator in the example means "equality test that is safe for null values".
The main difference from the simple equality test (===) is that <=> is safe to use when one of the columns may contain null values.
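To see why this matters, the two comparisons can be modeled in plain Scala, with Option standing in for a nullable column (a sketch; eq3 and eqNullSafe are illustrative names, not Spark APIs):

```scala
object NullSafeEqDemo {
  // === behaves like SQL three-valued logic: a null on either side gives
  // an unknown result, which a join condition treats as "no match".
  def eq3(a: Option[String], b: Option[String]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // <=> is null-safe: two nulls compare equal; null vs. a value is false.
  def eqNullSafe(a: Option[String], b: Option[String]): Boolean = (a, b) match {
    case (None, None)       => true
    case (Some(x), Some(y)) => x == y
    case _                  => false
  }

  def main(args: Array[String]): Unit = {
    println(eq3(Some("email"), None))   // None: row dropped by a === join
    println(eqNullSafe(None, None))     // true: rows still match under <=>
  }
}
```

So with ===, rows whose key is null on either side silently disappear from the join, while <=> still matches two nulls against each other.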