Spark specify multiple column conditions for dataframe join
Question
How do I specify conditions on multiple columns when joining two dataframes? For example, I want to run the following:
val Lead_all = Leads.join(Utm_Master,
Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")
I want to join only when these columns match. But the above syntax is not valid, as cols only takes one string. So how do I get what I want?
Answer
There is a Spark column/expression API (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L514) for joins in such cases:
Leads.join(
Utm_Master,
Leads("LeadSource") <=> Utm_Master("LeadSource")
&& Leads("Utm_Source") <=> Utm_Master("Utm_Source")
&& Leads("Utm_Medium") <=> Utm_Master("Utm_Medium")
&& Leads("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
"left"
)
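When the join keys have identical names on both sides, a more compact alternative is the `usingColumns` overload of `join`, which takes a sequence of column names and also removes the duplicated key columns from the result. A minimal sketch, assuming `Leads` and `Utm_Master` both contain the four columns from the question:

```scala
// Sketch: join on a Seq of identically named columns.
// Assumes both DataFrames contain these four columns.
val joinCols = Seq("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign")

val Lead_all = Leads.join(Utm_Master, joinCols, "left")
```

Note that this form uses ordinary equality, so rows where one of the key columns is null will not match; the `<=>` expression form below is needed if null keys should be treated as equal.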
The <=> operator in the example means "equality test that is safe for null values" (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L320).
The main difference from the simple equality test (===, https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L132) is that the first one is safe to use in case one of the columns may have null values.
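The difference shows up directly in join results. A small illustrative sketch (the column name and values are made up for the example): with a null key, === evaluates to null and the row is dropped, while <=> evaluates to true for a null/null pair:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("null-safe-eq").getOrCreate()
import spark.implicits._

val left  = Seq(Some("google"), None).toDF("LeadSource")
val right = Seq(Some("google"), None).toDF("LeadSource")

// Plain equality: null === null evaluates to null, so that pair is not matched.
left.join(right, left("LeadSource") === right("LeadSource")).count()  // 1

// Null-safe equality: null <=> null is true, so both rows match.
left.join(right, left("LeadSource") <=> right("LeadSource")).count()  // 2
```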