Spark: specify multiple column conditions for a DataFrame join


Question

How can I give multiple column conditions when joining two DataFrames? For example, I want to run the following:

val Lead_all = Leads.join(Utm_Master,  
    Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

I want to join only when these columns match. However, the syntax above is not valid, because cols only takes one string. How do I get what I want?

Recommended answer

Spark has a column/expression API join for this case (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L514):

Leads.join(
    Utm_Master, 
    Leads("LeadSource") <=> Utm_Master("LeadSource")
        && Leads("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leads("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leads("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)
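When the key columns have the same names on both sides, the same condition can be built from a list of column names instead of being written out by hand. The following is a minimal sketch, not part of the original answer. It assumes Spark 2.x or later (for SparkSession) and a local master; the helper nullSafeCondition, the object name, and the sample rows are all made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

object MultiColumnJoin {
  // Hypothetical helper: AND together one null-safe equality test per key column.
  def nullSafeCondition(left: DataFrame, right: DataFrame, keys: Seq[String]): Column =
    keys.map(k => left(k) <=> right(k)).reduce(_ && _)

  // Builds two small DataFrames and left-joins them on all four key columns.
  def joinLeads(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val keys = Seq("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign")

    // Made-up sample rows; only the column names come from the question.
    val Leads = Seq(
      ("web", "google", "cpc", "spring"),
      ("web", "bing", "cpc", "summer")
    ).toDF(keys: _*)
    val Utm_Master = Seq(
      ("web", "google", "cpc", "spring")
    ).toDF(keys: _*)

    Leads.join(Utm_Master, nullSafeCondition(Leads, Utm_Master, keys), "left")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: run locally for the demo
      .appName("multi-column-join")
      .getOrCreate()
    joinLeads(spark).show()
    spark.stop()
  }
}
```

Because this is a left join, every row of Leads is kept; the columns coming from Utm_Master are null for leads that have no match.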

The <=> operator in this example means "Equality test that is safe for null values" (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L320).

The main difference from the simple equality test, === (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L132), is that <=> is safe to use when one of the columns may contain null values. With ===, a comparison involving null evaluates to null rather than true, so rows whose key is null never match; with <=>, two nulls compare as equal.
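To make the difference concrete, here is a small sketch that joins two DataFrames containing a null key, once with === and once with <=>. It is an illustration only: it assumes Spark 2.x or later, and the object name, the countMatches helper, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object NullSafeEquality {
  // Returns (matches using ===, matches using <=>) for an inner join on Utm_Source.
  def countMatches(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Made-up rows: one ordinary key and one null key on each side.
    val left = Seq((1, "google"), (2, null)).toDF("id", "Utm_Source")
    val right = Seq((10, "google"), (20, null)).toDF("id", "Utm_Source")

    // === yields null when either side is null, so the null rows do not match.
    val strict = left.join(right, left("Utm_Source") === right("Utm_Source"))
    // <=> treats two nulls as equal, so the null rows match each other.
    val nullSafe = left.join(right, left("Utm_Source") <=> right("Utm_Source"))

    (strict.count(), nullSafe.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: run locally for the demo
      .appName("null-safe-equality")
      .getOrCreate()
    println(countMatches(spark)) // prints (1,2)
    spark.stop()
  }
}
```

If null keys should not match each other in your data, === is the appropriate choice; <=> is useful when a missing value on both sides should count as a match.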
