Spark: specify multiple column conditions for a dataframe join


Problem description

How can I specify multiple column conditions when joining two dataframes? For example, I want to run the following:

val Lead_all = Leads.join(Utm_Master,  
    Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

I want to join only when these columns match. But the syntax above is not valid, because `columns` only takes a single string. So how do I get what I want?

Recommended answer

Spark's column/expression API supports a join for exactly this case:

Leaddetails.join(
    Utm_Master, 
    Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
        && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)
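When the join keys have the same names in both dataframes, the condition above can also be built programmatically rather than written out column by column. A minimal sketch, assuming the same `Leaddetails` and `Utm_Master` dataframes as above:

```scala
// Build one null-safe equality test per key column, then combine
// the per-column tests with logical AND into a single join condition.
val joinCols = Seq("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign")

val condition = joinCols
  .map(c => Leaddetails(c) <=> Utm_Master(c)) // null-safe test per column
  .reduce(_ && _)                             // AND them all together

val Lead_all = Leaddetails.join(Utm_Master, condition, "left")
```

Adding or removing a join key is then a one-element change to `joinCols` instead of editing the condition expression.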

The `<=>` operator in the example means "equality test that is safe for null values".

The main difference from the simple equality test (`===`) is that `<=>` is safe to use when one of the columns may contain null values: `null <=> null` evaluates to true, whereas `null === null` does not match.
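If null-safe matching is not required, a `Seq` of column names can be passed to `join` instead of a boolean expression. This variant uses ordinary equality (rows with a null key on either side will not match) and keeps only one copy of each join column in the result. A sketch, again assuming the same dataframes:

```scala
// Join on identically named columns using plain (null-unsafe) equality.
// The listed columns appear only once in the output dataframe.
val Lead_all = Leaddetails.join(
  Utm_Master,
  Seq("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"),
  "left"
)
```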

