Join in spark dataframe (scala) based on not null values
Question
Suppose I have two dataframes like the following:
First -
A | B | C | D
1a | 1b | 1c | 1d
2a | null | 2c | 2d
3a | null | null | 3d
4a | 4b | null | null
5a | null | null | null
6a | 6b | 6c | null
Second -
P | B | C | D
1p | 1b | 1c | 1d
2p | 2b | 2c | 2d
3p | 3b | 3c | 3d
4p | 4b | 4c | 4d
5p | 5b | 5c | 5d
6p | 6b | 6c | 6d
The join operation is performed based on {"B", "C", "D"}. If a null occurs in any of these columns, the match should fall back to the remaining non-null columns.
So the result should look like -
P | B | C | D | A
1p | 1b | 1c | 1d | 1a
2p | null | 2c | 2d | 2a
3p | null | null | 3d | 3a
4p | 4b | null | null | 4a // C and D were null in First, so we match on B only
6p | 6b | 6c | null | 6a
Can anyone suggest a solution for this query? Currently I am filtering rows that have nulls in one, two, or three of these columns, and then joining each subset with Second on the columns that remain. For example, I first filter out rows from First where only B is null, then join them with Second based on "C" and "D". This way I end up with many dataframes, which I finally union together.
Answer
Here is what you can do:
import org.apache.spark.sql.functions._
df1.join(broadcast(df2), df1("B") === df2("B") || df1("C") === df2("C") || df1("D") === df2("D"))
.drop(df2("B"))
.drop(df2("C"))
.drop(df2("D"))
.show(false)
To be safer, you can broadcast the dataframe which is smaller in size, as done above with broadcast(df2). Note that a comparison like df1("B") === df2("B") evaluates to null (treated as false) when either side is null, so only non-null, equal values can link two rows, which is exactly the fallback behaviour you asked for.
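The join condition's null behaviour can be illustrated in plain Scala, outside Spark, by modelling nullable columns as Options. This is a minimal sketch for clarity; the Rec case class and the matches function are illustrative names, not Spark API.

```scala
// Each record carries the three join columns, nullable via Option.
case class Rec(b: Option[String], c: Option[String], d: Option[String])

// Mirrors df1("B") === df2("B") || df1("C") === df2("C") || df1("D") === df2("D"):
// a null on either side makes that single comparison fail, so two rows match
// only if at least one column is non-null on both sides and equal.
def matches(x: Rec, y: Rec): Boolean =
  Seq((x.b, y.b), (x.c, y.c), (x.d, y.d)).exists {
    case (Some(l), Some(r)) => l == r
    case _                  => false
  }

val first  = Rec(Some("4b"), None, None)              // row 4a from First
val second = Rec(Some("4b"), Some("4c"), Some("4d"))  // row 4p from Second
println(matches(first, second)) // true: they agree on B

val allNull = Rec(None, None, None)                   // row 5a from First
println(matches(allNull, second)) // false: nothing to match on, row is dropped
```

This also explains why the 5a row (all three columns null) is absent from the expected output: no comparison can succeed, so the inner join drops it.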