Join in spark dataframe (scala) based on not null values
Question
Suppose I have two dataframes like the following:
First -
A | B | C | D
1a | 1b | 1c | 1d
2a | null | 2c | 2d
3a | null | null | 3d
4a | 4b | null | null
5a | null | null | null
6a | 6b | 6c | null
Second -
P | B | C | D
1p | 1b | 1c | 1d
2p | 2b | 2c | 2d
3p | 3b | 3c | 3d
4p | 4b | 4c | 4d
5p | 5b | 5c | 5d
6p | 6b | 6c | 6d
The join operation is performed based on {"B", "C", "D"}. If a null occurs in any of these columns, the match should fall back to the remaining non-null columns.
So the result should look like -
P | B | C | D | A
1p | 1b | 1c | 1d | 1a
2p | null | 2c | 2d | 2a
3p | null | null | 3d | 3a
4p | 4b | null | null | 4a // C and D were null in First, so we match on B only
6p | 6b | 6c | null | 6a
Can anyone suggest a solution for this query? Currently I am filtering rows that have nulls in one, two, or three of these columns, and then joining each subset with Second on the columns that remain. For example, I first filter out rows from First where only B is null, then join them with Second based on "C" and "D". This way I end up with many dataframes, which I finally union together.
Answer
Here is what you can do:
import org.apache.spark.sql.functions._
df1.join(broadcast(df2), df1("B") === df2("B") || df1("C") === df2("C") || df1("D") === df2("D"))
.drop(df2("B"))
.drop(df2("C"))
.drop(df2("D"))
.show(false)
To be safer, you can broadcast the dataframe which is smaller in size, as done above with broadcast(df2). Note that a comparison like df1("B") === df2("B") evaluates to null (treated as false) when either side is null, so only non-null, equal values can link two rows, which is exactly the fallback behaviour you asked for.
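The join condition's null behaviour can be illustrated in plain Scala, outside Spark, by modelling nullable columns as Options. This is a minimal sketch for clarity; the Rec case class and the matches function are illustrative names, not Spark API.

```scala
// Each record carries the three join columns, nullable via Option.
case class Rec(b: Option[String], c: Option[String], d: Option[String])

// Mirrors df1("B") === df2("B") || df1("C") === df2("C") || df1("D") === df2("D"):
// a null on either side makes that single comparison fail, so two rows match
// only if at least one column is non-null on both sides and equal.
def matches(x: Rec, y: Rec): Boolean =
  Seq((x.b, y.b), (x.c, y.c), (x.d, y.d)).exists {
    case (Some(l), Some(r)) => l == r
    case _                  => false
  }

val first  = Rec(Some("4b"), None, None)              // row 4a from First
val second = Rec(Some("4b"), Some("4c"), Some("4d"))  // row 4p from Second
println(matches(first, second)) // true: they agree on B

val allNull = Rec(None, None, None)                   // row 5a from First
println(matches(allNull, second)) // false: nothing to match on, row is dropped
```

This also explains why the 5a row (all three columns null) is absent from the expected output: no comparison can succeed, so the inner join drops it.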