spark: merge two dataframes, if ID duplicated in two dataframes, the row in df1 overwrites the row in df2
Question
There are two DataFrames, df1 and df2, with the same schema. ID is the primary key.
I need to merge df1 and df2. This can be done with union, except for one special requirement: if a row with the same ID exists in both df1 and df2, I need to keep the one from df1.
df1:
ID col1 col2
1 AA 2019
2 B 2018
df2:
ID col1 col2
1 A 2019
3 C 2017
I need the following output:
df1:
ID col1 col2
1 AA 2019
2 B 2018
3 C 2017
How can I do this? Thanks. I think it is possible to register two temp tables, do a full join, and use coalesce, but I would rather not go that way, because in practice there are about 40 columns instead of the 3 in the example above.
Answer
Given that the two DataFrames have the same schema, you can simply union df1 with the left_anti join of df2 and df1:
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show
// +---+----+----+
// | ID|col1|col2|
// +---+----+----+
// |  1|  AA|2019|
// |  2|   B|2018|
// |  3|   C|2017|
// +---+----+----+
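For completeness, a self-contained sketch that builds the two example DataFrames from the question and runs the one-liner above; it assumes an active SparkSession bound to a variable named spark:

import spark.implicits._   // assumes an active SparkSession named `spark`

val df1 = Seq((1, "AA", 2019), (2, "B", 2018)).toDF("ID", "col1", "col2")
val df2 = Seq((1, "A", 2019), (3, "C", 2017)).toDF("ID", "col1", "col2")

// left_anti keeps only the df2 rows whose ID does not appear in df1,
// so the union can never re-introduce a duplicated ID.
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show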