spark: merge two dataframes, if ID duplicated in two dataframes, the row in df1 overwrites the row in df2
Question
There are two DataFrames, df1 and df2, with the same schema. ID is the primary key.
I need to merge df1 and df2. This can be done with union, except for one special requirement: if a row with the same ID exists in both df1 and df2, I need to keep the one from df1.
df1:
ID col1 col2
1 AA 2019
2 B 2018
df2:
ID col1 col2
1 A 2019
3 C 2017
I need the following output:
df1:
ID col1 col2
1 AA 2019
2 B 2018
3 C 2017
How can I do this? Thanks. I think it is possible to register two temp tables, do a full join, and use coalesce, but I would rather not go that way, because in practice there are about 40 columns instead of the 3 in the example above.
Answer
Given that the two DataFrames have the same schema, you can simply union df1 with the left_anti join of df2 and df1:
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show
// +---+----+----+
// | ID|col1|col2|
// +---+----+----+
// |  1|  AA|2019|
// |  2|   B|2018|
// |  3|   C|2017|
// +---+----+----+
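For completeness, a self-contained sketch that builds the two example DataFrames from the question and runs the one-liner above; it assumes an active SparkSession bound to a variable named spark:

import spark.implicits._   // assumes an active SparkSession named `spark`

val df1 = Seq((1, "AA", 2019), (2, "B", 2018)).toDF("ID", "col1", "col2")
val df2 = Seq((1, "A", 2019), (3, "C", 2017)).toDF("ID", "col1", "col2")

// left_anti keeps only the df2 rows whose ID does not appear in df1,
// so the union can never re-introduce a duplicated ID.
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show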