How to filter duplicate records having multiple keys in Spark DataFrame?
Problem description
I have two dataframes. I want to delete some records in Data Frame-A based on some common column values in Data Frame-B.
For Example: Data Frame-A:
A B C D
1 2 3 4
3 4 5 7
4 7 9 6
2 5 7 9
Data Frame-B:
A B C D
1 2 3 7
2 5 7 4
2 9 8 7
Keys: A,B,C columns
Desired Output:
A B C D
3 4 5 7
4 7 9 6
Any solution for this?
Solution
You are looking for a left anti-join:
df_a.join(df_b, Seq("A","B","C"), "leftanti").show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 3| 4| 5| 7|
| 4| 7| 9| 6|
+---+---+---+---+
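The left anti-join keeps exactly the rows of df_a whose (A, B, C) key never appears in df_b. The semantics can be sketched in plain Python on the example data, without a Spark session (the variable names mirror the DataFrames above; this is an illustration of the join logic, not Spark itself):

```python
# Rows of each DataFrame as (A, B, C, D) tuples, copied from the example.
df_a = [(1, 2, 3, 4), (3, 4, 5, 7), (4, 7, 9, 6), (2, 5, 7, 9)]
df_b = [(1, 2, 3, 7), (2, 5, 7, 4), (2, 9, 8, 7)]

# Collect the (A, B, C) keys present in df_b.
keys_b = {(a, b, c) for (a, b, c, _d) in df_b}

# Left anti-join: keep a df_a row only if its key is absent from df_b.
result = [row for row in df_a if row[:3] not in keys_b]
print(result)  # [(3, 4, 5, 7), (4, 7, 9, 6)]
```

This matches the output table above: the rows (1, 2, 3, 4) and (2, 5, 7, 9) are dropped because their (A, B, C) keys also occur in Data Frame-B, regardless of the differing D values.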