Pyspark retain only distinct (drop all duplicates)
Question
After joining two dataframes (which each have their own IDs), I have some duplicates (repeated IDs from both sources). I want to drop all rows that are duplicated on either ID, i.e. not retain even a single occurrence of a duplicate.
I can group by the first ID, do a count, filter for count == 1, repeat that for the second ID, and then inner-join these outputs back to the original joined dataframe - but this feels a bit long.
Is there a simpler method, like dropDuplicates(), but where none of the duplicates are left behind?
I see pandas has an option not to keep any occurrence of a duplicate: df.drop_duplicates(subset=['A', 'C'], keep=False)
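For reference, the pandas behaviour mentioned above looks like this (a minimal sketch; the column values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3], 'C': ['x', 'x', 'y', 'z']})

# keep=False drops *every* row that participates in a duplicate,
# instead of keeping the first (or last) occurrence
deduped = df.drop_duplicates(subset=['A', 'C'], keep=False)
# both (1, 'x') rows are removed; (2, 'y') and (3, 'z') remain
```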
Answer
dropDuplicates() always keeps one occurrence of each duplicate, so it won't do here. Instead, count the occurrences per criteria column and keep only the values that occur exactly once (the code below fixes the grouping and the comparison operator from the original snippet):

from pyspark.sql.functions import col, lit, sum

df = df.groupBy(criteria_col).agg(sum(lit(1)).alias('freq'))
df = df.filter(col('freq') == 1)