Pyspark retain only distinct (drop all duplicates)


Question

After joining two dataframes (which have their own IDs), I have some duplicates (repeated IDs from both sources). I want to drop all rows that are duplicates on either ID (so not retain even a single occurrence of a duplicate).

I can group by the first ID, do a count and filter for count == 1, then repeat that for the second ID, then inner join these outputs back to the original joined dataframe, but this feels a bit long.
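
For reference, that longer approach might look like the sketch below (joined, id1 and id2 are hypothetical names for my joined dataframe and its two ID columns):

from pyspark.sql import functions as F

# Keep only the ID values that occur exactly once on each side.
unique_id1 = joined.groupBy('id1').count().filter(F.col('count') == 1).select('id1')
unique_id2 = joined.groupBy('id2').count().filter(F.col('count') == 1).select('id2')

# Inner joins drop every row whose id1 or id2 appears more than once.
result = joined.join(unique_id1, on='id1').join(unique_id2, on='id2')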

Is there a simpler method like dropDuplicates(), but where none of the duplicates are kept?

I see pandas has an option to keep none of the duplicates: df.drop_duplicates(subset=['A', 'C'], keep=False)

Answer

dropDuplicates()

According to the official documentation:

Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.

To drop duplicates considering all columns:

df.dropDuplicates()

If you want to drop duplicates based on a certain column:

df.dropDuplicates(subset=[col_name])

For multiple columns:

df.dropDuplicates(subset=[col_name1, col_name2])
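
A small self-contained example (the sample data is made up for illustration) showing that dropDuplicates() retains one occurrence of each duplicate rather than removing all of them, which is why it does not directly answer the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (1, 'b'), (2, 'c')], ['id', 'val'])

# One of the two rows with id == 1 survives (which one is not guaranteed),
# so the result has exactly one row per distinct id.
df.dropDuplicates(subset=['id']).show()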

Edit (from the comments):

from pyspark.sql.functions import col, lit, sum

# Count occurrences per value of the criteria column, keep only values that occur once.
df = df.groupBy(criteria_col).agg(sum(lit(1)).alias('freq'))
df = df.filter(col('freq') == 1)
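
To get the pandas keep=False behaviour across both IDs in one pass, a window-function variant can avoid the extra joins. This is a sketch, not part of the original answer; joined, id1 and id2 are hypothetical names for the joined dataframe and its two ID columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count how often each ID value occurs, without collapsing rows.
w1 = Window.partitionBy('id1')
w2 = Window.partitionBy('id2')

result = (joined
          .withColumn('freq1', F.count('*').over(w1))
          .withColumn('freq2', F.count('*').over(w2))
          .filter((F.col('freq1') == 1) & (F.col('freq2') == 1))
          .drop('freq1', 'freq2'))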

