Pyspark retain only distinct (drop all duplicates)


Question

After joining two dataframes (which have their own IDs), I have some duplicates (repeated IDs from both sources). I want to drop all rows that are duplicates on either ID (so not retain even a single occurrence of a duplicate).

I can group by the first ID, do a count and filter for count == 1, then repeat that for the second ID, then inner join these outputs back to the original joined dataframe, but this feels a bit long.
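
For reference, that longer approach might look like the sketch below (joined, id1 and id2 are hypothetical names for my joined dataframe and its two ID columns):

from pyspark.sql import functions as F

# Keep only the ID values that occur exactly once on each side.
unique_id1 = joined.groupBy('id1').count().filter(F.col('count') == 1).select('id1')
unique_id2 = joined.groupBy('id2').count().filter(F.col('count') == 1).select('id2')

# Inner joins drop every row whose id1 or id2 appears more than once.
result = joined.join(unique_id1, on='id1').join(unique_id2, on='id2')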

Is there a simpler method like dropDuplicates(), but where none of the duplicates are kept?

I see pandas has an option to keep none of the duplicates: df.drop_duplicates(subset=['A', 'C'], keep=False)

Answer

dropDuplicates()

According to the official documentation:

Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.

To drop duplicates considering all columns:

df.dropDuplicates()

If you want to drop duplicates based on a certain column:

df.dropDuplicates(subset=[col_name])

For multiple columns:

df.dropDuplicates(subset=[col_name1, col_name2])
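
A small self-contained example (the sample data is made up for illustration) showing that dropDuplicates() retains one occurrence of each duplicate rather than removing all of them, which is why it does not directly answer the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (1, 'b'), (2, 'c')], ['id', 'val'])

# One of the two rows with id == 1 survives (which one is not guaranteed),
# so the result has exactly one row per distinct id.
df.dropDuplicates(subset=['id']).show()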

Edit (from the comments):

from pyspark.sql.functions import col, lit, sum

# Count occurrences per value of the criteria column, keep only values that occur once.
df = df.groupBy(criteria_col).agg(sum(lit(1)).alias('freq'))
df = df.filter(col('freq') == 1)
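
To get the pandas keep=False behaviour across both IDs in one pass, a window-function variant can avoid the extra joins. This is a sketch, not part of the original answer; joined, id1 and id2 are hypothetical names for the joined dataframe and its two ID columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count how often each ID value occurs, without collapsing rows.
w1 = Window.partitionBy('id1')
w2 = Window.partitionBy('id2')

result = (joined
          .withColumn('freq1', F.count('*').over(w1))
          .withColumn('freq2', F.count('*').over(w2))
          .filter((F.col('freq1') == 1) & (F.col('freq2') == 1))
          .drop('freq1', 'freq2'))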

