Find duplicate rows in a pandas DataFrame

Published: 2020/5/24 2:09:00. Tags: python, pandas, dataframe, duplicates


Problem description

I am trying to find duplicate rows in a pandas DataFrame.

import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]], columns=['col1', 'col2'])

df
Out[15]: 
   col1  col2
0     1     2
1     3     4
2     1     2
3     1     4
4     1     2

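# True for every row that repeats an earlier (col1, col2) pair
duplicate_bool = df.duplicated(subset=['col1', 'col2'], keep='first')
# A boolean Series can be used as the mask directly; no "== True" needed
duplicate = df.loc[duplicate_bool]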

duplicate
Out[16]: 
   col1  col2
2     1     2
4     1     2

Is there a way to add a column referring to the index of the first duplicate (the one kept)? The desired result would look like this:

duplicate
Out[16]: 
   col1  col2  index_original
2     1     2               0
4     1     2               0

Note: df could be very large in my case.

Recommended answer

Use groupby to create a new column holding the index of each group's first row, and then call duplicated:

df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')    
df[df.duplicated(subset=['col1','col2'], keep='first')]

   col1  col2  index_original
2     1     2               0
4     1     2               0

Details

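I group by the first two columns and then call transform + idxmin to get the first index of each group. This works because col1 is one of the grouping keys, so it is constant within each group, and idxmin returns the label of the first row when several rows tie for the minimum.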

df.groupby(['col1', 'col2']).col1.transform('idxmin') 

0    0
1    1
2    0
3    3
4    0
Name: col1, dtype: int64
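
The idxmin approach leans on the tie-breaking behaviour described above. If the intent is simply "the first row of each group", one possible variant is to group the index itself and take its first label per group. This is a sketch rather than part of the original answer, and the name first_idx is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=['col1', 'col2'],
)

# Turn the index into a Series, group it by the key columns, and
# broadcast the first index label of each group back to every row.
# `first_idx` is a hypothetical name, not something from the answer.
first_idx = df.index.to_series().groupby([df['col1'], df['col2']]).transform('first')

print(first_idx.tolist())  # [0, 1, 0, 3, 0]
```

Because it never compares column values, this variant should also behave the same way when the index labels are not integers.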

duplicated gives me a boolean mask that is True for every repeated row after its first occurrence, which are the rows I want to keep in the result:

df.duplicated(subset=['col1','col2'], keep='first')

0    False
1    False
2     True
3    False
4     True
dtype: bool
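
The keep argument decides which occurrence of a repeated row is left unflagged. The short comparison below is an illustrative sketch on the same df; the variable names are assumptions, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=['col1', 'col2'],
)
cols = ['col1', 'col2']

# keep='first': the first occurrence is not flagged (rows 2 and 4 are True)
mask_first = df.duplicated(subset=cols, keep='first')
# keep='last': the last occurrence is not flagged (rows 0 and 2 are True)
mask_last = df.duplicated(subset=cols, keep='last')
# keep=False: every member of a repeated group is flagged (rows 0, 2 and 4)
mask_all = df.duplicated(subset=cols, keep=False)

print(mask_first.tolist())
print(mask_last.tolist())
print(mask_all.tolist())
```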

The rest is just boolean indexing.
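
Put together, the steps might look like the following self-contained sketch. It uses assign so that the original df is left unmodified, which is a choice made here and not something the answer requires; key_cols and with_original are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=['col1', 'col2'],
)
key_cols = ['col1', 'col2']

# Step 1: attach the index of the first row of each (col1, col2) group.
with_original = df.assign(
    index_original=df.groupby(key_cols)['col1'].transform('idxmin')
)

# Step 2: build the mask of repeated rows and use it for boolean indexing.
mask = with_original.duplicated(subset=key_cols, keep='first')
duplicate = with_original[mask]

print(duplicate)
```

Passing subset=key_cols matters here: without it, duplicated would also compare the new index_original column.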
