Find indices of duplicate rows in pandas DataFrame
Question
What is the pandas way of finding the indices of identical rows within a given DataFrame without iterating over individual rows?

While it is possible to find all duplicated rows with unique = df[df.duplicated()], then iterate over those entries with unique.iterrows() and extract the indices of equal rows with the help of pd.where(), what is the idiomatic pandas way of doing it?
Example: Given a DataFrame of the following structure:
  | param_a | param_b | param_c
1 |       0 |       0 |       0
2 |       0 |       2 |       1
3 |       2 |       1 |       1
4 |       0 |       2 |       1
5 |       2 |       1 |       1
6 |       0 |       0 |       0
Output:
[(1, 6), (2, 4), (3, 5)]
Answer
Use duplicated with keep=False to mark all duplicate rows, then groupby on all columns and convert the index values of each group to a tuple; finally convert the resulting Series to a list:
df = df[df.duplicated(keep=False)]
df = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()
print (df)
[(1, 6), (2, 4), (3, 5)]
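Putting the pieces together, a minimal self-contained sketch (the DataFrame construction is an assumption matching the example table above):

```python
import pandas as pd

# Rebuild the example DataFrame from the question, including its 1-based index
df = pd.DataFrame({'param_a': [0, 0, 2, 0, 2, 0],
                   'param_b': [0, 2, 1, 2, 1, 0],
                   'param_c': [0, 1, 1, 1, 1, 0]},
                  index=[1, 2, 3, 4, 5, 6])

# keep=False marks every member of a duplicate set, not just the later ones
dupes = df[df.duplicated(keep=False)]

# Group by all columns; each group's index becomes one tuple of row labels
result = dupes.groupby(list(dupes)).apply(lambda x: tuple(x.index)).tolist()
print(result)  # [(1, 6), (2, 4), (3, 5)]
```

Note that list(dupes) is simply the list of column names, so the groupby runs over all columns at once.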
If you also want to see the duplicated values:
# df here is the filtered DataFrame of duplicate rows, before the .tolist() step
df1 = (df.groupby(df.columns.tolist())
         .apply(lambda x: tuple(x.index))
         .reset_index(name='idx'))
print (df1)
param_a param_b param_c idx
0 0 0 0 (1, 6)
1 0 2 1 (2, 4)
2 2 1 1 (3, 5)
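An equivalent sketch (my own variant, not from the original answer) avoids the apply step entirely by reading the groupby's groups mapping, which already associates each unique value combination with its row labels:

```python
import pandas as pd

# Same example DataFrame as above (an assumption matching the question's table)
df = pd.DataFrame({'param_a': [0, 0, 2, 0, 2, 0],
                   'param_b': [0, 2, 1, 2, 1, 0],
                   'param_c': [0, 1, 1, 1, 1, 0]},
                  index=[1, 2, 3, 4, 5, 6])

dupes = df[df.duplicated(keep=False)]

# .groups maps each (param_a, param_b, param_c) tuple to an Index of row labels
groups = dupes.groupby(list(dupes)).groups
result = [tuple(idx) for idx in groups.values()]
print(result)  # [(1, 6), (2, 4), (3, 5)]
```

Because no Python-level function is applied per group, this variant can be slightly faster on large frames, at the cost of losing the tidy DataFrame output of the df1 version.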