Find indices of duplicate rows in pandas DataFrame


Problem description

What is the pandas way of finding the indices of identical rows within a given DataFrame without iterating over individual rows?

While it is possible to find the duplicated rows with unique = df[df.duplicated()], then iterate over those entries with unique.iterrows() and extract the indices of equal entries with the help of pd.where(), what is the pandas way of doing it?

Example: given a DataFrame with the following structure:

  | param_a | param_b | param_c
1 | 0       | 0       | 0
2 | 0       | 2       | 1
3 | 2       | 1       | 1
4 | 0       | 2       | 1
5 | 2       | 1       | 1
6 | 0       | 0       | 0

Output:

[(1, 6), (2, 4), (3, 5)]

Recommended answer

Use DataFrame.duplicated with keep=False to select every row that has a duplicate, then group by all columns and convert each group's index values to a tuple. Finally, convert the resulting Series to a list:

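# keep=False marks every occurrence of a duplicated row, not only the repeats
df = df[df.duplicated(keep=False)]

# group identical rows on all columns and collect each group's index labels
idx = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()
print (idx)
[(1, 6), (2, 4), (3, 5)]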
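
For readers who want to run these steps end to end, here is a minimal self-contained sketch. It rebuilds the example frame from the question (the index labels 1 to 6 are read off the table) and uses the names dupes and pairs, which are illustrative and not part of the original answer.

```python
import pandas as pd

# Example frame from the question; index labels 1..6 are taken from the table.
df = pd.DataFrame(
    {
        "param_a": [0, 0, 2, 0, 2, 0],
        "param_b": [0, 2, 1, 2, 1, 0],
        "param_c": [0, 1, 1, 1, 1, 0],
    },
    index=[1, 2, 3, 4, 5, 6],
)

# keep=False flags every member of a duplicate group, including the first one.
dupes = df[df.duplicated(keep=False)]

# Group identical rows on all columns and collect each group's index labels.
pairs = dupes.groupby(list(dupes.columns)).apply(lambda g: tuple(g.index)).tolist()
print(pairs)
```

Because groupby sorts the group keys by default, the tuples come out ordered by row values rather than by first appearance. Recent pandas versions may emit a deprecation warning about apply operating on the grouping columns; the result is unaffected here because only the index is read.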

If you also want to see the duplicated values:

df1 = (df.groupby(df.columns.tolist())
       .apply(lambda x: tuple(x.index))
       .reset_index(name='idx'))
print (df1)
   param_a  param_b  param_c     idx
0        0        0        0  (1, 6)
1        0        2        1  (2, 4)
2        2        1        1  (3, 5)
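
As an alternative sketch (not from the original answer), the groups attribute of a GroupBy object maps each group key to the index labels of its rows, so the same values-to-indices mapping can be built without apply. The helper name duplicate_index_groups is hypothetical and only used for illustration.

```python
import pandas as pd


def duplicate_index_groups(frame: pd.DataFrame) -> dict:
    """Map each duplicated row's group key to a tuple of its index labels."""
    # Keep only rows that occur more than once.
    dupes = frame[frame.duplicated(keep=False)]
    if dupes.empty:
        return {}
    # GroupBy.groups is dict-like: group key -> Index of row labels.
    groups = dupes.groupby(list(dupes.columns)).groups
    return {key: tuple(labels) for key, labels in groups.items()}


# Example frame from the question; index labels 1..6 are taken from the table.
df = pd.DataFrame(
    {
        "param_a": [0, 0, 2, 0, 2, 0],
        "param_b": [0, 2, 1, 2, 1, 0],
        "param_c": [0, 1, 1, 1, 1, 0],
    },
    index=[1, 2, 3, 4, 5, 6],
)
result = duplicate_index_groups(df)
print(sorted(result.values()))
```

The values are sorted before printing so the output does not depend on the iteration order of the groups mapping.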
