Find all duplicate rows in a pandas dataframe

Question

I would like to be able to get the indices of all the instances of a duplicated row in a dataset without knowing the name and number of columns beforehand. So assume I have this:

     col
1  |  1
2  |  2
3  |  1
4  |  1
5  |  2

I'd like to be able to get [1, 3, 4] and [2, 5]. Is there any way to achieve this? It sounds really simple, but since I don't know the columns beforehand I can't do something like df[col == x...].

Recommended answer

First filter all duplicated rows, then either use groupby with apply, or convert the index with to_series and group that:

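# keep=False marks every occurrence of a repeated value, not only the later ones
df = df[df.col.duplicated(keep=False)]

# Option 1: group the filtered rows and collect each group's index labels
a = df.groupby('col').apply(lambda x: list(x.index))
print(a)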
col
1    [1, 3, 4]
2       [2, 5]
dtype: object


# Option 2: turn the index into a Series and group it by the column values
a = df.index.to_series().groupby(df.col).apply(list)
print(a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object

If a nested list is needed:

L = df.groupby('col').apply(lambda x: list(x.index)).tolist()
print(L)
[[1, 3, 4], [2, 5]]

If only the first column should be used, it can be selected by position with iloc:

# Parentheses are required so the chained call can continue on the next line
a = (df[df.iloc[:, 0].duplicated(keep=False)]
       .groupby(df.iloc[:, 0]).apply(lambda x: list(x.index)))
print(a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object
