How to drop duplicates based on two or more subset criteria in a Pandas data frame

Question

Let's say this is my data frame:

import pandas as pd

df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})

It looks like this:

  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f

I want to drop row 1 because it has the same bio and center as row 0. I want to keep row 2 because it has the same bio but a different center than row 0.

Something like this won't work, given how drop_duplicates takes its arguments, but it's what I am trying to do:

df.drop_duplicates(subset = 'bio' & subset = 'center' )

Any suggestions?

Edit: changed df a bit to fit the example in the accepted answer.

Answer

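Your syntax is wrong. subset takes a single list of column labels, not several separate subset arguments. Here's the correct way:

df.drop_duplicates(subset=['bio', 'center'])

Leave outcome out of the list. Rows 0 and 1 differ in outcome ('f' vs 't'), so subset=['bio', 'center', 'outcome'], or a bare df.drop_duplicates(), which compares all columns, would keep both rows.

With the default keep='first', the call above returns the following: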

  bio center outcome
0   1    one       f
2   1    two       f
3   4  three       f
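
As a quick check, here is a minimal runnable sketch that uses the question's data to compare deduplicating on bio and center only against deduplicating on every column. The variable names by_pair and by_all are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})

# Compare only bio and center; the first row of each duplicate group is kept.
by_pair = df.drop_duplicates(subset=['bio', 'center'])

# Compare every column; rows 0 and 1 differ in outcome, so nothing is dropped.
by_all = df.drop_duplicates()

print(by_pair.index.tolist())  # [0, 2, 3]
print(by_all.index.tolist())   # [0, 1, 2, 3]
```

The original index labels are preserved, which is why the result shows 0, 2, 3 rather than 0, 1, 2.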

Take a look at the df.drop_duplicates documentation for syntax details. subset should be a sequence of column labels.
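
The same documentation also describes the keep parameter, which the question does not use but which controls which row of each duplicate group survives. The sketch below, again on the question's data, is only an illustration of those options; the variable names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})

cols = ['bio', 'center']

# keep='last' retains the last row of each (bio, center) group instead of the first.
keep_last = df.drop_duplicates(subset=cols, keep='last')

# keep=False drops every row that has a duplicate, keeping only unique pairs.
unique_only = df.drop_duplicates(subset=cols, keep=False)

# duplicated() returns a boolean mask marking the rows that would be dropped.
mask = df.duplicated(subset=cols)

print(keep_last.index.tolist())    # [1, 2, 3]
print(unique_only.index.tolist())  # [2, 3]
print(mask.tolist())               # [False, True, False, False]
```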
