How to conditionally remove duplicates from a pandas dataframe

Question

Consider the following dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame({'A' : [1, 2, 3, 3, 4, 4, 5, 6, 7],
                   'B' : ['a','b','c','c','d','d','e','f','g'],
                   'Col_1' :[np.nan, 'A','A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
                   'Col_2' :[2,2,3,3,3,3,4,4,5]})
df
Out[92]: 
    A  B Col_1  Col_2
 0  1  a   NaN      2
 1  2  b     A      2
 2  3  c     A      3
 3  3  c   NaN      3
 4  4  d     B      3
 5  4  d   NaN      3
 6  5  e     B      4
 7  6  f   NaN      4
 8  7  g   NaN      5

I want to remove all rows that are duplicates with respect to columns 'A' and 'B'. Within each set of duplicates, I want to drop the row that has a NaN entry (I know that every set of duplicates contains one NaN row and one non-NaN row). The end result should look like this:

    A  B Col_1  Col_2
 0  1  a   NaN      2
 1  2  b     A      2
 2  3  c     A      3
 4  4  d     B      3
 6  5  e     B      4
 7  6  f   NaN      4
 8  7  g   NaN      5

All efficient one-liners are most welcome.

Recommended answer

Here is an alternative:

df[~((df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1)))]
#    A  B Col_1  Col_2
# 0  1  a   NaN      2
# 1  2  b     A      2
# 2  3  c     A      3
# 4  4  d     B      3
# 6  5  e     B      4
# 7  6  f   NaN      4
# 8  7  g   NaN      5
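
Note that df.isnull().any(axis=1) flags a row if any of its columns is null. In the sample data only Col_1 contains NaN, so the result is the same, but if only a missing Col_1 should count, the null check could be narrowed. The following is a minimal sketch of that variant; restricting the check to Col_1 is an assumption about the intent, not part of the original answer, and the names is_dup and col1_missing are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
                   'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
                   'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
                   'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5]})

# True for every row whose ('A', 'B') pair occurs more than once.
is_dup = df.duplicated(subset=['A', 'B'], keep=False)

# Assumed intent: only a missing Col_1 marks a row as the one to drop.
col1_missing = df['Col_1'].isna()

# Drop rows that are both duplicated and missing Col_1.
result = df[~(is_dup & col1_missing)]
print(result.index.tolist())
# [0, 1, 2, 4, 6, 7, 8]
```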

The expression in this answer uses the bitwise "not" operator ~ to negate the mask of rows that meet the joint condition of being a duplicate row (the argument keep=False makes the method return True for every non-unique row) and containing at least one null value. So where the expression df[['A', 'B']].duplicated(keep=False) returns this Series:

# 0    False
# 1    False
# 2     True
# 3     True
# 4     True
# 5     True
# 6    False
# 7    False
# 8    False

...and the expression df.isnull().any(axis=1) returns this Series:

# 0     True
# 1    False
# 2    False
# 3     True
# 4    False
# 5     True
# 6    False
# 7     True
# 8     True

...we wrap both in parentheses (a safe habit when combining several conditions with & or |, because those operators bind more tightly than comparisons such as == or !=), and then wrap them in parentheses again so that we can negate the entire expression (i.e. ~( ... )). The example below also adds a further condition, df['Col_2'] != 5, which is why row 8 evaluates to False:

~((df[['A','B']].duplicated(keep=False)) & (df.isnull().any(axis=1))) & (df['Col_2'] != 5)

# 0     True
# 1     True
# 2     True
# 3    False
# 4     True
# 5    False
# 6     True
# 7     True
# 8    False
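
To make the evaluation order easier to follow, the same masks can be built step by step. This is only a sketch that rebuilds the question's dataframe; the variable names is_dup, has_null, keep and keep_basic are illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
                   'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
                   'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
                   'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5]})

is_dup = df[['A', 'B']].duplicated(keep=False)   # duplicated on ('A', 'B')
has_null = df.isnull().any(axis=1)               # at least one null per row

# ~ binds more tightly than &, so the negation applies only to the
# parenthesised (is_dup & has_null); the Col_2 test is ANDed afterwards.
keep = ~(is_dup & has_null) & (df['Col_2'] != 5)
print(keep.tolist())
# [True, True, True, False, True, False, True, True, False]

# Without the extra Col_2 condition, row 8 is kept.
keep_basic = ~(is_dup & has_null)
print(df[keep_basic].index.tolist())
# [0, 1, 2, 4, 6, 7, 8]
```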

You can build more complex conditions with further use of the operators & ("and") and | ("or"). As with SQL, group your conditions as necessary with additional parentheses; for instance, filter on the logic "both condition X AND condition Y are true, or condition Z is true" with df[ ( (X) & (Y) ) | (Z) ].
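
As a concrete illustration of the ( (X) & (Y) ) | (Z) pattern, here is a short sketch on the question's dataframe. The three conditions are hypothetical, chosen only to show the grouping.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
                   'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
                   'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
                   'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5]})

# Hypothetical conditions for illustration only.
# Each comparison gets its own parentheses because & and | bind more
# tightly than comparison operators such as >= and ==.
x = (df['Col_2'] >= 3)        # X: Col_2 is at least 3
y = df['Col_1'].notna()       # Y: Col_1 is present
z = (df['A'] == 1)            # Z: A equals 1

# "(X and Y) or Z"
filtered = df[(x & y) | z]
print(filtered.index.tolist())
# [0, 2, 4, 6]
```

Naming the intermediate masks keeps the final indexing expression short and makes each condition easy to inspect on its own.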
