How to conditionally remove duplicates from a pandas dataframe
Problem description
Consider the following dataframe
import numpy as np
import pandas as pd
# np.nan is used instead of np.NaN, since the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'A' : [1, 2, 3, 3, 4, 4, 5, 6, 7],
                   'B' : ['a','b','c','c','d','d','e','f','g'],
                   'Col_1' :[np.nan, 'A','A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
                   'Col_2' :[2,2,3,3,3,3,4,4,5]})
df
Out[92]:
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
3 3 c NaN 3
4 4 d B 3
5 4 d NaN 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
I want to remove all rows that are duplicates with regard to columns 'A' and 'B'. Within each group of duplicates, I want to remove the row that has a NaN entry (I know that for all duplicates there will be one NaN and one non-NaN entry). The end result should look like this
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
4 4 d B 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
All efficient one-liners are most welcome.
Recommended answer
Here is an alternative:
df[~((df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1)))]
# A B Col_1 Col_2
# 0 1 a NaN 2
# 1 2 b A 2
# 2 3 c A 3
# 4 4 d B 3
# 6 5 e B 4
# 7 6 f NaN 4
# 8 7 g NaN 5
This uses the bitwise "not" operator ~ to negate the mask of rows that meet the joint condition of being a duplicate row (the argument keep=False causes the method to evaluate to True for all non-unique rows) and containing at least one null value. So where the expression df[['A', 'B']].duplicated(keep=False) returns this Series:
# 0 False
# 1 False
# 2 True
# 3 True
# 4 True
# 5 True
# 6 False
# 7 False
# 8 False
...and the expression df.isnull().any(axis=1) returns this Series:
# 0 True
# 1 False
# 2 False
# 3 True
# 4 False
# 5 True
# 6 False
# 7 True
# 8 True
...we wrap both in parentheses (a habit worth keeping, and required whenever a condition contains a comparison such as != or ==, because Python's & binds more tightly than comparison operators), and then wrap them in parentheses again so that we can negate the entire expression (i.e. ~( ... )). The example below also ANDs the negated mask with one further condition, df['Col_2'] != 5, which is why row 8 comes out False:
~((df[['A','B']].duplicated(keep=False)) & (df.isnull().any(axis=1))) & (df['Col_2'] != 5)
# 0 True
# 1 True
# 2 True
# 3 False
# 4 True
# 5 False
# 6 True
# 7 True
# 8 False
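To make the walkthrough above easier to reproduce, here is a minimal sketch that builds the two intermediate masks separately and then combines them. The variable names is_dup, has_null, and keep are illustrative choices, not part of the original answer, and the extra Col_2 condition is left out so that the result matches the output requested in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
    'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
    'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
    'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5],
})

# True for every row whose ('A', 'B') pair occurs more than once
is_dup = df[['A', 'B']].duplicated(keep=False)

# True for every row that contains at least one null value
has_null = df.isnull().any(axis=1)

# Keep a row unless it is both a duplicate and contains a null
keep = ~(is_dup & has_null)

result = df[keep]
print(result.index.tolist())  # [0, 1, 2, 4, 6, 7, 8]
```

Naming the masks costs a few extra lines compared with the one-liner, but it makes each condition easy to inspect on its own.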
You can build more complex conditions with further use of the element-wise logical operators & (the "and" operator) and | (the "or" operator). As with SQL, group your conditions as necessary with additional parentheses; for instance, filter based on the logic "both condition X AND condition Y are true, or condition Z is true" with df[ ( (X) & (Y) ) | (Z) ].
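As a rough illustration of the ( (X) & (Y) ) | (Z) pattern, the sketch below applies it to the same dataframe. The three conditions x, y, and z are hypothetical and were chosen only to show the grouping; they are not part of the original question or answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
    'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
    'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
    'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5],
})

# Hypothetical conditions, each producing a boolean Series
x = df['Col_2'] == 3        # X: Col_2 equals 3
y = df['Col_1'].notnull()   # Y: Col_1 is not null
z = df['A'] == 7            # Z: A equals 7

# Rows where (X and Y) holds, or where Z holds
filtered = df[(x & y) | z]
print(filtered.index.tolist())  # [2, 4, 8]
```

Assigning each comparison to a name first avoids the operator-precedence pitfall entirely, since & and | are then applied to already-evaluated boolean Series.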