Pandas: remove SOME duplicate values based on conditions
Question
I have a dataset:
id url keep_if_dup
1 A.com Yes
2 A.com Yes
3 B.com No
4 B.com No
5 C.com No
I want to remove duplicates, i.e. keep only the first occurrence of each "url" value, BUT keep the duplicates when the "keep_if_dup" field is "Yes".
Expected output:
id url keep_if_dup
1 A.com Yes
2 A.com Yes
3 B.com No
5 C.com No
What I tried:
Dataframe = Dataframe.drop_duplicates(subset='url', keep='first')
which of course does not take the "keep_if_dup" field into account. The output is:
id url keep_if_dup
1 A.com Yes
3 B.com No
5 C.com No
Recommended answer
You can pass multiple boolean conditions to loc. The first condition keeps all rows where 'keep_if_dup' == 'Yes'; it is ORed (using |) with the inverted boolean mask of whether the value in the 'url' column is a duplicate:
In [79]:
df.loc[(df['keep_if_dup'] =='Yes') | ~df['url'].duplicated()]
Out[79]:
id url keep_if_dup
0 1 A.com Yes
1 2 A.com Yes
2 3 B.com No
4 5 C.com No
To overwrite your df, assign the result back to it:
df = df.loc[(df['keep_if_dup'] =='Yes') | ~df['url'].duplicated()]
Breaking down the above shows the two boolean masks:
In [80]:
~df['url'].duplicated()
Out[80]:
0 True
1 False
2 True
3 False
4 True
Name: url, dtype: bool
In [81]:
df['keep_if_dup'] =='Yes'
Out[81]:
0 True
1 True
2 False
3 False
4 False
Name: keep_if_dup, dtype: bool
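For anyone who wants to reproduce this outside an interactive session, here is a minimal self-contained sketch. It rebuilds the sample data from the question and applies the same two masks; the variable names flagged, first_seen and result are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_if_dup": ["Yes", "Yes", "No", "No", "No"],
})

# Rows flagged to survive even when their url is repeated.
# (Hypothetical name, chosen for this sketch.)
flagged = df["keep_if_dup"] == "Yes"

# duplicated() marks every repeat after the first occurrence as True,
# so inverting it selects the first occurrence of each url.
first_seen = ~df["url"].duplicated()

# Keep a row if either condition holds.
result = df.loc[flagged | first_seen]

print(result)
```

Note that the surviving rows keep their original index labels, so the index has a gap where the row with id 4 was dropped. If a consecutive index is wanted, calling result.reset_index(drop=True) should provide one.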