How to perform drop_duplicates with multiple conditions in a pandas DataFrame
Question
I have a df:
Sr.No Name Class Data
0 1 Sri 1 sri is a good player
1 '' Sri 2 sri is good in cricket
2 '' Sri 3 sri went out
3 2 Ram 1 Ram is a good player
4 '' Ram 2 sri is good in cricket
5 '' Ram 3 Ram went out
6 3 Sri 1 sri is a good player
7 '' Sri 2 sri is good in cricket
8 '' Sri 3 sri went out
9 4 Sri 1 sri is a good player
10 '' Sri 2 sri is good in cricket
11 '' Sri 3 sri went out
12 '' Sri 4 sri came back
I am trying to drop duplicates based on ["Name", "Class", "Data"]. The goal is to drop duplicates based on all sentences per Sr.No: a block of rows should be dropped only when its full set of sentences repeats an earlier block.
My expected output is
out_df
Sr.No Name Class Data
0 1 Sri 1 sri is a good player
1 Sri 2 sri is good in cricket
2 Sri 3 sri went out
3 2 Ram 1 Ram is a good player
4 Ram 2 sri is good in cricket
5 Ram 3 Ram went out
9 4 Sri 1 sri is a good player
10 Sri 2 sri is good in cricket
11 Sri 3 sri went out
12 Sri 4 sri came back
Recommended answer
Create a dummy column with a groupby + transform operation.
v = df.groupby(df['Class'].diff().le(0).cumsum())['Data'].transform(' '.join)
Or,
v = df['Data'].groupby(df['Class'].diff().le(0).cumsum()).transform(' '.join)
This dummy column becomes an extra key when deciding which rows are to be dropped.
m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])
df[~m]
Class Data Name Sr.No
0 1 sri is a good player Sri 1
1 2 sri is good in cricket Sri
2 3 sri went out Sri
3 1 Ram is a good player Ram 2
4 2 sri is good in cricket Ram
5 3 Ram went out Ram
9 1 sri is a good player Sri 4
10 2 sri is good in cricket Sri
11 3 sri went out Sri
12 4 sri came back Sri
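To try the answer end to end, here is a minimal sketch that rebuilds the sample frame and applies the same steps. It assumes the blank Sr.No cells are empty strings and that Class is stored as integers; the question does not state either, so adjust if your data differs. The name block is introduced here only for readability.

```python
import pandas as pd

# Sample data from the question. Blank Sr.No cells are assumed to be empty strings.
df = pd.DataFrame({
    "Sr.No": [1, "", "", 2, "", "", 3, "", "", 4, "", "", ""],
    "Name": ["Sri"] * 3 + ["Ram"] * 3 + ["Sri"] * 7,
    "Class": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
    "Data": [
        "sri is a good player", "sri is good in cricket", "sri went out",
        "Ram is a good player", "sri is good in cricket", "Ram went out",
        "sri is a good player", "sri is good in cricket", "sri went out",
        "sri is a good player", "sri is good in cricket", "sri went out",
        "sri came back",
    ],
})

# A new block starts wherever Class stops increasing.
block = df["Class"].diff().le(0).cumsum()

# Every row gets the joined text of its whole block.
v = df.groupby(block)["Data"].transform(" ".join)

# A row is a duplicate only if its own values AND its block text were seen before.
m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])
out_df = df[~m]

print(out_df.index.tolist())
```

Rows 6 to 8 are removed because their block repeats the first one exactly, while rows 9 to 11 survive because their block also contains "sri came back".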
Details
Form groups from the monotonically increasing Class values:
i = df['Class'].diff().le(0).cumsum()
i
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
10 3
11 3
12 3
Name: Class, dtype: int64
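The chained expression may be easier to follow when split into its three steps. The sketch below uses only the Class values from the question; the names step, starts, block_id and steps are illustrative and not part of the original answer.

```python
import pandas as pd

cls = pd.Series([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4], name="Class")

step = cls.diff()            # NaN on the first row, then the change from the previous row
starts = step.le(0)          # True where Class did not increase, i.e. a new block begins
block_id = starts.cumsum()   # running count of block starts, used as the group label

# Side-by-side view of each intermediate step.
steps = pd.DataFrame({"Class": cls, "diff": step, "new_block": starts, "block": block_id})
print(steps)
```

The first row compares as False because NaN is never less than or equal to 0, so it falls into block 0. If the blank Sr.No cells really are empty strings, something like df['Sr.No'].ne('').cumsum() would presumably give equivalent labels, but that depends on how the column is stored, which the question does not say.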
Use this to group, and transform Data with a str.join operation:
v = df.groupby(i)['Data'].transform(' '.join)
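The reason for transform rather than agg is that the joined string has to line up with the original rows. A small sketch on made-up data (the toy frame and its block column are assumptions for illustration only) shows the difference:

```python
import pandas as pd

# Toy frame: two blocks, the second one has an extra sentence.
toy = pd.DataFrame({
    "block": [0, 0, 1, 1, 1],
    "Data": ["a", "b", "a", "b", "c"],
})

# agg returns one joined string per block, indexed by the block label.
joined_per_block = toy.groupby("block")["Data"].agg(" ".join)

# transform returns one joined string per original row, keeping the original index.
joined_per_row = toy.groupby("block")["Data"].transform(" ".join)

print(joined_per_block)
print(joined_per_row)
```

Because the transform result shares the frame's index, it can be assigned straight back as a new column.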
The result is simply a column of joined strings, one per row, each holding the text of that row's whole block. Finally, assign the dummy column and call duplicated:
m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])
m
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 True
9 False
10 False
11 False
12 False
dtype: bool
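Since the question asks about drop_duplicates specifically, the same result can probably be reached without building the mask by hand. The sketch below wraps the steps in a hypothetical helper, drop_repeated_blocks, whose name, parameters, and temporary _joined column are assumptions made for illustration; it is one possible packaging, not the answer's own code.

```python
import pandas as pd

def drop_repeated_blocks(df, key_cols=("Name", "Class", "Data"),
                         order_col="Class", text_col="Data"):
    """Hypothetical helper: drop rows of blocks whose full text repeats an earlier block."""
    # A new block starts wherever order_col stops increasing.
    block = df[order_col].diff().le(0).cumsum()
    # Attach the joined block text to every row of the block.
    joined = df.groupby(block)[text_col].transform(" ".join)
    return (
        df.assign(_joined=joined)                  # temporary helper column
          .drop_duplicates([*key_cols, "_joined"]) # keeps the first occurrence
          .drop(columns="_joined")
    )

# Made-up data: block 2 repeats block 1, block 3 adds a new sentence.
toy = pd.DataFrame({
    "Name": ["Sri"] * 7,
    "Class": [1, 2, 1, 2, 1, 2, 3],
    "Data": ["x", "y", "x", "y", "x", "y", "z"],
})
result = drop_repeated_blocks(toy)
print(result)
```

drop_duplicates with a column subset keeps the first occurrence by default, which matches df[~m] for a mask built with duplicated on the same columns.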