Find duplicate rows among different groups with pandas
Problem description
Consider the following dataframe:
import pandas
data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pandas.DataFrame(data_so, columns=['ID', 'letter'])
I want to add a new column that is True for a letter that also appears in a different group (groups are defined by ID). Any further occurrences of that letter within the same group should be False.
I tried using
df_so['dup'] = df_so.duplicated(subset=['letter'], keep=False)
but the result is not what I want:
The first occurrence of A in group 1 (row 0) is True because there is a duplicate in another group (row 7). However, all other occurrences of A in the same group (row 2) should be False.
If row 7 is deleted, then row 0 should be False, because A is no longer present in any other group.
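To make the gap concrete, here is a small sketch that compares the attempted column with the desired one on the sample data. The `expected` values are written out by hand from the rules stated above; the names `attempt`, `expected` and `mismatch` are only illustrative.

```python
import pandas as pd

df_so = pd.DataFrame({
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
})

# The attempt: flags every letter that occurs more than once anywhere,
# regardless of which group the repeats are in.
attempt = df_so.duplicated(subset=['letter'], keep=False)

# Desired result, written by hand from the rules in the question.
expected = pd.Series([True, False, False, False, True, False, True, True])

# Rows where the attempt disagrees with the desired result.
mismatch = df_so[attempt != expected]
print(mismatch.index.tolist())  # [2]
```

Only row 2 differs: it is the second A inside group 100, which the attempt marks True.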
Recommended answer
What you need is essentially the AND of two different duplicated() calls.
~df_so.duplicated()
deals with duplicates within a group: it is True only for the first occurrence of each (ID, letter) pair.
df_so.drop_duplicates().duplicated(subset='letter',keep=False).fillna(True)
deals with duplicates between groups, ignoring duplicates within the current group.
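The two masks can be inspected separately, which may make the AND easier to follow. This is a sketch on the sample data from the question; the variable names are illustrative, and the explicit `reindex` step is a substitution of mine that makes the index alignment visible rather than leaving it to the `&` operator.

```python
import pandas as pd

df_so = pd.DataFrame({
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
})

# Mask 1: True for the first occurrence of each (ID, letter) pair.
first_in_group = ~df_so.duplicated()

# Mask 2: once repeated (ID, letter) pairs are dropped, a letter that still
# occurs more than once must belong to at least two different IDs.
deduped = df_so.drop_duplicates()
in_other_group = deduped.duplicated(subset='letter', keep=False)

# Row 2 was dropped above, so it has no entry in mask 2.
# Reindexing restores it with an explicit False.
in_other_group = in_other_group.reindex(df_so.index, fill_value=False)

print(first_in_group.tolist())
# [True, True, False, True, True, True, True, True]
print(in_other_group.tolist())
# [True, False, False, False, True, False, True, True]
print((first_in_group & in_other_group).tolist())
# [True, False, False, False, True, False, True, True]
```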
Code:
import pandas as pd
data_so = {'ID': [100, 100, 100, 200, 200, 300, 300, 300], 'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A']}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
# True only if the row is the first of its (ID, letter) pair AND the letter also occurs under another ID
df_so['dup'] = ~df_so.duplicated() & df_so.drop_duplicates().duplicated(subset='letter', keep=False).fillna(True)
print(df_so)
Output:
    ID letter    dup
0  100      A   True
1  100      B  False
2  100      A  False
3  200      C  False
4  200      D   True
5  300      E  False
6  300      D   True
7  300      A   True
Another case (row 7 removed):
data_so = { 'ID': [100, 100, 100, 200, 200, 300, 300], 'letter': ['A','B','A','C','D','E','D'] }
Output:
    ID letter    dup
0  100      A  False
1  100      B  False
2  100      A  False
3  200      C  False
4  200      D   True
5  300      E  False
6  300      D   True
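As a cross-check, the same column can be derived a different way: count how many distinct IDs each letter appears under, then keep only the first row of each (ID, letter) pair. This is an alternative sketch, not the method from the answer above; the function name `flag_cross_group_duplicates` and its parameters are hypothetical.

```python
import pandas as pd

def flag_cross_group_duplicates(df, group_col='ID', value_col='letter'):
    """Hypothetical helper: True where a value first appears in its group
    and also appears in at least one other group."""
    # Number of distinct groups each value appears in, broadcast to every row.
    groups_per_value = df.groupby(value_col)[group_col].transform('nunique')
    # First occurrence of each (group, value) pair.
    first_in_group = ~df.duplicated(subset=[group_col, value_col])
    return first_in_group & (groups_per_value > 1)

df_so = pd.DataFrame({
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
})
df_so['dup'] = flag_cross_group_duplicates(df_so)
print(df_so['dup'].tolist())
# [True, False, False, False, True, False, True, True]
```

Because both masks are computed on the full frame, they share the same index and no alignment or reindexing step is involved. On both sample inputs above it gives the same `dup` column as the answer's expression.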