Find Duplicates limited to multiple ranges - pandas
Problem description
Suppose our problem can be simplified like this:
import pandas as pd

df = pd.DataFrame()
df['C_rows'] = ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1']
df['values'] = ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344, 'customer3', 4321, 4444, 5674, 'customer4']
which gives the table:
C_rows values
0 C1 customer1
1 C2 4321
2 C3 1266
3 C2 5671
4 C1 customer2
5 C2 123
6 C3 7344
7 C1 customer3
8 C2 4321
9 C3 4444
10 C4 5674
11 C1 customer4
How can we vectorise finding duplicate C_rows values between each C1? For example, row 3 holds a duplicate, because C2 occurs in both row 1 and row 3 of the first block.
The dataset I am working with has 50,000 rows, with about 15 rows between each C1.
E.g. check for duplicates like this:
C_rows values
0 C1 customer1
1 C2 4321
2 C3 1266
3 C2 5671
C2 is duplicated
4 C1 customer2
5 C2 123
6 C3 7344
no duplicates
7 C1 customer3
8 C2 4321
9 C3 4444
10 C4 5674
no duplicates
The solution should avoid for loops and be quick (vectorised).
Recommended answer
For a very fast vectorized solution, create a new column that numbers the consecutive blocks between each C1, and then check the (C_rows, block) pairs with duplicated:
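# Every 'C1' starts a new block, so the cumulative sum of the C1 markers is a block id.
# duplicated() then flags rows whose (C_rows, block id) pair has already appeared.
df['dupe'] = df.assign(dupe=df['C_rows'].eq('C1').cumsum()).duplicated(['C_rows','dupe'])
print(df)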
C_rows values dupe
0 C1 customer1 False
1 C2 4321 False
2 C3 1266 False
3 C2 5671 True
4 C1 customer2 False
5 C2 123 False
6 C3 7344 False
7 C1 customer3 False
8 C2 4321 False
9 C3 4444 False
10 C4 5674 False
11 C1 customer4 False
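To make the grouping step easier to follow, here is a minimal sketch that prints the intermediate block ids on their own. The names `block`, `dupe` and `dupe_check` are illustrative and not part of the original answer, and the `groupby(...).cumcount()` line is only a cross-check of the same idea, not a faster alternative.

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1'],
})

# True at every 'C1'; the running total of those markers numbers the blocks.
# The Series is renamed so it does not clash with the 'C_rows' column name.
block = df['C_rows'].eq('C1').cumsum().rename('block')
print(block.tolist())
# [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4]

# A row is a duplicate if its (C_rows, block) pair was seen earlier.
dupe = df.assign(block=block).duplicated(['C_rows', 'block'])

# Cross-check: the second and later occurrences within each pair have cumcount > 0.
dupe_check = df.groupby([block, 'C_rows']).cumcount().gt(0)

print(dupe.tolist())
print(dupe.equals(dupe_check))
```

Because the block id is built with a single cumulative sum and the check is a single `duplicated` call, no Python-level loop over the blocks is needed.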
If you need to filter instead (starting again from the original df):
df = df[df.assign(dupe=df['C_rows'].eq('C1').cumsum()).duplicated(['C_rows','dupe'])]
print(df)
C_rows values
3 C2 5671
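The filter above keeps only the second occurrence (row 3). If both rows of a duplicated pair are wanted, as in the question's "rows 1 and 3", one option is the `keep=False` argument of `duplicated`. This is a sketch under that assumption; the column name `block` and the variable `mask` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# Block id: increases by one at every 'C1'.
block = df['C_rows'].eq('C1').cumsum()

# keep=False marks every member of a duplicated (C_rows, block) pair,
# not only the later occurrences.
mask = df.assign(block=block).duplicated(['C_rows', 'block'], keep=False)

print(df[mask])
```

With the sample data this selects the rows with index 1 and 3.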
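If you want to flag every row of a block that contains a duplicate (again starting from the original df):
# Block id per row: increases by one at every 'C1'.
df = df.assign(dupe=df['C_rows'].eq('C1').cumsum())
# Ids of the blocks that contain at least one repeated C_rows value.
a = df.loc[df.duplicated(['C_rows','dupe']), 'dupe']
df['dupe'] = df['dupe'].isin(a)
print(df)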
C_rows values dupe
0 C1 customer1 True
1 C2 4321 True
2 C3 1266 True
3 C2 5671 True
4 C1 customer2 False
5 C2 123 False
6 C3 7344 False
7 C1 customer3 False
8 C2 4321 False
9 C3 4444 False
10 C4 5674 False
11 C1 customer4 False
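The same block-level flag can also be sketched with `groupby(...).transform('any')`, which broadcasts "does this block contain any duplicate" back to every row. This is an alternative formulation, not the answer's own code, and the names `block`, `dupe` and `group_has_dupe` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# Block id: increases by one at every 'C1'.
block = df['C_rows'].eq('C1').cumsum()

# Row-level flag: True for a repeated C_rows value within its block.
dupe = df.assign(block=block).duplicated(['C_rows', 'block'])

# Block-level flag: True for every row of a block with at least one duplicate.
group_has_dupe = dupe.groupby(block).transform('any')

print(df.assign(dupe=group_has_dupe))
```

Whether this or the `isin` version is faster on 50,000 rows would need to be measured on the real data.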