Find Duplicates limited to multiple ranges - pandas


Question

Suppose our problem can be simplified like this:

import pandas as pd

df = pd.DataFrame()
df['C_rows'] = ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1']
df['values'] = ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344, 'customer3', 4321, 4444, 5674, 'customer4']

which gives the table:

    C_rows  values
0   C1      customer1
1   C2      4321
2   C3      1266
3   C2      5671
4   C1      customer2
5   C2      123
6   C3      7344
7   C1      customer3
8   C2      4321
9   C3      4444
10  C4      5674
11  C1      customer4

How can we find duplicate C_rows values between each C1 in a vectorised way? For example, row 3 is a duplicate because C2 occurs in both row 1 and row 3 of the same block. The dataset I am working with has 50,000 rows, with about 15 rows between each C1.

e.g. check duplicates like this:

    C_rows  values
0   C1      customer1
1   C2      4321
2   C3      1266
3   C2      5671

C2 is duplicated

4   C1      customer2
5   C2      123
6   C3      7344

No duplicates

7   C1      customer3
8   C2      4321
9   C3      4444
10  C4      5674

No duplicates

All of this without using for loops, and quickly (vectorised).

Recommended answer

For a very fast vectorized solution, create a helper column that numbers the blocks between C1 rows (the cumulative sum of a C_rows == 'C1' mask), then call duplicated on C_rows together with that helper column:

# Number each block by counting C1 rows so far, then flag repeated (C_rows, block) pairs
df['dupe'] = df.assign(dupe=df['C_rows'].eq('C1').cumsum()).duplicated(['C_rows', 'dupe'])
print(df)
   C_rows     values   dupe
0      C1  customer1  False
1      C2       4321  False
2      C3       1266  False
3      C2       5671   True
4      C1  customer2  False
5      C2        123  False
6      C3       7344  False
7      C1  customer3  False
8      C2       4321  False
9      C3       4444  False
10     C4       5674  False
11     C1  customer4  False
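
To see why this works, it can help to look at the intermediate block numbers. The following is a minimal sketch that rebuilds the sample data and keeps the cumulative sum in its own variable; the name `block` is illustrative and does not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# True at every C1 row; cumsum turns those markers into a running block number,
# so all rows from one C1 up to (but not including) the next share one label.
block = df['C_rows'].eq('C1').cumsum()

# A row is flagged when the same (C_rows, block) pair already appeared above it.
dupe = df.assign(block=block).duplicated(['C_rows', 'block'])

print(df.assign(block=block, dupe=dupe))
```

For the sample data the block numbers run 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, so the C2 in row 8 is not flagged even though its value column matches row 1: it sits in a different block and is the first C2 there.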

If you need to filter down to the duplicated rows only (starting again from the original df):

# Note: this overwrites df with the filtered rows; rebuild df before running the next snippet
df = df[df.assign(dupe=df['C_rows'].eq('C1').cumsum()).duplicated(['C_rows', 'dupe'])]
print(df)
  C_rows values
3     C2   5671
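
By default `duplicated` flags only the later occurrences. If every member of a duplicated pair is wanted (rows 1 and 3 here), the `keep=False` option of `DataFrame.duplicated` is one way to get it. This is a sketch that goes beyond the original answer; `block` and `mask` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# Running block number: increments at every C1 row.
block = df['C_rows'].eq('C1').cumsum()

# keep=False marks all rows whose (C_rows, block) pair occurs more than once,
# including the first occurrence.
mask = df.assign(block=block).duplicated(['C_rows', 'block'], keep=False)

# Filter into a new name so the original df stays available.
dupes = df[mask]
print(dupes)
```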

If you want to flag every row of a block that contains a duplicate (again starting from the original df):

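# Temporarily store the block number in 'dupe'
df = df.assign(dupe=df['C_rows'].eq('C1').cumsum())
# Block numbers that contain at least one repeated C_rows value
a = df.loc[df.duplicated(['C_rows', 'dupe']), 'dupe']
# Replace the block number with True/False: does this row belong to such a block?
df['dupe'] = df['dupe'].isin(a)
print(df)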
   C_rows     values   dupe
0      C1  customer1   True
1      C2       4321   True
2      C3       1266   True
3      C2       5671   True
4      C1  customer2  False
5      C2        123  False
6      C3       7344  False
7      C1  customer3  False
8      C2       4321  False
9      C3       4444  False
10     C4       5674  False
11     C1  customer4  False
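
The same block-level flag can also be expressed with a grouped transform, which some readers may find easier to follow than the `isin` step. This is an alternative sketch under the same assumptions as the snippets above, not part of the original answer; `block`, `dupe`, and `in_dupe_block` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'C_rows': ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1'],
    'values': ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344,
               'customer3', 4321, 4444, 5674, 'customer4'],
})

# Running block number: increments at every C1 row.
block = df['C_rows'].eq('C1').cumsum()

# Row-level flag: repeated (C_rows, block) pair.
dupe = df.assign(block=block).duplicated(['C_rows', 'block'])

# Broadcast "any duplicate in this block" back to every row of the block.
in_dupe_block = dupe.groupby(block).transform('any')

print(df.assign(dupe=in_dupe_block))
```

Both variants avoid Python-level loops over rows. Which one is faster on a 50,000-row frame would need to be measured.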
