Detecting almost duplicate rows
Problem description
Let's say I have a table that has dates and a value for each date (plus other columns). I can find the rows that have the same value on the same day by using
data.duplicated(subset=["VALUE", "DAY"], keep=False)
Now, say that I want to allow the day to be off by 1 or 2 and the value to be off by up to 10. How do I do it?
Example:
DAY MTH YYY VALUE NAME
22 9 2016 8.25 John
22 9 2016 43 John
6 11 2016 28.25 Mary
2 10 2016 50 George
23 11 2016 90 George
23 10 2016 30 Jenn
24 8 2016 10 Mike
24 9 2016 10 Mike
24 10 2016 10 Mike
24 11 2016 10 Mike
13 9 2016 170 Kathie
13 10 2016 170 Kathie
13 11 2016 160 Kathie
8 9 2016 16 Gina
9 10 2016 16 Gina
8 11 2016 16 Gina
16 11 2016 25 Ross
21 11 2016 45 Ross
23 9 2016 50 Shari
23 10 2016 50 Shari
23 11 2016 50 Shari
Using the above code I can find:
DAY MTH YYY VALUE NAME
24 8 2016 10 Mike
24 9 2016 10 Mike
24 10 2016 10 Mike
24 11 2016 10 Mike
23 9 2016 50 Shari
23 10 2016 50 Shari
23 11 2016 50 Shari
However, I would also like to detect the value 16 for Gina on Sep 8, Oct 9, and Nov 8, because the rows have the same value and, though not on the same day, are only one day off.
Similarly, I want to detect the values on Sep 13, Oct 13, and Nov 13 for Kathie, because the values differ by only 10.
How can I do this?
Recommended answer
Use numpy and triangle indexing to compare every pair of rows:
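import numpy as np

# df is the table from the question (called `data` there)
day = df.DAY.values
val = df.VALUE.values
# index pairs (i, j) with i < j, so each pair of rows is visited exactly once
i, j = np.triu_indices(len(df), k=1)
# strict thresholds; use <= 2 and <= 10 to allow the full tolerances asked for
c1 = np.abs(day[i] - day[j]) < 2
c2 = np.abs(val[i] - val[j]) < 10
c = c1 & c2
# keep every row that takes part in at least one matching pair
df.iloc[np.unique(np.append(i[c], j[c]))]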
DAY MTH YYY VALUE NAME
1 22 9 2016 43.0 John
6 24 8 2016 10.0 Mike
7 24 9 2016 10.0 Mike
8 24 10 2016 10.0 Mike
9 24 11 2016 10.0 Mike
10 13 9 2016 170.0 Kathie
11 13 10 2016 170.0 Kathie
13 8 9 2016 16.0 Gina
14 9 10 2016 16.0 Gina
15 8 11 2016 16.0 Gina
17 21 11 2016 45.0 Ross
18 23 9 2016 50.0 Shari
19 23 10 2016 50.0 Shari
20 23 11 2016 50.0 Shari
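Two things stand out in that output. John's 43 and Ross's 45 are flagged because the comparison looks only at DAY and VALUE, and Kathie's 160 is missing because the value threshold is strict. If that is not what you want, here is a minimal sketch of a variant with inclusive tolerances that can also require the same NAME. The function name `near_duplicates`, its parameters, and the `sample` frame are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd


def near_duplicates(df, day_tol=2, value_tol=10, same_name=True):
    """Return rows that have at least one near-duplicate partner.

    Hypothetical helper: tolerances are inclusive, and MTH/YYY are
    ignored, as in the approach above.
    """
    day = df["DAY"].to_numpy()
    val = df["VALUE"].to_numpy()
    # All index pairs (i, j) with i < j
    i, j = np.triu_indices(len(df), k=1)
    close = (np.abs(day[i] - day[j]) <= day_tol) & (
        np.abs(val[i] - val[j]) <= value_tol
    )
    if same_name:
        name = df["NAME"].to_numpy()
        close &= name[i] == name[j]
    # Rows that appear on either side of a matching pair
    return df.iloc[np.unique(np.concatenate([i[close], j[close]]))]


# A few rows taken from the question's example table
sample = pd.DataFrame({
    "DAY":   [22, 21, 13, 13, 13, 8, 9, 8],
    "MTH":   [9, 11, 9, 10, 11, 9, 10, 11],
    "VALUE": [43, 45, 170, 170, 160, 16, 16, 16],
    "NAME":  ["John", "Ross", "Kathie", "Kathie", "Kathie",
              "Gina", "Gina", "Gina"],
})

result = near_duplicates(sample)
print(result)
```

On this sample, the default call keeps the Kathie and Gina rows and drops the John/Ross pair. Passing `same_name=False` brings that pair back. Note that the number of pairs grows quadratically with the number of rows, so this is best suited to tables of modest size.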