Detecting almost duplicate rows

Question

Let's say I have a table that has dates and a value for each date (plus other columns). I can find the rows that have the same value on the same day by using:

data.duplicated(subset=["VALUE", "DAY"], keep=False)

Now suppose I want to allow the day to be off by 1 or 2 and the value to be off by up to 10. How do I do that?

Example:

DAY MTH YYY VALUE   NAME
22  9   2016    8.25    John
22  9   2016    43      John
6   11  2016    28.25   Mary
2   10  2016    50  George
23  11  2016    90  George
23  10  2016    30  Jenn
24  8   2016    10  Mike
24  9   2016    10  Mike
24  10  2016    10  Mike
24  11  2016    10  Mike
13  9   2016    170 Kathie
13  10  2016    170 Kathie
13  11  2016    160 Kathie
8   9   2016    16  Gina
9   10  2016    16  Gina
8   11  2016    16  Gina
16  11  2016    25  Ross
21  11  2016    45  Ross
23  9   2016    50  Shari
23  10  2016    50  Shari
23  11  2016    50  Shari

Using the code above, I can find:

DAY MTH YYY VALUE   NAME
24  8   2016    10  Mike
24  9   2016    10  Mike
24  10  2016    10  Mike
24  11  2016    10  Mike
23  9   2016    50  Shari
23  10  2016    50  Shari
23  11  2016    50  Shari
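
For reference, the exact-match check can be sketched on the sample rows as below. This is a minimal sketch, assuming the table is loaded into a DataFrame named data (the name used in the question's snippet); building it from an inline string with read_csv is only an illustration, not how the original data was loaded. On the rows exactly as listed in the example, the mask also flags Kathie's two rows with value 170 on day 13 and Gina's two rows on day 8, because those pairs share both VALUE and DAY.

```python
import io

import pandas as pd

# Sample rows from the question, typed in as whitespace-separated text.
raw = """DAY MTH YYY VALUE NAME
22 9 2016 8.25 John
22 9 2016 43 John
6 11 2016 28.25 Mary
2 10 2016 50 George
23 11 2016 90 George
23 10 2016 30 Jenn
24 8 2016 10 Mike
24 9 2016 10 Mike
24 10 2016 10 Mike
24 11 2016 10 Mike
13 9 2016 170 Kathie
13 10 2016 170 Kathie
13 11 2016 160 Kathie
8 9 2016 16 Gina
9 10 2016 16 Gina
8 11 2016 16 Gina
16 11 2016 25 Ross
21 11 2016 45 Ross
23 9 2016 50 Shari
23 10 2016 50 Shari
23 11 2016 50 Shari
"""
data = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# keep=False marks every member of a duplicate group,
# not only the second and later occurrences.
exact = data.duplicated(subset=["VALUE", "DAY"], keep=False)

# The boolean mask can be used directly to select the flagged rows.
print(data[exact])
```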

However, I would also like to detect the value 16 for Gina on Sep 8, Oct 9, and Nov 8, because those rows have the same value and, though not on the same day, are only a day off.

Similarly, I want to detect the values on Sep 13, Oct 13, and Nov 13 for Kathie, because the value is off by just 10.

How can I do this?

Recommended answer

Use NumPy and upper-triangle indexing to compare every pair of rows. Note that only DAY and VALUE are compared, so rows that belong to different names can match each other.

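import numpy as np

day = df.DAY.values
val = df.VALUE.values

# Index pairs (i, j) with i < j: every unordered pair of rows, once each.
i, j = np.triu_indices(len(df), k=1)
# Strict inequalities: days at most 1 apart, values less than 10 apart.
# Use <= 2 and <= 10 to match the tolerances stated in the question.
c1 = np.abs(day[i] - day[j]) < 2
c2 = np.abs(val[i] - val[j]) < 10

# Keep every row that takes part in at least one matching pair.
c = c1 & c2
df.iloc[np.unique(np.append(i[c], j[c]))]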

    DAY  MTH   YYY  VALUE    NAME
1    22    9  2016   43.0    John
6    24    8  2016   10.0    Mike
7    24    9  2016   10.0    Mike
8    24   10  2016   10.0    Mike
9    24   11  2016   10.0    Mike
10   13    9  2016  170.0  Kathie
11   13   10  2016  170.0  Kathie
13    8    9  2016   16.0    Gina
14    9   10  2016   16.0    Gina
15    8   11  2016   16.0    Gina
17   21   11  2016   45.0    Ross
18   23    9  2016   50.0   Shari
19   23   10  2016   50.0   Shari
20   23   11  2016   50.0   Shari
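
The output above includes John and Ross because they match rows of other people, and it leaves out Kathie's 160 because the comparison is strict. A possible variant is sketched below. It makes two assumptions that the question suggests but does not state outright: near-duplicates should only be matched within the same NAME, and the tolerances are inclusive (day off by up to 2, value off by up to 10). The helper name near_duplicate_mask and its parameters are hypothetical, and the inline data loading is only for illustration.

```python
import io

import numpy as np
import pandas as pd

# Sample rows from the question, typed in as whitespace-separated text.
raw = """DAY MTH YYY VALUE NAME
22 9 2016 8.25 John
22 9 2016 43 John
6 11 2016 28.25 Mary
2 10 2016 50 George
23 11 2016 90 George
23 10 2016 30 Jenn
24 8 2016 10 Mike
24 9 2016 10 Mike
24 10 2016 10 Mike
24 11 2016 10 Mike
13 9 2016 170 Kathie
13 10 2016 170 Kathie
13 11 2016 160 Kathie
8 9 2016 16 Gina
9 10 2016 16 Gina
8 11 2016 16 Gina
16 11 2016 25 Ross
21 11 2016 45 Ross
23 9 2016 50 Shari
23 10 2016 50 Shari
23 11 2016 50 Shari
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")


def near_duplicate_mask(frame, day_tol=2, val_tol=10):
    """Hypothetical helper: flag rows that have a near-duplicate under the same NAME."""
    day = frame["DAY"].to_numpy()
    val = frame["VALUE"].to_numpy()
    # One integer code per distinct name, so names compare as plain integers.
    name = pd.factorize(frame["NAME"])[0]

    # Every unordered pair of row positions (i < j).
    i, j = np.triu_indices(len(frame), k=1)
    close = (
        (np.abs(day[i] - day[j]) <= day_tol)    # assumption: inclusive day tolerance
        & (np.abs(val[i] - val[j]) <= val_tol)  # assumption: inclusive value tolerance
        & (name[i] == name[j])                  # assumption: match within one name only
    )

    # Mark both members of every matching pair.
    mask = np.zeros(len(frame), dtype=bool)
    mask[i[close]] = True
    mask[j[close]] = True
    return mask


mask = near_duplicate_mask(df)
print(df[mask])
```

Like the answer's code, this builds all n(n-1)/2 row pairs at once, so it suits small tables. It also compares DAY as a plain number, so day 31 and day 1 are treated as 30 days apart rather than adjacent.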
