Simultaneous fuzzy and non-fuzzy join

Question

Say I have this data frame:

# Set random seed
set.seed(33550336)

# Number of IDs
n <- 5

# Create data frames
df <- data.frame(ID = rep(1:n, each = 10), 
                 loc = seq(10, 100, by = 10))
#    ID loc
# 1   1  10
# 2   1  20
# 3   1  30
# 4   1  40
# 5   1  50
# 6   1  60
# 7   1  70
# 8   1  80
# 9   1  90
# 10  1 100
# 11  2  10
# 12  2  20
# 13  2  30
# 14  2  40
# 15  2  50
# 16  2  60
# 17  2  70
# 18  2  80
# 19  2  90
# 20  2 100
# 21  3  10
# 22  3  20
# 23  3  30
# 24  3  40
# 25  3  50
# 26  3  60
# 27  3  70
# 28  3  80
# 29  3  90
# 30  3 100
# 31  4  10
# 32  4  20
# 33  4  30
# 34  4  40
# 35  4  50
# 36  4  60
# 37  4  70
# 38  4  80
# 39  4  90
# 40  4 100
# 41  5  10
# 42  5  20
# 43  5  30
# 44  5  40
# 45  5  50
# 46  5  60
# 47  5  70
# 48  5  80
# 49  5  90
# 50  5 100

Now, I have a second data frame that I'd like to join to it:

df_alt <- data.frame(ID = rep(1:n, each = 10), 
                     loc = sample(1:100, 5 * n, replace = TRUE), 
                     value = runif(n))
#    ID loc     value
# 1   1  87 0.3202490
# 2   1  36 0.4724253
# 3   1  53 0.4750352
# 4   1   7 0.8744985
# 5   1  38 0.2016645
# 6   1  92 0.3202490
# 7   1  74 0.4724253
# 8   1  72 0.4750352
# 9   1  73 0.8744985
# 10  1  95 0.2016645
# 11  2  61 0.3202490
# 12  2   5 0.4724253
# 13  2  87 0.4750352
# 14  2  11 0.8744985
# 15  2  10 0.2016645
# 16  2  25 0.3202490
# 17  2  60 0.4724253
# 18  2  62 0.4750352
# 19  2  52 0.8744985
# 20  2  31 0.2016645
# 21  3   3 0.3202490
# 22  3  43 0.4724253
# 23  3  45 0.4750352
# 24  3  91 0.8744985
# 25  3  51 0.2016645
# 26  3  87 0.3202490
# 27  3  36 0.4724253
# 28  3  53 0.4750352
# 29  3   7 0.8744985
# 30  3  38 0.2016645
# 31  4  92 0.3202490
# 32  4  74 0.4724253
# 33  4  72 0.4750352
# 34  4  73 0.8744985
# 35  4  95 0.2016645
# 36  4  61 0.3202490
# 37  4   5 0.4724253
# 38  4  87 0.4750352
# 39  4  11 0.8744985
# 40  4  10 0.2016645
# 41  5  25 0.3202490
# 42  5  60 0.4724253
# 43  5  62 0.4750352
# 44  5  52 0.8744985
# 45  5  31 0.2016645
# 46  5   3 0.3202490
# 47  5  43 0.4724253
# 48  5  45 0.4750352
# 49  5  91 0.8744985
# 50  5  51 0.2016645

I'd like a perfect match for ID and the closest match for loc. I looked at the fuzzyjoin package, but unfortunately you cannot have different levels of fuzziness for different columns. That is, I cannot specify a perfect match for ID and a fuzzy match for loc. So, as a workaround, I do a left join by ID, calculate the distance between loc.x and loc.y (i.e., the loc values from the df and df_alt data frames, respectively), group by ID and loc.x, sort by the distance between the loc values, and take the first row (i.e., the shortest distance):

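# Bind and find nearest
library(dplyr)

df_res <- df %>% 
  left_join(df_alt, by = "ID") %>%        # exact match on ID: all row pairs per ID
  mutate(delta = abs(loc.x - loc.y)) %>%  # distance between the two loc values
  group_by(ID, loc.x) %>% 
  arrange(delta) %>% 
  filter(row_number() == 1) %>%           # keep the closest row in each group
  ungroup() %>% 
  arrange(ID, loc.x)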

# # A tibble: 50 x 5
#       ID loc.x loc.y value delta
#     <int> <dbl> <int> <dbl> <dbl>
# 1      1    10     7 0.874     3
# 2      1    20     7 0.874    13
# 3      1    30    36 0.472     6
# 4      1    40    38 0.202     2
# 5      1    50    53 0.475     3
# 6      1    60    53 0.475     7
# 7      1    70    72 0.475     2
# 8      1    80    74 0.472     6
# 9      1    90    92 0.320     2
# 10     1   100    95 0.202     5
# 11     2    10    10 0.202     0
# 12     2    20    25 0.320     5
# 13     2    30    31 0.202     1
# 14     2    40    31 0.202     9
# 15     2    50    52 0.874     2
# 16     2    60    60 0.472     0
# 17     2    70    62 0.475     8
# 18     2    80    87 0.475     7
# 19     2    90    87 0.475     3
# 20     2   100    87 0.475    13
# 21     3    10     7 0.874     3
# 22     3    20     7 0.874    13
# 23     3    30    36 0.472     6
# 24     3    40    38 0.202     2
# 25     3    50    51 0.202     1
# 26     3    60    53 0.475     7
# 27     3    70    87 0.320    17
# 28     3    80    87 0.320     7
# 29     3    90    91 0.874     1
# 30     3   100    91 0.874     9
# 31     4    10    10 0.202     0
# 32     4    20    11 0.874     9
# 33     4    30    11 0.874    19
# 34     4    40    61 0.320    21
# 35     4    50    61 0.320    11
# 36     4    60    61 0.320     1
# 37     4    70    72 0.475     2
# 38     4    80    74 0.472     6
# 39     4    90    92 0.320     2
# 40     4   100    95 0.202     5
# 41     5    10     3 0.320     7
# 42     5    20    25 0.320     5
# 43     5    30    31 0.202     1
# 44     5    40    43 0.472     3
# 45     5    50    51 0.202     1
# 46     5    60    60 0.472     0
# 47     5    70    62 0.475     8
# 48     5    80    91 0.874    11
# 49     5    90    91 0.874     1
# 50     5   100    91 0.874     9

This isn't particularly efficient, but it gives the desired result. The problem arises when the data frames get large. Rerunning the above code with a sufficiently large n produces the following error:

Error: cannot allocate vector of size...

I think this is because the left join produces an unnecessarily huge intermediate data frame. Clearly, join-then-filter isn't the best strategy. But what is the best way to do a fuzzy and non-fuzzy join simultaneously?
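To see how quickly the intermediate result grows, the number of rows the left join would produce can be counted without building it. This is only a minimal sketch: it assumes every ID in df also appears in df_alt (as in the example data), and the names rows_x, rows_y, shared and join_rows are introduced here purely for illustration.

```r
# Rebuild the example data from the question (n = 5)
set.seed(33550336)
n <- 5
df <- data.frame(ID = rep(1:n, each = 10),
                 loc = seq(10, 100, by = 10))
df_alt <- data.frame(ID = rep(1:n, each = 10),
                     loc = sample(1:100, 5 * n, replace = TRUE),
                     value = runif(n))

# Number of rows per ID in each table
rows_x <- table(df$ID)
rows_y <- table(df_alt$ID)

# A join by ID pairs every df row with every df_alt row of the same ID,
# so the intermediate size is the sum over IDs of rows_x * rows_y
shared <- intersect(names(rows_x), names(rows_y))
join_rows <- sum(as.numeric(rows_x[shared]) * as.numeric(rows_y[shared]))

join_rows   # 500 intermediate rows (5 IDs * 10 * 10)
nrow(df)    # 50 rows survive the filter
```

With 10 rows per ID on each side, the join is already ten times larger than the final result, and the intermediate size grows with the product of the per-ID row counts rather than their sum.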

Recommended answer

In my opinion, the data.table package is best suited for this job:

library(data.table)
setDT(df)
setDT(df_alt)

df_alt[df
       , on = .(ID, loc)
       , roll = "nearest"
       , .(ID, loc.x = i.loc, loc.y = x.loc, value, delta = abs(i.loc - x.loc))]

This gives:

    ID loc.x loc.y     value delta
 1:  1    10     7 0.8744985     3
 2:  1    20     7 0.8744985    13
 3:  1    30    36 0.4724253     6
 4:  1    40    38 0.2016645     2
 5:  1    50    53 0.4750352     3
 6:  1    60    53 0.4750352     7
 7:  1    70    72 0.4750352     2
 8:  1    80    74 0.4724253     6
 9:  1    90    92 0.3202490     2
10:  1   100    95 0.2016645     5
11:  2    10    10 0.2016645     0
12:  2    20    25 0.3202490     5
13:  2    30    31 0.2016645     1
14:  2    40    31 0.2016645     9
15:  2    50    52 0.8744985     2
16:  2    60    60 0.4724253     0
17:  2    70    62 0.4750352     8
18:  2    80    87 0.4750352     7
19:  2    90    87 0.4750352     3
20:  2   100    87 0.4750352    13
21:  3    10     7 0.8744985     3
22:  3    20     7 0.8744985    13
23:  3    30    36 0.4724253     6
24:  3    40    38 0.2016645     2
25:  3    50    51 0.2016645     1
26:  3    60    53 0.4750352     7
27:  3    70    53 0.4750352    17
28:  3    80    87 0.3202490     7
29:  3    90    91 0.8744985     1
30:  3   100    91 0.8744985     9
31:  4    10    10 0.2016645     0
32:  4    20    11 0.8744985     9
33:  4    30    11 0.8744985    19
34:  4    40    61 0.3202490    21
35:  4    50    61 0.3202490    11
36:  4    60    61 0.3202490     1
37:  4    70    72 0.4750352     2
38:  4    80    74 0.4724253     6
39:  4    90    92 0.3202490     2
40:  4   100    95 0.2016645     5
41:  5    10     3 0.3202490     7
42:  5    20    25 0.3202490     5
43:  5    30    31 0.2016645     1
44:  5    40    43 0.4724253     3
45:  5    50    51 0.2016645     1
46:  5    60    60 0.4724253     0
47:  5    70    62 0.4750352     8
48:  5    80    91 0.8744985    11
49:  5    90    91 0.8744985     1
50:  5   100    91 0.8744985     9
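In a data.table rolling join, the roll is applied to the last column listed in on, while the columns before it are matched exactly. That is what makes ID an exact match and loc a nearest match here. The toy tables below (x, i and res are made-up names and values, not part of the original data) are a small sketch of that behaviour:

```r
library(data.table)

# Lookup table: two candidate loc values per ID
x <- data.table(ID    = c(1, 1, 2, 2),
                loc   = c(5, 41, 30, 60),
                value = c("a", "b", "c", "d"))

# Query table: one loc to look up per ID
i <- data.table(ID  = c(1, 2),
                loc = c(10, 40))

# ID is matched exactly; loc (the last 'on' column) rolls to the nearest value
res <- x[i,
         on = .(ID, loc),
         roll = "nearest",
         .(ID, loc.x = i.loc, loc.y = x.loc, value)]
res
```

For ID 2, the query loc of 40 is matched to 30 rather than to 41, because 41 belongs to ID 1 and is never considered. Note also that in the outputs above the two approaches disagree in row 27 (ID 3, loc.x 70): 53 and 87 are both 17 away, so either one is a valid nearest match.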
