R: Fuzzy merge using agrep and data.table
Problem description
I am trying to merge two data.tables, but because the stock names are spelled differently in each, I lose a substantial number of data points. Instead of an exact match, I am therefore looking into a fuzzy merge.
library("data.table")
dt1 = data.table(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2))
dt2 = data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))
When merging dt1 and dt2 on "Name", ASML HOLDING is dropped because of the added "NV", even though the underlying data for that row is correct.
The preferred final output would look something like:
Name A B
1: ABN AMRO GROUP 2 q
2: ASML HOLDING NV 1 p
What I tried next was the following:
dt1 = dt1[, dt2_NAME := agrep(dt1$Name, dt2$Name, ignore.case = TRUE, value = TRUE, max.distance = 0.05, useBytes = TRUE)]
However, this produces the following error:
argument 'pattern' has length > 1 and only the first element will be used
The error makes sense, as dt1$Name has more than one element, but I believe this could work if agrep were applied to dt1$Name row by row.
It might be a stupid mistake, but for some reason I just can't get my head around it. I would prefer to stay with data.table, as my dataset is fairly large and so far it has worked splendidly. I am also new to Stack Overflow, so apologies if my question is somewhat off.
Lastly, I found a piece of code (from the question "Fuzzy merge in R") that does the job, but it is too slow for practical use.
dt1$Name_dt2 <- "" # Creating an empty column
for(i in 1:dim(dt1)[1]) {
x <- agrep(dt1$Name[i], dt2$Name,
ignore.case=TRUE, value=TRUE,
max.distance = 0.05, useBytes = TRUE)
x <- paste0(x,"")
dt1$Name_dt2[i] <- x
}
Recommended answer
A possible solution using 'fuzzyjoin':
library(dplyr)      # provides %>% and filter()
library(fuzzyjoin)
f <- Vectorize(function(x,y) agrepl(x, y,
ignore.case=TRUE,
max.distance = 0.05, useBytes = TRUE))
dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)
# Name.x A Name.y B
#1 ASML HOLDING 1 ASML HOLDING NV p
#2 ABN AMRO GROUP 2 ABN AMRO GROUP q
NOTE: The main problem, which you also ran into, is that agrep and agrepl do not seem to expect the first argument to be a vector. That is why I wrapped the call in Vectorize.
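To make the note above concrete, here is a minimal base R sketch of the difference between calling agrepl directly and calling it through Vectorize. The vectors x and y and the result names plain and pairwise are made up for illustration; the agrepl arguments are the ones used in the answer.

```r
x <- c("ASML HOLDING", "ABN AMRO GROUP")
y <- c("ASML HOLDING NV", "ABN AMRO GROUP")

# Called directly, agrepl takes only x[1] as the pattern and warns about it,
# so every element of y is compared against "ASML HOLDING".
plain <- suppressWarnings(
  agrepl(x, y, ignore.case = TRUE, max.distance = 0.05, useBytes = TRUE)
)

# Vectorize wraps the function in mapply, so x[i] is compared with y[i].
f <- Vectorize(function(x, y) agrepl(x, y,
                                     ignore.case = TRUE,
                                     max.distance = 0.05, useBytes = TRUE))
pairwise <- unname(f(x, y))  # drop the names that mapply attaches

print(plain)
print(pairwise)
```

fuzzy_inner_join needs the pairwise behaviour, because it hands match_fun two equally long vectors of candidate pairs and expects one TRUE or FALSE per pair.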
This method can be combined with an equi-join (mind the order of the columns in by!):
dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))
dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)
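Since the question asks for data.table, the row-by-row matching can also be sketched without fuzzyjoin. This is only an illustration under some assumptions: first_fuzzy_match is a hypothetical helper (not part of any package), it keeps only the first approximate match per name, and it does the same amount of agrep work as the loop in the question, so it may still be slow on large tables.

```r
library(data.table)

dt1 <- data.table(Name = c("ASML HOLDING", "ABN AMRO GROUP"), A = c(1, 2))
dt2 <- data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

# Hypothetical helper: return the first approximate match of a single
# pattern among the candidates, or NA if nothing matches.
first_fuzzy_match <- function(pattern, candidates) {
  hits <- agrep(pattern, candidates, ignore.case = TRUE, value = TRUE,
                max.distance = 0.05, useBytes = TRUE)
  if (length(hits) > 0) hits[1] else NA_character_
}

# Apply the helper to each name in dt1, one pattern at a time.
dt1[, Name_dt2 := vapply(Name, first_fuzzy_match, character(1),
                         candidates = dt2$Name, USE.NAMES = FALSE)]

# Exact merge on the matched name; the original dt1 spelling stays in Name.
merged <- merge(dt1, dt2, by.x = "Name_dt2", by.y = "Name")
print(merged)
```

Names in dt1 that have no approximate match get NA in Name_dt2 and drop out of the inner merge; passing all.x = TRUE to merge would keep them.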