R String match for address using stringdist, stringdistmatrix


Problem Description

I have two large datasets, one of around half a million records and the other of around 70K. These datasets contain addresses. I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you would imagine, addresses can be written in different ways and in different cases, so it is quite annoying to see no match where there should be one, and a match where there should not be. I did some research and figured out that the package stringdist can be used. However, I am stuck: I feel I am not using it to its fullest capabilities, and some suggestions on this would help.

Below is some sample dummy data, along with the code I have created, to explain the situation:

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
df1 <- data.table(Address1)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
df2 <- data.table(Address2)

df1[, key_match := gsub("[^[:alnum:]]", "", Address1)]
df2[, key_match := gsub("[^[:alnum:]]", "", Address2)]

fn_match = function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}

df1[!is.na(key_match)
       , address_match := 
      fn_match(key_match, df2$key_match,3)
       ]

If you look at the output, it gives me the matches under address_match in df1. When I apply the same code to my main data, though, it has now been running for the last 30 hours, even though I have converted to data.table. I am not sure how I can speed this up.
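
One lever that might help before restructuring anything: in recent versions of stringdist, amatch() accepts an nthread argument, so the lookup can spread the distance computations over several CPU cores. A minimal sketch (fn_match_mt is a hypothetical variant of the function above, not part of the original question):

# Sketch only: same matching logic as fn_match above, but asking
# stringdist to use multiple threads; nthread is an amatch() argument
# in recent stringdist versions (an assumption worth verifying).
fn_match_mt <- function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist = n,
                useBytes = TRUE, nthread = parallel::detectCores() - 1)]
}

df1[!is.na(key_match),
    address_match := fn_match_mt(key_match, df2$key_match, 3)]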

I was doing further reading and came across stringdistmatrix. This seems more helpful: I could split each address on spaces, check for the presence of each word in each address of the other list, and create a summary of matches based on the maximum number of matching words. However, I am not very good with loops. How do I loop through each address in the smaller file, word by word, check it against the individual addresses in the larger file, and create a matrix of matches? Any help!!
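
Since the question asks how to build a matrix of matches, here is a minimal sketch of the whole-key version of that idea on the dummy data above, using stringdistmatrix() (matrix_match and matrix_dist are hypothetical column names). It computes every pairwise Damerau-Levenshtein distance in one call and keeps the closest target per row; note that memory grows as nrow(df1) * nrow(df2), so the real 500K x 70K problem would need chunking:

# Sketch: one distance per (df1, df2) pair, then the best hit per row.
d <- stringdistmatrix(df1$key_match, df2$key_match, method = "dl")
best <- apply(d, 1, which.min)            # closest df2 row for each df1 row
df1[, matrix_match := df2$Address2[best]] # matched address
df1[, matrix_dist  := d[cbind(.I, best)]] # its distance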

Solution

I have a solution that does not require data.table; if the set is huge, it can be run in parallel with package:parallel.

library(jsonlite)   # rbind.pages()
library(dplyr)      # %>% and select()

rbind.pages(
  parallel::mclapply(Address1, function(i){
    # keep the closest Address2 entry by generalized edit distance
    data.frame(
      src   = i,
      match = Address2[which.min(adist(i, Address2))]
    )
  }, mc.cores = parallel::detectCores() - 2)) %>%
  select(`src (Address1)` = 1, `match (Address2)` = 2)

Which then gives the following output:

                          src (Address1)                     match (Address2)
1                    786, GALI NO 5, XYZ                   786, GALI NO 4 XYZ
2       rambo, 45, strret 4, atlast, pqr del, 546, strret2, towards east, pqr
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR                  23/4, STREET 2, PQR
4                    45-B, GALI NO5, XYZ                  45B, GALI NO 5, XYZ
5                 HECTIC, 99 STREET, PQR                  23/4, STREET 2, PQR
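
One portability caveat, not stated in the original answer: parallel::mclapply relies on forking, so on Windows it silently falls back to a single core. A hedged sketch of the same loop on a PSOCK cluster, which works cross-platform:

# Cross-platform variant: parLapply on a PSOCK cluster instead of
# forking; Address2 must be shipped to the workers explicitly.
cl <- parallel::makeCluster(parallel::detectCores() - 2)
parallel::clusterExport(cl, "Address2")
res <- parallel::parLapply(cl, Address1, function(i) {
  data.frame(src = i, match = Address2[which.min(adist(i, Address2))])
})
parallel::stopCluster(cl)
matched <- do.call(rbind, res)   # or rbind.pages(res)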

Edit:

I realized that this may not be very helpful without seeing the distance computations, which you may want to tweak for your needs; so I replicated the data into larger random sets and then amended the function to show the string distance computations and the processing time.
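
For reference, adist(x, y, counts = TRUE) attaches a "counts" attribute: a length(x) x length(y) x 3 array of insertion, deletion, and substitution counts, which drop() collapses to an n x 3 matrix; that is where the ins, del, and sub columns selected below come from. A quick illustration on the dummy data:

# The per-operation counts behind the ins/del/sub columns used below
cnt <- drop(attr(adist("45-B, GALI NO5, XYZ", Address2, counts = TRUE), "counts"))
head(cnt)      # one row per target: ins, del, sub counts
rowSums(cnt)   # total edits, equal to adist("45-B, GALI NO5, XYZ", Address2)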

rand_addy_one <- rep(Address1, 1000)[sample(1:1000, 1000)]
rand_addy_two <- rep(Address2, 3000)[sample(1:3000, 3000)]


system.time({
  test_one <<- rbind.pages(parallel::mclapply(rand_addy_one, function(i) {
    # per-operation edit counts (ins/del/sub) of i against every target
    calc <- as.data.frame(drop(attr(adist(i, rand_addy_two, counts = TRUE), "counts")))
    calc$totals <- rowSums(calc)
    calc %>% mutate(src = i, target = rand_addy_two) %>%
      filter(totals == min(totals))
  }, mc.cores = parallel::detectCores() - 2)) %>%
    select(`source Address1` = src, `target Address2(matched)` = target,
           insertions = ins, deletions = del, substitutions = sub,
           total_approx_dist = totals)
})

   user  system elapsed 
 24.940   1.480   3.384 

> nrow(test_one)
[1] 600000

Now to reverse and apply the larger set to the smaller:

system.time({
  test_two <<- rbind.pages(parallel::mclapply(rand_addy_two, function(i) {
    calc <- as.data.frame(drop(attr(adist(i, rand_addy_one, counts = TRUE), "counts")))
    calc$totals <- rowSums(calc)
    calc %>% mutate(src = i, target = rand_addy_one) %>%
      filter(totals == min(totals))
  }, mc.cores = parallel::detectCores() - 2)) %>%
    select(`source Address2` = src, `target Address1(matched)` = target,
           insertions = ins, deletions = del, substitutions = sub,
           total_approx_dist = totals)
})

   user  system elapsed 
 27.512   1.280   4.077 

> nrow(test_two)
[1] 720000
