How can I match fuzzy match strings from two datasets?


Question

I've been working on a way to join two datasets based on an imperfect string, such as a company name. In the past I had to match two very dirty lists: one had names and financial information, the other had names and addresses. Neither had unique IDs to match on! Assume that cleaning has already been applied and that there may still be typos and insertions.

So far agrep is the closest tool I've found that might work. It is a base R function that uses a generalized Levenshtein distance, which counts the deletions, insertions and substitutions needed to turn one string into another, and it returns the strings that fall within a given maximum distance (the most similar ones).

However, I've been having trouble going from matching a single value to applying this across an entire data frame. I've crudely used a for loop to repeat the agrep call, but there has to be an easier way.

See the code below:

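# Two example data frames with no shared key other than an imperfect name
a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'), price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'), qty = c(9, 99, 10))

# For each name in a, look up the approximately matching entry in b
for (i in seq_len(nrow(a))) {
    a$x[i] = agrep(a$name[i], b$name, value = TRUE, max.distance = list(del = 0.2, ins = 0.3, sub = 0.4))  # matched name
    a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max.distance = list(del = 0.2, ins = 0.3, sub = 0.4)) # index in b
}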
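One way to drop the explicit loop, sketched here under a few assumptions, is to wrap agrep in a small helper and apply it with vapply. The helper name first_match and the columns b_idx and b_name are hypothetical. Keeping only the first hit when several candidates match (and NA when none do) is an assumed policy, not something agrep does on its own.

```r
a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"),
                price = c(10, 13, 2, 1, 15, 1), stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"),
                qty = c(9, 99, 10), stringsAsFactors = FALSE)

# Hypothetical helper: index of the first approximate match, or NA if there is none
first_match <- function(pattern, candidates, ...) {
  idx <- agrep(pattern, candidates, ...)
  if (length(idx) == 0L) NA_integer_ else idx[1L]
}

# Apply the helper to every name in a; extra arguments are passed on to agrep()
a$b_idx <- vapply(a$name, first_match, integer(1), candidates = b$name,
                  max.distance = list(deletions = 0.2, insertions = 0.3,
                                      substitutions = 0.4),
                  USE.NAMES = FALSE)
a$b_name <- b$name[a$b_idx]

# Join on the matched name; all.x = TRUE keeps rows of a that found no match
merged <- merge(a, b, by.x = "b_name", by.y = "name", all.x = TRUE)
merged
```

Unlike the loop, this does not fail when agrep returns zero or several matches for a name, because the helper always returns exactly one value.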

Recommended answer

The solution depends on the desired cardinality of your matching from a to b. If it's one-to-one, you will get the three closest matches above (one for each name in b). If it's many-to-one, you will get six (one for each name in a).

One-to-one case (requires an assignment algorithm):

When I've had to do this before, I've treated it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution you'd be better off with optim.

I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form: one row per (a, b) pair
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String distance (use your favorite function here)

# Greedy assignment heuristic (your favorite heuristic here)
# a, b: the two names of each candidate pair; d: the distance of each pair
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assignment status: 0 for unassigned but assignable,
  # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1]
    b_sel <- b[d==min_d & a == a_sel & x==0][1]
    x[a==a_sel & b == b_sel] <- 1 # assign the selected pair
    x[x==0 & (a==a_sel|b==b_sel)] <- -1 # rule out remaining pairs that reuse either name
  }
  data.frame(a=a[x==1],b=b[x==1],d=d[x==1]) # a data frame keeps d numeric; cbind would coerce it to character
}
greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist)

This produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
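For comparison with the greedy heuristic, an optimal one-to-one assignment could be sketched with a linear assignment solver. This is a swapped-in technique, not the answer's method: it assumes the clue package is installed and uses its solve_LSAP() function, along with stringdistmatrix() from stringdist. The names a_names, b_names, cost and optimal are illustrative.

```r
library(stringdist)
library(clue)  # assumption: clue is installed; it provides solve_LSAP()

a_names <- c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")

# Rows are b and columns are a, because solve_LSAP() needs
# at least as many columns as rows
cost <- stringdistmatrix(b_names, a_names, method = "jw")

# For each row (name in b), the column (name in a) that minimises total distance
sol <- solve_LSAP(cost)
a_idx <- as.integer(sol)

optimal <- data.frame(a = a_names[a_idx],
                      b = b_names,
                      d = cost[cbind(seq_along(a_idx), a_idx)])
optimal
```

On this small example each name in b has a distinct closest name in a, so the optimal assignment should coincide with the greedy one. The two can differ when several names compete for the same partner.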

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

This produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333
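The same many-to-one lookup can also be sketched with amatch() from stringdist, which returns, for each name in a, the index of the closest name in b. The maxDist value of 0.5 is an illustrative threshold chosen for this example, and the column names b_idx and joined are hypothetical.

```r
library(stringdist)

a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"),
                price = c(10, 13, 2, 1, 15, 1), stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"),
                qty = c(9, 99, 10), stringsAsFactors = FALSE)

# Index in b of the closest name for each name in a;
# NA when no candidate is within maxDist (0.5 is an assumed cutoff)
a$b_idx <- amatch(a$name, b$name, method = "jw", maxDist = 0.5)

# Use the index to pull the matching columns of b next to a
joined <- data.frame(a_name = a$name,
                     price  = a$price,
                     b_name = b$name[a$b_idx],
                     qty    = b$qty[a$b_idx])
joined
```

This gives the joined data frame directly, at the cost of having to pick a distance cutoff below which a match is accepted.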

Edit: use method="jw" to produce the desired results. See help("stringdist-package").
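One plausible reading of this note is that the choice of distance changes which candidate counts as closest. The sketch below compares plain Levenshtein ("lv") with "jw" for the name Bcy. The variable names are illustrative.

```r
library(stringdist)

b_names <- c("Ace Co.", "Bayes Inc.", "asdf")

# Levenshtein counts raw edits, so long candidates are penalised for their length
lv <- stringdist("Bcy", b_names, method = "lv")

# "jw" is scaled to [0, 1] and rewards shared characters in similar positions
jw <- stringdist("Bcy", b_names, method = "jw")

b_names[which.min(lv)]  # closest candidate under Levenshtein
b_names[which.min(jw)]  # closest candidate under "jw"
```

With these inputs, Levenshtein favours the short but unrelated asdf, because turning Bcy into Bayes Inc. needs many insertions, while "jw" picks Bayes Inc., in line with the many-to-one table above.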
