Approximate string matching within a single list - R


Question

I have a data frame with a column containing thousands of names. Many of the names differ from one another only slightly (punctuation, capitalization, suffixes), so the same person appears under several spellings. I would like to find a way to match these names. For example:

names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.')

I've looked at amatch in the stringdist package, as well as agrep, but both require a master list of names to match another list against. In my case, I don't have such a master list, so I'd like to create one from the data by identifying names with highly similar patterns. I can then look at them and decide whether they're the same person (which in many cases they are). I'd like a new column that flags likely matches, and perhaps a similarity score based on Levenshtein distance or something similar. Maybe something like this:

            names   match      SimilarityScore
1       jon smith     a               9
2      jon, smith     a               8
3       Jon Smith     a               9
4 jon smith et al     a               5
5       bob seger     b               9
6      bob, seger     b               8
7     bobby seger     b               7
8   bob seger jr.     b               5

Is this possible?

Recommended answer

Drawing upon the post found here, I found that hierarchical text clustering does what I'm looking for.

names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied')

# Levenshtein Distance
e  <- adist(names)
rownames(e) <- names
hc <- hclust(as.dist(e))
plot(hc)
rect.hclust(hc,k=3) #the k value provides the number of clusters
df <- data.frame(names,cutree(hc,k=3))

The output looks really good if you pick the right number of clusters (three in this case):

                       names             cutree.hc..k...3.
jon smith             jon smith                 1
jon, smith           jon, smith                 1
Jon Smith             Jon Smith                 1
jon smith et al jon smith et al                 1
bob seger             bob seger                 2
bob, seger           bob, seger                 2
bobby seger         bobby seger                 2
bob seger jr.     bob seger jr.                 2
jake                       jake                 3
jakey                     jakey                 3
jack                       jack                 3
jakeyfied             jakeyfied                 3
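The question also asked for a match label and a similarity score. One way to get both from the same clustering is sketched below. It is not part of the original answer: picking each cluster's medoid (the member with the smallest total distance to the others) as its representative, and scaling the score as 1 minus the edit distance divided by the longer string's length, are assumptions made for illustration. The names name_list, medoid_of, and result are hypothetical.

```r
# Same example names as above, stored under a different variable name.
name_list <- c('jon smith', 'jon, smith', 'Jon Smith', 'jon smith et al',
               'bob seger', 'bob, seger', 'bobby seger', 'bob seger jr.',
               'jake', 'jakey', 'jack', 'jakeyfied')

d <- adist(name_list)                 # pairwise Levenshtein distances
hc <- hclust(as.dist(d))
cluster <- cutree(hc, k = 3)

# Assumption: represent each cluster by its medoid, i.e. the member
# whose summed distance to the other members is smallest.
medoid_of <- function(members) {
  sub <- d[members, members, drop = FALSE]
  members[which.min(rowSums(sub))]
}
medoids <- tapply(seq_along(name_list), cluster, medoid_of)
rep_idx <- as.integer(medoids[as.character(cluster)])

# Assumption: similarity = 1 - distance / length of the longer string,
# so 1 means identical to the representative and 0 means nothing shared.
dist_to_rep <- d[cbind(seq_along(name_list), rep_idx)]
max_len <- pmax(nchar(name_list), nchar(name_list[rep_idx]))
similarity <- 1 - dist_to_rep / max_len

result <- data.frame(names = name_list,
                     match = letters[cluster],
                     representative = name_list[rep_idx],
                     similarity = round(similarity, 2))
print(result)
```

Sorting result by match and then by similarity puts the least certain members of each group at the bottom, which is where manual review is most useful.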

However, names are often more complex than this, and after adding a few more difficult names, I found that the default adist options didn't give the best clustering:

names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied','1234 ranch','5678 ranch','9983','7777')

d  <- adist(names)
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6)

I was able to improve on this by raising the substitution cost to 2, leaving the insertion and deletion costs at 1, and ignoring case. This helped to minimize the mistaken grouping of totally different four-character number strings, which I didn't want grouped:

d  <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2)) #i=insertion, d=deletion s=substitution
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6)
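To see why the higher substitution cost helps, it can be useful to compare a few pairs under both cost settings. The sketch below is an illustration rather than part of the original answer; pairs, compare, and cost_table are hypothetical names. Strings that share no characters, such as '9983' and '7777', are aligned purely by substitution, so their distance doubles, while a pair that differs by a single inserted comma stays at distance 1.

```r
# Pairs chosen from the example data: two that should NOT be grouped
# and one that should.
pairs <- list(c("9983", "7777"),
              c("1234 ranch", "5678 ranch"),
              c("jon smith", "jon, smith"))

# Distance under default costs (all 1) and under substitution cost 2.
compare <- function(p) {
  c(default  = adist(p[1], p[2])[1, 1],
    weighted = adist(p[1], p[2], costs = c(i = 1, d = 1, s = 2))[1, 1])
}

cost_table <- t(sapply(pairs, compare))
rownames(cost_table) <- sapply(pairs, paste, collapse = " vs ")
print(cost_table)
```

With a substitution cost of 2, a substitution is never cheaper than a deletion plus an insertion, so the distance effectively counts the characters the two strings do not share.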

I further fine-tuned the clustering by removing common terms such as "ranch" and "et al" with gsub (a base R function, documented on the same help page as grep) and increasing the number of clusters by one:

names<-gsub("ranch","",names)
names<-gsub("et al","",names)
d  <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2))
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=7)
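Note that removing a term with gsub leaves the surrounding whitespace behind (for example '1234 ranch' becomes '1234 ' with a trailing space), and that space still counts toward the edit distance. A slightly fuller clean-up step might look like the sketch below. This goes beyond the original answer: the helper clean_names and its stop_terms argument are hypothetical, and lowercasing and stripping punctuation are assumptions about what counts as noise in the data.

```r
# Hypothetical helper: normalize names before computing distances.
clean_names <- function(x, stop_terms = c("ranch", "et al")) {
  x <- tolower(x)                                   # ignore case
  x <- gsub("[[:punct:]]", "", x)                   # drop commas, periods, etc.
  pattern <- paste0("\\b(", paste(stop_terms, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # drop common terms
  x <- gsub("\\s+", " ", x)                         # collapse repeated spaces
  trimws(x)                                         # drop leading/trailing spaces
}

raw <- c("jon, smith", "Jon Smith et al", "1234 ranch", "bob seger jr.")
cleaned <- clean_names(raw)
print(cleaned)
```

Keeping the original names in one column and clustering on the cleaned copy preserves the raw values for review.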

Although there are methods that let the data determine the best number of clusters instead of picking it manually, I found trial and error easiest. There is more information about the data-driven approach here.
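As one example of a data-driven choice, the average silhouette width can be computed for each candidate number of clusters and the largest value taken. This is a sketch of a common heuristic, not necessarily the method the linked post describes. It assumes the cluster package (a recommended package shipped with most R installations) is available, and name_list, candidate_k, avg_width, and best_k are hypothetical names.

```r
library(cluster)  # provides silhouette()

name_list <- c('jon smith', 'jon, smith', 'Jon Smith', 'jon smith et al',
               'bob seger', 'bob, seger', 'bobby seger', 'bob seger jr.',
               'jake', 'jakey', 'jack', 'jakeyfied',
               '1234 ranch', '5678 ranch', '9983', '7777')

d <- as.dist(adist(name_list, ignore.case = TRUE,
                   costs = c(i = 1, d = 1, s = 2)))
hc <- hclust(d)

# Try every k from 2 up to one less than the number of names and
# record the mean silhouette width of the resulting partition.
candidate_k <- 2:(length(name_list) - 1)
avg_width <- sapply(candidate_k, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})

best_k <- candidate_k[which.max(avg_width)]
print(data.frame(k = candidate_k, avg_width = round(avg_width, 3)))
print(best_k)
```

The suggested k is only a starting point. With short strings and many near-duplicates, inspecting the dendrogram around that value is still worthwhile.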
