R: I have to do Softmatch in String


Problem description

I need to do a soft match on one column of a data frame against a given input string, for example:

col <- c("John Collingson","J Collingson","Dummy Name1","Dummy Name2")

inputText <- "J Collingson"
# and vice versa
inputText <- "John Collingson"

I want to retrieve both "John Collingson" and "J Collingson" from the provided vector "col".

Kindly help

Solution

agrep is definitely a quick and easy base R solution if you have just a bit of data. If this is just a toy example of a larger data frame, you may be interested in a more durable tool. In the past month, learning about the Levenshtein distance noted by @PaulHiemstra (also in these different questions) led me to the RecordLinkage package. The vignettes leave me wanting more examples of "soft" or "fuzzy" matches, particularly across more than one field, but the basic answer to your question could be something like:

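library(RecordLinkage)
# Candidate names and the input string, each as a one-column data frame
col <- data.frame(names1 = c("John Collingson","J Collingson","Dummy Name1","Dummy Name2"))
inputText <- data.frame(names2 = c("J Collingson"))
# Compare every input/candidate pair using string similarity instead of exact equality
g1 <- compare.linkage(inputText, col, strcmp = TRUE)
# Compute a match weight for each pair
g2 <- epiWeights(g1)
# Show the pairs whose weight is at least 0.6
getPairs(g2, min.weight = 0.6)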
# id          names2 Weight
# 1  1    J Collingson       
# 2  2    J Collingson  1.000
# 3                          
# 4  1    J Collingson       
# 5  1 John Collingson  0.815

# Same comparison with a misspelled input
inputText2 <- data.frame(names2 = c("Jon Collinson"))
g1 <- compare.linkage(inputText2, col, strcmp = TRUE)
g2 <- epiWeights(g1)
getPairs(g2, min.weight = 0.6)
# id          names2    Weight
# 1  1   Jon Collinson          
# 2  1 John Collingson 0.9644444
# 3                             
# 4  1   Jon Collinson          
# 5  2    J Collingson 0.7924825

Start with compare.linkage() or compare.dedup(); for large data sets, use RLBigDataLinkage() or RLBigDataDedup() instead. Hope this helps.
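
For a small vector like the one in the question, the base R route mentioned at the start of this answer might look like the sketch below. It uses adist() from the utils package to compute the Levenshtein distance between the input and each candidate, then keeps the candidates within a threshold. The helper name soft_match and the threshold of 3 edits are assumptions chosen for this example, not part of the original answer; the threshold would need tuning on real data.

```r
# Candidate names from the question
col <- c("John Collingson", "J Collingson", "Dummy Name1", "Dummy Name2")

# Hypothetical helper: keep candidates whose Levenshtein distance
# to the input is at most max_dist (3 is an assumed threshold)
soft_match <- function(input, candidates, max_dist = 3) {
  # adist() returns a matrix; row 1 holds the distances for the single input
  d <- adist(input, candidates)[1, ]
  candidates[d <= max_dist]
}

soft_match("J Collingson", col)
soft_match("John Collingson", col)

# agrep() does approximate matching of the pattern *within* each string,
# so results can depend on which name is used as the pattern
agrep("J Collingson", col, value = TRUE)
```

Because the plain Levenshtein distance is symmetric, the adist() version gives the same answer whichever of the two spellings is used as the input, which is what the "vice versa" requirement in the question asks for.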
