Approximate String Matching in R


Question

For my research I have to match two data sets containing fund information. Unfortunately, there is no common identifier. The good news is that both data sets have an identifier for the document number; however, a document can contain multiple funds. If a document contains multiple funds (e.g. 20), I can only match on the fund's name, which sometimes differs slightly between the two data sets. Note that the number of funds per document is identical in both data sets. After searching a little, I tried this function (found here: agrep: only return best match(es)):

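library(RecordLinkage)  # provides levenshteinSim()

# Return the element(s) of stringVector most similar to string
ClosestMatch2 <- function(string, stringVector) {
  distance <- levenshteinSim(string, stringVector)
  stringVector[distance == max(distance)]
}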

This worked fine for most funds, but I discovered two problems:

  1. Sometimes there are multiple matches.
  2. Sometimes I get a wrong match.

For example, this function matched "INSTITUTIONAL LARGE CORE FUND" to "Transamerica Partners Institutional Core Bond" instead of "Transamerica Partners Institutional Large Core".

I have two ideas to circumvent these problems:

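  1. Use another matching function to validate the function above, i.e. accept a match only if both functions yield the same result.
  2. Adjust the function above in some way.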
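
A minimal sketch of the first idea, under a few assumptions that are not in the question: the second measure is the Jaro-Winkler similarity (`jarowinkler()` from the RecordLinkage package), names are compared case-insensitively, and the helper name `AgreedMatch` is hypothetical. A match is accepted only when both measures pick the same single candidate.

```r
library(RecordLinkage)  # levenshteinSim(), jarowinkler()

# Hypothetical helper: return the best match only when two similarity
# measures agree on one candidate; otherwise return NA.
AgreedMatch <- function(string, stringVector) {
  # Compare in upper case (an assumption: the two data sets differ in case)
  s <- rep(toupper(string), length(stringVector))
  v <- toupper(stringVector)

  lev <- levenshteinSim(s, v)  # similarity in [0, 1]
  jw  <- jarowinkler(s, v)     # similarity in [0, 1]

  bestLev <- which(lev == max(lev))
  bestJw  <- which(jw == max(jw))

  # Accept only a unique candidate chosen by both measures
  agreed <- intersect(bestLev, bestJw)
  if (length(agreed) == 1) stringVector[agreed] else NA_character_
}

candidates <- c("Transamerica Partners Institutional Core Bond",
                "Transamerica Partners Institutional Large Core")
AgreedMatch("INSTITUTIONAL LARGE CORE FUND", candidates)
```

Names for which the helper returns NA could then be reviewed by hand instead of being matched automatically.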

I would really appreciate your help. Best, Laurenz

Recommended answer

The RecordLinkage package lets you match strings using several approaches (e.g. Levenshtein, but also other measures). It also lets you define thresholds, or even use a classification model, to indicate when a match is acceptable.
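
As a rough illustration of the threshold idea, the sketch below scores every candidate with `levenshteinSim()` and keeps only those at or above a cut-off. The helper name `ScoredMatches`, the upper-casing step, and the default threshold of 0.4 are assumptions made for this example, not recommended values; the classification-model route is not shown here.

```r
library(RecordLinkage)  # levenshteinSim()

# Hypothetical helper: list candidates whose similarity to `string`
# reaches `threshold`, best match first.
ScoredMatches <- function(string, stringVector, threshold = 0.4) {
  # Case-insensitive comparison (an assumption about the data)
  sim <- levenshteinSim(rep(toupper(string), length(stringVector)),
                        toupper(stringVector))

  out <- data.frame(candidate = stringVector,
                    similarity = sim,
                    stringsAsFactors = FALSE)

  # Drop candidates below the threshold, then sort by similarity
  out <- out[out$similarity >= threshold, , drop = FALSE]
  out[order(-out$similarity), , drop = FALSE]
}

candidates <- c("Transamerica Partners Institutional Core Bond",
                "Transamerica Partners Institutional Large Core")
ScoredMatches("INSTITUTIONAL LARGE CORE FUND", candidates)
```

An empty result means no candidate was similar enough, and several rows with close scores flag an ambiguous case. A suitable threshold would have to be tuned on the actual fund names.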
