Merging two Data Frames using Fuzzy/Approximate String Matching in R


Question

Description

I have two datasets with information that I need to merge. The only common fields I have are strings that do not match perfectly and a numerical field that can differ substantially between the two files.

The only way to explain the problem is to show you the data. Here are a.csv and b.csv. I am trying to merge B into A.

There are three fields in B and four in A: Company Name (File A only), Fund Name, Asset Class, and Assets. So far, my focus has been on matching the Fund Names by replacing words or parts of the strings to create exact matches, and then using:

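# Read both files; TRUE/FALSE are spelled out because T and F can be reassigned
a <- read.table(file = "http://bertelsen.ca/R/a.csv", header = TRUE, sep = ",", na.strings = FALSE, strip.white = TRUE, blank.lines.skip = FALSE, stringsAsFactors = TRUE)
b <- read.table(file = "http://bertelsen.ca/R/b.csv", header = TRUE, sep = ",", na.strings = FALSE, strip.white = TRUE, blank.lines.skip = FALSE, stringsAsFactors = TRUE)
# Inner join on exact Fund.Name matches only
merge(a, b, by = "Fund.Name")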

However, this only gets me to about 30% matching. The rest I have to do by hand.

Assets is a numerical field that is not always correct in either file and can vary wildly if the fund has low assets. Asset Class is a string field that is "generally" the same in both files; however, there are discrepancies.

Adding to the complication are the different series of funds in File B. For example:

AGF Canadian Value

AGF Canadian Value-D

In these cases, I have to choose the one that has no series suffix, or the one called "A", "-A", or "Advisor", as the match.

The Question

What would you say is the best approach? This exercise is something that I have to do on a monthly basis, and matching the funds manually is incredibly time consuming. Examples of code would be instrumental.

Ideas

One method that I think may work is normalizing the strings based on the first capitalized letter of each word in the string. But I haven't been able to figure out how to pull that off using R.
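A minimal sketch of that normalization idea in base R, for illustration only. The helper name `initials` is hypothetical, and the approach assumes every word of interest starts with an uppercase ASCII letter. Note that distinct fund names can collapse to the same key, so any matches built on it would still need checking.

```r
# Hypothetical helper: build a key from the first letter of every word
# that begins with an uppercase ASCII letter.
initials <- function(x) {
  # gregexpr() finds each capital that sits at a word boundary;
  # regmatches() extracts them as a list with one character vector per input.
  caps <- regmatches(x, gregexpr("\\b[A-Z]", x, perl = TRUE))
  # Collapse each vector of letters into a single string ("" if none found).
  vapply(caps, paste, character(1), collapse = "")
}

keys <- initials(c("AGF Canadian Value", "AGF Canadian Value-D"))
print(keys)
```

Because the hyphen counts as a word boundary, the series letter in "AGF Canadian Value-D" ends up in the key, which keeps the two series distinguishable.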

Another method I considered was creating an index of matches based on a combination of assets, fund name, asset class, and company. But again, I'm not sure how to do this with R or, for that matter, whether it's even possible.

Examples of code, comments, thoughts, and direction are greatly appreciated!

Recommended answer

Approximate string matching is not a good idea, since an incorrect match would invalidate the whole analysis. If the names from each source are the same each time, then building indexes seems the best option to me too. This is easily done in R:

Suppose you have the data:

a<-data.frame(name=c('Ace','Bayes'),price=c(10,13))
b<-data.frame(name=c('Ace Co.','Bayes Inc.'),qty=c(9,99))

Build an index of names for each source one time, perhaps using pmatch etc. as a starting point, and then validate it manually.

a.idx<-data.frame(name=c('Ace','Bayes'),idx=c(1,2))
b.idx<-data.frame(name=c('Ace Co.','Bayes Inc.'), idx=c(1,2))
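
As a rough sketch of how pmatch could seed such an index before manual validation: the third row of `b` and the `candidates` frame below are made up for illustration. pmatch only finds names in `b` that begin with a name from `a`, so it is a starting point rather than a full matcher.

```r
# Toy data modeled on the frames above; "Cauchy Ltd." is a made-up extra row.
a <- data.frame(name = c("Ace", "Bayes"), price = c(10, 13),
                stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "Cauchy Ltd."),
                qty = c(9, 99, 5), stringsAsFactors = FALSE)

# For each name in a, pmatch() returns the position of the unique name in b
# that starts with it, or NA if there is no match or the match is ambiguous.
pos <- pmatch(a$name, b$name)

# Candidate pairs to review by hand before assigning a shared idx.
candidates <- data.frame(a.name = a$name, b.name = b$name[pos],
                         stringsAsFactors = FALSE)
print(candidates)
```

Rows where `b.name` is NA are the ones pmatch could not resolve and would have to be indexed by hand.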

Then for each run merge using:

a.rich<-merge(a,a.idx,by="name")
b.rich<-merge(b,b.idx,by="name")
merge(a.rich,b.rich,by="idx")

This gives us:

  idx name.x price     name.y qty
1   1    Ace    10    Ace Co.   9
2   2  Bayes    13 Bayes Inc.  99
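
Since merge performs an inner join by default, a name that is missing from the index drops out of the result silently. A small sketch of one way to flag such names on each run, assuming the same toy layout as above (the "Cauchy" row and the `unindexed` variable are hypothetical):

```r
# Toy data: "Cauchy" is a made-up new name that has no index entry yet.
a <- data.frame(name = c("Ace", "Bayes", "Cauchy"), price = c(10, 13, 7),
                stringsAsFactors = FALSE)
a.idx <- data.frame(name = c("Ace", "Bayes"), idx = c(1, 2),
                    stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of a; rows without an index entry get idx = NA.
a.rich <- merge(a, a.idx, by = "name", all.x = TRUE)

# Names that still need to be added to the index by hand.
unindexed <- a.rich$name[is.na(a.rich$idx)]
print(unindexed)
```

Checking `unindexed` before the final merge keeps the manual work limited to names that are new since the last run.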
