R-基于不一致的全名格式的部分匹配来合并两个数据文件 [英] R - Merging two data files based on partial matching of inconsistent full name formats

查看:143
本文介绍了R-基于不一致的全名格式的部分匹配来合并两个数据文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是我以前的问题,以R格式重新发布.

Here is my previous question reposted with R format.

我正在寻找一种基于参与者全名的部分匹配来合并两个数据文件的方法,这些参与者的全名有时以不同的格式输入并且有时会拼写错误.我知道部分匹配(例如agrep和pmatch)和合并数据文件有一些不同的功能选项,但我需要a)结合两者的帮助; b)做可以忽略中间名的部分匹配; c)在合并数据文件中存储原始名称格式和d)即使它们不匹配,也保留唯一的值.

I'm looking for a way to merge two data files based on partial matching of participants' full names that are sometimes entered in different formats and sometimes misspelled. I know there are some different function options for partial matches (eg agrep and pmatch) and for merging data files but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name formats and d) retain unique values even if they don't have a match.

例如,我有以下两个数据文件:

For example, I have the following two data files:

文件名:员工数据(R中的df1)

File name: Employee Data (df1 in R)

       Full.Name Date.Started Orders
1  ANGELA MUIR     6/15/14 25     44
2  EILEEN COWIE      6/15/14      40
3  LAURA CUMMING     10/6/14      43
4    ELENA POPA       1/21/15     37
5 KAREN MACEWAN       3/15/99     39

文件名:评估数据(R中的df2)

File name: Assessment data (df2 in R)

           Candidate Leading.Factor SI.D SI.I
1         Angie muir              I   -3   12
2      Caroline Burn              S   -5   -3
3  Eileen Mary Cowie              S   -5    5
4         Elena Pope              C   -4    7
5     Henry LeFeuvre              C   -5   -1
6      Jennifer Ford              S   -3   -2
7       Karen McEwan              I   -4   10
8      Laura Cumming              S    0    6
9       Mandip Johal              C   -2    2
10   Mubarak Hussain              D    6   -1

我想根据名称(df1中的全名和df2中的候选)将它们合并,而忽略中间名(例如Eilen Cowie = Eileen Mary Cowie),多余的空格(Laura Cumming = Laura Cumming);拼写错误(例如Elena Popa = Elena Pope)等.

I want to merge them based on names (Full Name in df1 and Candidate in df2) ignoring middle name (eg Eilen Cowie = Eileen Mary Cowie), extra spaces (Laura Cumming = Laura Cumming); misspells (e.g. Elena Popa = Elena Pope) etc.

理想的输出应如下所示:

The ideal output would look like this:

              Name      Full.Name         Candidate Date.Started Orders Leading.Factor SI.D SI.I
1    ANGELA MUIR    ANGELA MUIR          Angie muir   6/15/14 25     44              I   -3   12
2    EILEEN COWIE   EILEEN COWIE  Eileen Mary Cowie     6/15/14      40              S   -5    5
3    LAURA CUMMING  LAURA CUMMING     Laura Cumming     10/6/14      43              S    0    6
4      ELENA POPA     ELENA POPA         Elena Pope      1/21/15     37              C   -4    7
5   KAREN MACEWAN  KAREN MACEWAN       Karen McEwan      3/15/99     39              I   -4   10
6    Caroline Burn            N/A     Caroline Burn          N/A    N/A              S   -5   -3
7   Henry LeFeuvre            N/A    Henry LeFeuvre          N/A    N/A              C   -5   -1
8    Jennifer Ford            N/A     Jennifer Ford          N/A    N/A              S   -3   -2
9     Mandip Johal            N/A      Mandip Johal          N/A    N/A              C   -2    2
10 Mubarak Hussain            N/A   Mubarak Hussain          N/A    N/A              D    6   -1

任何建议将不胜感激!

推荐答案

对于第一步,我建议分两个阶段进行.

For a first pass, I would suggest a two-stage process.

首先,清洁您的琴弦.标准化大小写,删除多余的空格,删除所有不需要的字符.我用于相当积极的清洁的功能如下:

First, clean your strings. Normalize the casing, strip out extra spaces, strip out any unwanted characters. A function I use for a fairly aggressive cleaning is below:

stringCleaning <- function(x) {
#   x <- stringr::str_trim(x)
#   x <- tolower(x)
#   x <- gsub("\\s+", " ", x)
#   x <- gsub("[^[:space:]A-z0-9]", "", x)
  stringr::str_trim(tolower(gsub("\\s+", " ", gsub("[^[:space:]A-z0-9]", "", x))))
}

这会将字符串转换为小写字母,去除所有非字母数字或字符串字符,去除多余的空格,并删除字符串两侧的空格.

This converts strings to lowercase, strips out any non-alphanumeric or string characters, strips out extra spaces, and removes spaces on either side of the string.

第二,使用Levenshtein(或编辑)距离查找最接近的匹配项. stringdist软件包包括一个简单的距离计算器来帮助您.

Two, use Levenshtein (or edit) distances to find your closest matches. The stringdist package includes a simple distance calculator to help you.

stringdist::stringdist('your mother', c("bellow", "your mom", 'yourmother'))
min(stringdist::stringdist('your mother', c("bellow", "your mom", 'yourmother')))
which.min(stringdist::stringdist('your mother', c("bellow", "your mom", 'yourmother')))

您可以使用此功能在另一个数据框中找到最合适的匹配项.

You can use this function to find the most appropriate match in another dataframe.

df1 <- data.frame(name = c("Jena Stars", "Gina Starz"))
df2 <- data.frame(name = c("gina starz", "Jena starz  "))

df1$clean <- stringCleaning(df1$name)
df2$clean <- stringCleaning(df2$name)

df1$check <- df2$name[sapply(df1$clean, function(x) {
  which.min(stringdist::stringdist(x, df2$clean))
  })]
df1

一个小例子,但我希望它会有所帮助.

A small example, but I hope it's helpful.

这篇关于R-基于不一致的全名格式的部分匹配来合并两个数据文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆