Merging through fuzzy matching of variables in R

Question

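I have two data frames (x and y) where the IDs are student_name, father_name and mother_name. Because of typographical errors ("n" instead of "m", random white spaces, etc.), about 60% of the values do not align, even though I can eyeball the data and see that they should. Is there a way to reduce the level of non-matching so that manual editing would at least be feasible? The data frames have about 700K observations.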

R would be best. I know a little bit of Python and some basic Unix tools. P.S. I read up on agrep(), but I don't understand how it can work on actual datasets, especially when the match is over more than one variable.


Update (data for posted bounty):

Here are two example data frames, sites_a and sites_b. They could be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) just lat + lon, b) sitename or c) both.

You can source the file test_sites.R, which is posted as a gist.

Ideally the answer would end with

merge(sites_a, sites_b, by = **magic**)

Solution

The agrep function (part of base R), which does approximate string matching using the generalized Levenshtein edit distance, is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution, but here is a suggestion. It records matches in a separate list (if there are multiple equally good matches, these are recorded as well). Let's say that your data.frame is called df:

## one list slot per row; each slot holds the matching row indices in df
l <- vector('list', nrow(df))
matches <- list(mother = l, father = l)
for (i in seq_len(nrow(df))) {
  ## first try an exact match
  father_id <- with(df, which(student_name[i] == father_name))
  if (length(father_id) == 1) {
    matches[['father']][[i]] <- father_id
  } else {
    old_father_id <- NULL
    ## tighten the tolerance step by step until the match is unique
    for (m in 10:1) { ## m is the maximum edit distance
      father_id <- with(df, agrep(student_name[i], father_name, max.distance = m))
      if (length(father_id) == 1) {
        ## unique match: record it and stop
        matches[['father']][[i]] <- father_id
        break
      } else if (length(father_id) == 0) {
        ## no match at this distance: fall back to the multiple matches found
        ## at the previous, looser distance, if there were any. If even
        ## m = 10 gives nothing, the slot stays NULL.
        if (length(old_father_id) > 0) matches[['father']][[i]] <- old_father_id
        break
      } else if (m == 1) {
        ## last round and still several candidates: record them all
        matches[['father']][[i]] <- father_id
        break
      }
      old_father_id <- father_id ## remember the candidates from this round
    }
  }
}

The code for the mother_name would be basically the same. You could even put them together in a loop, but this example is just for the purpose of illustration.
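
For the sites_a / sites_b case from the update, here is a rough sketch of one way to end up with a plain merge() call. It is not the method above: it swaps agrep for adist (also base R, in utils), which returns the full matrix of edit distances so the closest candidate can be picked directly. The toy data, the column yield, the helper clean() and the tolerances max_edits and max_deg are all made up for illustration, since the contents of test_sites.R are not shown here. The idea is to compute a repaired key for sites_b and then merge on it.

```r
# Hypothetical toy data standing in for the real sites_a / sites_b.
sites_a <- data.frame(
  sitename = c("Blue Lake", "Red River", "Green Hill"),
  lat = c(40.10, 41.20, 42.30),
  lon = c(-88.10, -89.20, -90.30),
  stringsAsFactors = FALSE
)
sites_b <- data.frame(
  sitename = c("Bleu Lake", "Red  River", "Gren Hill", "Far Away"),
  lat = c(40.11, 41.19, 42.31, 10.00),
  lon = c(-88.09, -89.21, -90.29, 20.00),
  yield = c(1.5, 2.5, 3.5, 9.9),
  stringsAsFactors = FALSE
)

# Normalise names first: lower-case, trim, collapse runs of white space.
clean <- function(x) gsub("\\s+", " ", trimws(tolower(x)))

# b) sitename: closest name in sites_a for every row of sites_b,
#    measured by generalized Levenshtein distance.
d <- adist(clean(sites_b$sitename), clean(sites_a$sitename))
name_best <- apply(d, 1, which.min)
name_dist <- d[cbind(seq_len(nrow(d)), name_best)]
max_edits <- 2  # assumed tolerance: at most 2 edits
key_by_name <- ifelse(name_dist <= max_edits,
                      sites_a$sitename[name_best], NA_character_)

# a) lat + lon: nearest site in sites_a, using plain Euclidean distance
#    in degrees (a crude approximation, fine for nearby points).
geo <- sqrt(outer(sites_b$lat, sites_a$lat, "-")^2 +
            outer(sites_b$lon, sites_a$lon, "-")^2)
geo_best <- apply(geo, 1, which.min)
geo_dist <- geo[cbind(seq_len(nrow(geo)), geo_best)]
max_deg <- 0.05  # assumed tolerance in degrees
key_by_geo <- ifelse(geo_dist <= max_deg,
                     sites_a$sitename[geo_best], NA_character_)

# c) both: keep a key only where the two methods agree
#    (a comparison involving NA gives NA, so the key stays NA).
key_by_both <- ifelse(key_by_name == key_by_geo, key_by_name, NA_character_)

# Attach the repaired key to a copy of sites_b and do an ordinary merge.
b <- sites_b
names(b)[names(b) == "sitename"] <- "sitename_b"  # keep the original spelling
b$sitename <- key_by_both                         # or key_by_name / key_by_geo
merged <- merge(sites_a, b, by = "sitename", suffixes = c("_a", "_b"))
```

Rows of sites_b without an acceptable match get an NA key and drop out of the default inner join, which leaves a much smaller set to fix by hand. Note that adist builds a full distance matrix, so with around 700K rows per table you would need to split the data into blocks first (for example by first letter or by rounded coordinates).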
