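Merging through fuzzy matching of variables in R
Question
I have two data frames (x and y) where the IDs are student_name, father_name and mother_name. Because of typographical errors ("n" instead of "m", random white spaces, etc.), about 60% of the values do not align, even though I can eyeball the data and see that they should. Is there a way to reduce the level of non-matching so that manual editing at least becomes feasible? The data frames have about 700K observations.
R would be best. I know a little bit of python and some basic unix tools. P.S. I read up on agrep(), but don't understand how it can work on actual datasets, especially when the match is over more than one variable.
Update (data for posted bounty):
Here are two example data frames, sites_a and sites_b. They could be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) just lat + lon, b) sitename or c) both.
You can source the file test_sites.R, which is posted as a gist.
Ideally the answer would end with
merge(sites_a, sites_b, by = **magic**)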
Answer
The agrep function (part of base R), which does approximate string matching using the Levenshtein edit distance, is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution, but here is a suggestion. It records matches in a separate list (if there are multiple equally good matches, these are recorded as well). Let's say that your data.frame is called df:
l <- vector('list', nrow(df))
matches <- list(mother = l, father = l)
for (i in seq_len(nrow(df))) {
  ## first look for a unique exact match
  father_id <- with(df, which(student_name[i] == father_name))
  if (length(father_id) == 1) {
    matches[['father']][[i]] <- father_id
  } else {
    old_father_id <- NULL
    ## tighten the tolerance step by step, looking for a unique match
    for (m in 10:1) { ## m is the maximum distance
      father_id <- with(df, agrep(student_name[i], father_name, max.distance = m))
      if (length(father_id) == 1 || m == 1) {
        ## if we find a unique match or if we are in our last round, then stop
        matches[['father']][[i]] <- father_id
        break
      } else if (length(father_id) == 0 && length(old_father_id) > 0) {
        ## if we can't do better than multiple matches, then record them anyway
        matches[['father']][[i]] <- old_father_id
        break
      } else if (length(father_id) == 0 && length(old_father_id) == 0) {
        ## if the nearest match is more than 10 different from the current pattern, then stop
        break
      }
      ## remember this round's matches before tightening the tolerance further
      old_father_id <- father_id
    }
  }
}
The code for mother_name would be basically the same. You could even put the two together in a loop, but this example is just for the purpose of illustration.
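To connect this back to the merge() call the question asks for, here is a minimal sketch of one possible approach, not a tested solution for the real data. It swaps agrep for adist (also in base R, in the utils package), which returns the full matrix of edit distances, picks the closest name in y for each row of x, and merges on that. The toy values, the clean() helper, the match_name column and the max_edits threshold of 2 are all assumptions made for illustration. With 700K rows the full distance matrix would be far too large, so in practice the data would have to be processed in blocks.

```r
# Hypothetical toy data: column names follow the question, values are made up.
x <- data.frame(student_name = c("John Smith", "Maria Gomez", "Li Wei"),
                score = c(90, 85, 77),
                stringsAsFactors = FALSE)
y <- data.frame(student_name = c("Jonh Smith", "Maria  Gomes", "Peter Pan"),
                grade = c("A", "B", "C"),
                stringsAsFactors = FALSE)

# Hypothetical helper: normalise case and whitespace before comparing.
clean <- function(s) tolower(gsub("\\s+", " ", trimws(s)))

# Edit distance between every name in x (rows) and every name in y (columns).
d <- adist(clean(x$student_name), clean(y$student_name))

# For each row of x, find the closest row of y and its distance.
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_len(nrow(d)), best)]

# Accept the closest candidate only if it is within the assumed threshold.
max_edits <- 2
x$match_name <- ifelse(best_dist <= max_edits,
                       y$student_name[best],
                       NA_character_)

# Merge on the matched key; rows of x without an acceptable match are kept.
merged <- merge(x, y, by.x = "match_name", by.y = "student_name", all.x = TRUE)
```

Rows whose match_name is NA are the ones left for manual editing. The same idea could be extended to several ID columns by summing the distance matrices for student_name, father_name and mother_name before choosing the closest row.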