Create a unique ID by fuzzy matching of names (via agrep using R)

Question

Using R, I am trying to match people's names in a dataset structured by year and city. Because of some spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy match the names.

A sample chunk of the dataset is structured as follows:

df <- data.frame(
  matrix(
    c(rep("1200013", 8),
      "1996", "1996", "1996", "1996", "2000", "2000", "2004", "2004",
      "AGUSTINHO FORTUNATO FILHO", "ANTONIO PEREIRA NETO",
      "FERNANDO JOSE DA COSTA", "PAULO CEZAR FERREIRA DE ARAUJO",
      "PAULO CESAR FERREIRA DE ARAUJO", "SEBASTIAO BOCALOM RODRIGUES",
      "JOAO DE ALMEIDA", "PAULO CESAR FERREIRA DE ARAUJO"),
    ncol = 3,
    dimnames = list(1:8, c("citycode", "year", "candidate"))
  ),
  # R >= 4.0 no longer converts strings to factors by default;
  # the code below expects df$candidate to start out as a factor
  stringsAsFactors = TRUE
)

Printed as a table:

  citycode year                      candidate
1  1200013 1996      AGUSTINHO FORTUNATO FILHO
2  1200013 1996           ANTONIO PEREIRA NETO
3  1200013 1996         FERNANDO JOSE DA COSTA
4  1200013 1996 PAULO CEZAR FERREIRA DE ARAUJO
5  1200013 2000 PAULO CESAR FERREIRA DE ARAUJO
6  1200013 2000    SEBASTIAO BOCALOM RODRIGUES
7  1200013 2004                JOAO DE ALMEIDA
8  1200013 2004 PAULO CESAR FERREIRA DE ARAUJO

I'd like to check, in each city separately, whether any candidates appear in several years. In the example,

PAULO CEZAR FERREIRA DE ARAUJO

PAULO CESAR FERREIRA DE ARAUJO

are the same candidate, appearing in 1996, 2000 and 2004 under two spellings. Each candidate across the entire dataset should be assigned a unique numeric candidate ID. The dataset is fairly large (5,500 cities, approx. 100K entries), so reasonably efficient code would be helpful. Any suggestions as to how to implement this?
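To make the matching step concrete, here is a small sketch of what agrep() does with the two spellings from the sample data. The `candidates` vector and the `hits` name are only illustrative, and `max.distance = 0.1` is simply agrep()'s documented default written out explicitly; a real dataset may need a different tolerance.

```r
# Candidate names from the sample data, in row order
candidates <- c(
  "AGUSTINHO FORTUNATO FILHO",
  "ANTONIO PEREIRA NETO",
  "FERNANDO JOSE DA COSTA",
  "PAULO CEZAR FERREIRA DE ARAUJO",
  "PAULO CESAR FERREIRA DE ARAUJO",
  "SEBASTIAO BOCALOM RODRIGUES",
  "JOAO DE ALMEIDA",
  "PAULO CESAR FERREIRA DE ARAUJO"
)

# Positions of all names that approximately contain the pattern, allowing
# edits up to 10% of the pattern length (the default, spelled out here)
hits <- agrep("PAULO CEZAR FERREIRA DE ARAUJO", candidates, max.distance = 0.1)

# Show the distinct spellings that were matched
print(unique(candidates[hits]))
```

Because agrep() matches the pattern against substrings of each element, short patterns can match inside longer, unrelated names, so it is worth spot-checking the hits on real data.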

Here is my attempt (with help from the comments so far), which is very slow at the task at hand. Any suggestions for improving it?

# Map every factor level to the first level it fuzzy-matches
f <- function(x) {
  matches <- lapply(levels(x), agrep, x = levels(x), fixed = TRUE, value = FALSE)
  levels(x) <- levels(x)[unlist(lapply(matches, function(m) m[1]))]
  x
}

temp <- tapply(df$candidate, df$citycode, f, simplify = TRUE)
df$candidatenew <- unlist(temp)
# 1 if the fuzzy match changed the name, 0 otherwise
df$spellerror <- ifelse(as.character(df$candidate) == as.character(df$candidatenew), 0, 1)

EDIT 2: It now runs at a good speed. The problem was that every step compared against far too many factor levels (thanks for pointing that out, Blue Magister): a factor subset keeps the levels of the whole column, so each city was matched against every name in the dataset. Restricting the comparison to the candidates in one group (i.e. one city) processes 80,000 lines in 5 seconds, a speed I can live with.

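# Work with character vectors so each group only carries its own names
df$candidate <- as.character(df$candidate)

# Map every name in one city to the first name (in sorted order) it fuzzy-matches
f <- function(x) {
  x <- as.factor(x)
  matches <- lapply(levels(x), agrep, x = levels(x), fixed = TRUE, value = FALSE)
  levels(x) <- levels(x)[unlist(lapply(matches, function(m) m[1]))]
  as.character(x)
}

temp <- tapply(df$candidate, df$citycode, f, simplify = TRUE)
# unlist() concatenates the groups in citycode order,
# so this assumes df is already sorted by citycode
df$candidatenew <- unlist(temp)
# 1 if the fuzzy match changed the name, 0 otherwise
df$spellerror <- ifelse(as.character(df$candidate) == as.character(df$candidatenew), 0, 1)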
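The per-city idea above can also be written without relying on the row order, as in the following sketch. It is not the code from the question: `canonical_names` is a hypothetical helper, it works on the unique names of a group rather than on factor levels, and it uses ave() so that results are written back to the rows they came from even if the data is not sorted by city. It assumes `candidate` is a character column with no missing or empty names.

```r
# Sample data from the question, with names kept as character strings
df <- data.frame(
  citycode = rep("1200013", 8),
  year = c("1996", "1996", "1996", "1996", "2000", "2000", "2004", "2004"),
  candidate = c(
    "AGUSTINHO FORTUNATO FILHO", "ANTONIO PEREIRA NETO",
    "FERNANDO JOSE DA COSTA", "PAULO CEZAR FERREIRA DE ARAUJO",
    "PAULO CESAR FERREIRA DE ARAUJO", "SEBASTIAO BOCALOM RODRIGUES",
    "JOAO DE ALMEIDA", "PAULO CESAR FERREIRA DE ARAUJO"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each name by the first distinct name
# in the same group that it fuzzy-matches
canonical_names <- function(x, max.distance = 0.1) {
  u <- unique(x)
  first_hit <- vapply(
    u,
    function(p) agrep(p, u, max.distance = max.distance)[1],
    integer(1),
    USE.NAMES = FALSE
  )
  # Translate the per-unique-name result back to every row of the group
  u[first_hit][match(x, u)]
}

# Apply the helper within each city; ave() keeps the original row order
df$candidatenew <- ave(df$candidate, df$citycode, FUN = canonical_names)
# 1 if the fuzzy match changed the name, 0 otherwise
df$spellerror <- as.integer(df$candidate != df$candidatenew)
```

Note that this version keeps whichever spelling appears first in the data, whereas the factor-based version keeps the alphabetically first one.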

Recommended answer

Here's my shot at it. It's probably not very efficient, but I think it will get the job done. I assume that df$candidate is of class factor.

# Fuzzy-match each candidate name against the names that follow it in the
# level order, so that each pair of names is compared only once
lv <- levels(df[["candidate"]])
matches <- unlist(lapply(seq_len(length(lv) - 1), function(i) {
  # agrep() returns positions relative to lv[-seq_len(i)], so add i back;
  # max() falls back to i itself when nothing matches
  max(i, i + agrep(pattern = lv[i], x = lv[-seq_len(i)]))
}))
# Assign the new levels (the last level is omitted because it cannot change)
levels(df[["candidate"]])[-length(lv)] <- lv[matches]
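Once the spellings have been merged, the numeric ID asked for in the question still has to be derived. One possible sketch is shown below; the column names `candidatenew` and `candidate_id` and the `"|"` separator are illustrative choices. It assumes that a candidate is identified by city plus cleaned name, so that identical names in different cities get different IDs, which follows the question's wish to check each city separately.

```r
# Sample data after the spelling variants have been merged
df <- data.frame(
  citycode = rep("1200013", 8),
  candidatenew = c(
    "AGUSTINHO FORTUNATO FILHO", "ANTONIO PEREIRA NETO",
    "FERNANDO JOSE DA COSTA", "PAULO CEZAR FERREIRA DE ARAUJO",
    "PAULO CEZAR FERREIRA DE ARAUJO", "SEBASTIAO BOCALOM RODRIGUES",
    "JOAO DE ALMEIDA", "PAULO CEZAR FERREIRA DE ARAUJO"
  ),
  stringsAsFactors = FALSE
)

# Build one key per (city, cleaned name) pair
key <- paste(df$citycode, df$candidatenew, sep = "|")

# Number the distinct keys in order of first appearance
df$candidate_id <- match(key, unique(key))
```

With a factor column, as in the answer above, as.integer(df$candidate) would give IDs directly, but those would be shared across cities.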
