R: Record linkage problem with all fields combined in one column


Question

I have to match column a from dataset A to column b in dataset B. However, the different variables are not in separate fields (columns a, b, c); they are all combined in a single field.

I have been looking at the packages RecordLinkage and fastLink. They work well when the fields are separated.

Separate fields:

# make dataframe 1
fname <- c("ash", "aalok", "aaron", "adam", "adrian", "ajay")
lname <- c("perry", "phillips", "picardo", "pinck", "pinnick-flood", "pledger")
dob <- c(1957, 1971, 1948, 1961, 1972, 2000)
city <- c("Oakland", "Piedmont", "Pleasanton", "San Leandro", "San Lorenzo", "Melbourne")
street <- c(" 100th ave", " 107th ave", " 10th ave", " 159th ave", " 165th ave apt 112", " 167th ave")    

# make dataframe 2
fname2 <- c("ashley", "aaloknath", "aron", "adam", "adrian", "ajaay")
lname2 <- c("perry", "philips", "picardo", "pinnck", "pinnick flood", "pleedger")
dob2 <- c(1950, 1971, 1948, 1900, 1972, 2000)
city2 <- c("Oakland City", "Piedmont", "Pleasanton city", "San Leandro", "San Lorenzo", "Melbourne")
street2 <- c(" 100 ave", " 107th ave", " 100 ave", " 159th ave", " 1652 ave apt 112", " 167th")


df1 <- data.frame(fname, lname, dob, city, street)
df2 <- data.frame(fname2, lname2, dob2, city2, street2)

# change order of rows
df2 <- df2[c(6, 3, 2, 4, 5, 1), ]

# columns must have same name
names(df2) <- names(df1)

fastLink example

library(stringdist)
library(Rcpp)
library(fastLink)

matches.out <- fastLink(
  dfA = df1,
  dfB = df2,
  varnames = c("fname", "lname", "dob", "city", "street"),
  stringdist.match = c("fname", "lname", "city", "street"),
  numeric.match = "dob"
)


> matches.out$matches
  inds.a inds.b
1      6      1
2      3      2
3      2      3
4      4      4
5      5      5
6      1      6
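
To read the table above: each row of `matches.out$matches` pairs a row index in `df1` (`inds.a`) with a row index in `df2` (`inds.b`). Below is a minimal sketch of lining the paired rows up. It assumes the index table is copied by hand from the printed output, and it rebuilds only the first-name columns, so it runs without fastLink installed.

```r
# Index pairs copied by hand from the printed fastLink output above.
matches <- data.frame(inds.a = c(6, 3, 2, 4, 5, 1),
                      inds.b = c(1, 2, 3, 4, 5, 6))

# First names from the two data frames; fname2 gets the same row shuffle as df2.
fname  <- c("ash", "aalok", "aaron", "adam", "adrian", "ajay")
fname2 <- c("ashley", "aaloknath", "aron", "adam", "adrian", "ajaay")
fname2 <- fname2[c(6, 3, 2, 4, 5, 1)]

# Put the linked records side by side by indexing each side with its column.
paired <- data.frame(a = fname[matches$inds.a],
                     b = fname2[matches$inds.b],
                     stringsAsFactors = FALSE)
print(paired)
```

The same indexing works on the full data frames, e.g. `cbind(df1[matches$inds.a, ], df2[matches$inds.b, ])`.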

RecordLinkage example

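library(RecordLinkage)

# Build comparison patterns for all record pairs between the two data frames
a <- compare.linkage(df1, df2)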

# Calculate M and U weights using the EM algorithm
b <- emWeights(a, cutoff = 0.8)
summary(b)

allPairs <- getPairs(b)
head(allPairs)

# Determine two thresholds
finalPairs <- getPairs(b, max.weight = 100, min.weight = 0)
head(finalPairs)

  id  fname         lname  dob        city             street             Weight
1  4   adam         pinck 1961 San Leandro          159th ave                   
2  4   adam        pinnck 1900 San Leandro          159th ave  30.35517990295238
3                                                                               
4  5 adrian pinnick-flood 1972 San Lorenzo  165th ave apt 112                   
5  5 adrian pinnick flood 1972 San Lorenzo   1652 ave apt 112  24.99000510744970

The problem with combined fields:

matchA <- c("ash perry 1957 Oakland 100th ave", "aalok  1971 phillips Piedmont 107th ave", "aaron picardo Pleasanton 1948 10th ave")
df3 <- data.frame(matchA)

matchB <- c("1950 picard aron Pleasanton City 10 ave", "aalok   philips Piedmont 1971 107th ave", "ashley perry Oakland City 1950 100 ave")
df4 <- data.frame(matchB)

I expect the records to be matched even though everything is in a single field, and regardless of the order of the names, city and dob.

Answer

Not sure what you mean. I get similar results even with "combined columns":

library(fastLink)
data(samplematch)

fastLink::fastLink(dfA = dfA, dfB = dfB, 
                   varnames = colnames(dfA),
                   return.all = TRUE) -> out

# concatenate all columns
apply(dfA, 1, paste, collapse = " ")-> dfAall
dfA$all <- dfAall
apply(dfB, 1, paste, collapse = " ")-> dfBall
dfB$all <- dfBall

# run on concatenated column
fastLink::fastLink(dfA = dfA, dfB = dfB, 
                   varnames = c('all'),
                   return.all = TRUE) -> out2

# compare results
out$matches == out2$matches
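
Note that the concatenation above keeps the columns in the same order on both sides, whereas the strings in the question have their tokens in varying order. One possible way to reduce the effect of token order, offered as a sketch rather than as part of the original answer, is to lower-case each string, split it on whitespace, sort the tokens and re-join them before comparing. The helper `normalise_tokens` is a hypothetical name. The comparison uses base R's `adist` (generalised Levenshtein distance) scaled by the longer key's length, an arbitrary choice made here to avoid extra dependencies.

```r
# Hypothetical helper: lower-case, split on whitespace, sort tokens, re-join.
# method = "radix" sorts in the C locale, so results do not depend on the
# session's collation settings.
normalise_tokens <- function(x) {
  vapply(strsplit(tolower(trimws(x)), "\\s+"),
         function(tok) paste(sort(tok, method = "radix"), collapse = " "),
         character(1))
}

matchA <- c("ash perry 1957 Oakland 100th ave",
            "aalok  1971 phillips Piedmont 107th ave",
            "aaron picardo Pleasanton 1948 10th ave")
matchB <- c("1950 picard aron Pleasanton City 10 ave",
            "aalok   philips Piedmont 1971 107th ave",
            "ashley perry Oakland City 1950 100 ave")

keyA <- normalise_tokens(matchA)
keyB <- normalise_tokens(matchB)

# Edit distance between every pair of normalised keys, divided by the length
# of the longer key so long and short records are comparable.
d      <- adist(keyA, keyB)
d_norm <- d / outer(nchar(keyA), nchar(keyB), pmax)

# For each record in matchA, take the closest record in matchB.
best <- unname(apply(d_norm, 1, which.min))
result <- data.frame(matchA,
                     matchB = matchB[best],
                     dist   = d_norm[cbind(seq_along(best), best)],
                     stringsAsFactors = FALSE)
print(result)
```

The normalised keys could also be placed in a single column and passed to fastLink with `varnames`, as in the answer's second call. Whether that links well on real data would need checking, and a distance cutoff is advisable because `which.min` always returns some candidate, even a poor one.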
