基于列值重复的函数的 R 函数 [英] R function for a function to be repeated based on column values
问题描述
我有一个需要进行字符串匹配的大型数据集.我从这个站点获得了一些非常有用的帖子,并引用它们我创建了一个函数来为我的数据集进行字符串匹配.我正在粘贴我的示例数据和代码.
I have a large data set for which I need to do string matching. I have got some very useful posts from this site and referring them I have created a function to do the string matching for my dataset. I am pasting my sample data and code.
样本数据
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')
data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]
代码
fn.fuzzymatch<-function(dat1,dat2,string1,string2,meth){
dist.name<-stringdistmatrix(dat1[[string1]],dat2[[string2]],method = meth)
min.name<-apply(dist.name, 1, min)
match.s1.s2<-NULL
for(i in 1:nrow(dist.name))
{
s2.i<-match(min.name[i],dist.name[i,])
s1.i<-i
match.s1.s2<-rbind(data.frame(s1_row=s1.i,s2_row=s2.i,s1name=dat1[s1.i,][[string1]],s2name=dat2[s2.i,][[string2]], dist=min.name[i]),match.s1.s2)
}
output <- (match.s1.s2)[order(match.s1.s2$s1_row),]
return(output)
}
match_50 <- fn.fuzzymatch(data1,data2,"Address1","Address2","dl")
这适用于国家/地区级别的数据,但是我在区域级别有多个数据文件,并且每个区域都有多个区域.每个区域的区号可通过 data1 中的 AREACODE 变量和 data2 中的 AREA_CODE 变量获得.我想更新我的功能,以便
This is working fine for the data at country level, but then I have multiple data files at region level and each region is having multiple areas. Areacode for each region is available by the AREACODE variable in data1 and AREA_CODE variable in data2. I want to update my function so that
- 对每个区域进行字符串匹配,输出具有该区域代码
- 针对该区域中的所有区号合并的每个区域返回输出.
我试图使用 split 并将数据文件转换为列表并使用,然后使用 rbindlist 将它们组合但无法成功并且遇到了不同类型的错误.我确信有办法做到这一点,但无法做到.希望能给点建议.
I was trying to use split and to convert the data files into list and use and then use rbindlist to combine them but not able to succeed and have been getting different kinds of errors. I am sure there is a way to do this but not able to get it. Hope I can have some suggestions.
推荐答案
虽然您可能可以使用 apply 函数来重复不同区域的单独数据文件,但这里有一个基于我的 fuzzyjoin
解决方案回答您的 上一个问题.
While you can probably use an apply function to repeat over separate data files of different regions, here is a fuzzyjoin
solution based on my answer to your previous question.
它为 Address 寻找最佳的 stringdist
匹配,并且 AreaCode 必须完全匹配 (==
).我还指定 year2 必须是 >=
year1,只是为了演示.
It looks for the best stringdist
match for Address and the AreaCode must match exactly (==
). I also specified year2 had to be >=
year1, just for demonstration.
最后,我使用 dplyr::group_by
和 dplyr::top_n
来获得最小距离匹配,我不得不假设在匹配关系中要做什么(挑选匹配最大的年份2).您还可以使用 slice_min
替换旧的 top_n
,如果原始顺序很重要且不按字母顺序排列,请使用 mutate(rank = row_number(dist)) %>% filter(rank == 1)
Finally, I used dplyr::group_by
and dplyr::top_n
to get the minimum distance matches and I had to assume what to do in matching ties (picked matches with largest year2). You can also use slice_min
which replaces the older top_n
and if the original order is important and not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1)
数据:
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')
data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]
解决方案:
library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)
# First, need to define match_fun_stringdist
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {
# Can't pass these parameters in from fuzzy_join because of multiple incompatible match_funs, so I set them here.
ignore_case = FALSE
method = "dl"
max_dist = 99
distance_col = "dist"
if (ignore_case) {
v1 <- stringr::str_to_lower(v1)
v2 <- stringr::str_to_lower(v2)
}
# shortcut for Levenshtein-like methods: if the difference in
# string length is greater than the maximum string distance, the
# edit distance must be at least that large
# length is much faster to compute than string distance
if (method %in% c("osa", "lv", "dl")) {
length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
include <- length_diff <= max_dist
dists <- rep(NA, length(v1))
dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
} else {
# have to compute them all
dists <- stringdist::stringdist(v1, v2, method = method)
}
ret <- dplyr::data_frame(include = (dists <= max_dist))
if (!is.null(distance_col)) {
ret[[distance_col]] <- dists
}
ret
}
# Finally, call fuzzy_join
fuzzy_join(data1, data2,
by = list(x = c("Address1", "AREACODE", "Year1"), y = c("Address2", "AREA_CODE", "Year2")),
match_fun = list(match_fun_stringdist, `==`, `<=`),
mode = "left"
) %>%
group_by(Address1, Year1, AREACODE) %>%
top_n(1, -Address1.dist) %>%
top_n(1, Year2) %>%
select(unique_id, Address1.dist, everything())
这篇关于基于列值重复的函数的 R 函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!