基于列值重复的函数的 R 函数 [英] R function for a function to be repeated based on column values

查看：26 发布时间：2022/1/13 19:33:34 r function split data.table

本文介绍了基于列值重复的函数的 R 函数的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个需要进行字符串匹配的大型数据集.我从这个站点获得了一些非常有用的帖子，并引用它们我创建了一个函数来为我的数据集进行字符串匹配.我正在粘贴我的示例数据和代码.

I have a large data set for which I need to do string matching. I have got some very useful posts from this site and referring them I have created a function to do the string matching for my dataset. I am pasting my sample data and code.

样本数据

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')

data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]

代码

fn.fuzzymatch<-function(dat1,dat2,string1,string2,meth){

  dist.name<-stringdistmatrix(dat1[[string1]],dat2[[string2]],method = meth)

  min.name<-apply(dist.name, 1, min)

  match.s1.s2<-NULL
  for(i in 1:nrow(dist.name))
  {
    s2.i<-match(min.name[i],dist.name[i,])
    s1.i<-i
    match.s1.s2<-rbind(data.frame(s1_row=s1.i,s2_row=s2.i,s1name=dat1[s1.i,][[string1]],s2name=dat2[s2.i,][[string2]], dist=min.name[i]),match.s1.s2)
  }
  output <- (match.s1.s2)[order(match.s1.s2$s1_row),]
  return(output)
}


match_50 <- fn.fuzzymatch(data1,data2,"Address1","Address2","dl")

这适用于国家/地区级别的数据，但是我在区域级别有多个数据文件，并且每个区域都有多个区域.每个区域的区号可通过 data1 中的 AREACODE 变量和 data2 中的 AREA_CODE 变量获得.我想更新我的功能，以便

This is working fine for the data at country level, but then I have multiple data files at region level and each region is having multiple areas. Areacode for each region is available by the AREACODE variable in data1 and AREA_CODE variable in data2. I want to update my function so that

对每个区域进行字符串匹配，输出具有该区域代码
针对该区域中的所有区号合并的每个区域返回输出.

我试图使用 split 并将数据文件转换为列表并使用，然后使用 rbindlist 将它们组合但无法成功并且遇到了不同类型的错误.我确信有办法做到这一点，但无法做到.希望能给点建议.

I was trying to use split and to convert the data files into list and use and then use rbindlist to combine them but not able to succeed and have been getting different kinds of errors. I am sure there is a way to do this but not able to get it. Hope I can have some suggestions.

推荐答案

虽然您可能可以使用 apply 函数来重复不同区域的单独数据文件，但这里有一个基于我的 fuzzyjoin 解决方案回答您的上一个问题.

While you can probably use an apply function to repeat over separate data files of different regions, here is a fuzzyjoin solution based on my answer to your previous question.

它为 Address 寻找最佳的 stringdist 匹配，并且 AreaCode 必须完全匹配 (==).我还指定 year2 必须是 >= year1，只是为了演示.

It looks for the best stringdist match for Address and the AreaCode must match exactly (==). I also specified year2 had to be >= year1, just for demonstration.

最后，我使用 dplyr::group_by 和 dplyr::top_n 来获得最小距离匹配，我不得不假设在匹配关系中要做什么(挑选匹配最大的年份2).您还可以使用 slice_min 替换旧的 top_n，如果原始顺序很重要且不按字母顺序排列，请使用 mutate(rank = row_number(dist)) %>% filter(rank == 1)

Finally, I used dplyr::group_by and dplyr::top_n to get the minimum distance matches and I had to assume what to do in matching ties (picked matches with largest year2). You can also use slice_min which replaces the older top_n and if the original order is important and not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1)

数据:

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')

data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]

解决方案:

library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)

# First, need to define match_fun_stringdist 
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {
  
  # Can't pass these parameters in from fuzzy_join because of multiple incompatible match_funs, so I set them here.
  ignore_case = FALSE
  method = "dl"
  max_dist = 99
  distance_col = "dist"
  
  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }
  
  # shortcut for Levenshtein-like methods: if the difference in
  # string length is greater than the maximum string distance, the
  # edit distance must be at least that large
  
  # length is much faster to compute than string distance
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist
    
    dists <- rep(NA, length(v1))
    
    dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
  } else {
    # have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method)
  }
  ret <- dplyr::data_frame(include = (dists <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}

# Finally, call fuzzy_join
fuzzy_join(data1, data2, 
           by = list(x = c("Address1", "AREACODE", "Year1"), y = c("Address2", "AREA_CODE", "Year2")), 
           match_fun = list(match_fun_stringdist, `==`, `<=`),
           mode = "left"
           ) %>%
  group_by(Address1, Year1, AREACODE) %>%
  top_n(1, -Address1.dist) %>%
  top_n(1, Year2) %>%
  select(unique_id, Address1.dist, everything())

这篇关于基于列值重复的函数的 R 函数的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

基于列值重复的函数的 R 函数 [英] R function for a function to be repeated based on column values

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

基于列值重复的函数的 R 函数 [英] R function for a function to be repeated based on column values

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭