Fuzzy and exact match of two databases

Views: 55 | Published: 2021/6/23 19:07:12 | Tags: r, purrr, parallel.foreach, fuzzyjoin, furrr

Problem description

I have two databases. The first one has about 70k rows and 3 columns. The second one has 790k rows and 2 columns. Both databases have a common variable, grantee_name. I want to match each row of the first database to one or more rows of the second database based on this grantee_name. Note that merge will not work because the grantee_name values do not match perfectly: there are different spellings, etc. So, I am using the fuzzyjoin package and trying the following:

library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy <- read_dta("/path/forfuzzy.dta")
filings <- read_dta("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
  grantee_name                 grantee_city grantee_state
  <chr>                        <chr>        <chr>        
1 (ICS)2 MAINE CHAPTER         CLEARWATER   FL           
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT   NY           
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER  MD           
4 10 CAN                       NEWBERRY     FL           
5 10 THOUSAND WINDOWS          LIVERMORE    CA           
6 100 BLACK MEN IN CHICAGO INC CHICAGO      IL   
... 7 - 70000 rows to go

> head(filings)
# A tibble: 6 x 2
  grantee_name                       ein 
  <chr>                             <dbl>               
1 ICS-2 MAINE CHAPTER              123456             
2 SUFFOLK COUNTY VANDERBILT        654321            
3 VOICE TREKKING A FUND OF VOICES  789456            
4 10 CAN                           654987               
5 10 THOUSAND MUSKETEERS INC       789123               
6 100 BLACK MEN IN HOUSTON INC     987321      

rows 7-790000 omitted for brevity

The examples above are enough to show some good matches and some not-so-good ones. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC, but that does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.

So, I have tried the following:

df<-as.data.frame(stringdist_inner_join(forfuzzy, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))

I am totally new to R. This results in an error: cannot allocate vector of size 375GB (with the full database, of course). A sample of 100 rows from forfuzzy always works, so I thought of iterating over the rows in chunks of 100 at a time.
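A rough back-of-the-envelope check suggests why the full join runs out of memory while a 100-row sample does not. The sketch below assumes that every name in forfuzzy is compared with every name in filings and that each pair costs one 8-byte double; both are simplifying assumptions, and fuzzyjoin's actual memory use may differ (for example, if names repeat). The row counts are the approximate figures given in the question.

```r
# Approximate row counts taken from the question.
n_forfuzzy <- 70000
n_filings  <- 790000

# Number of name pairs if every row is compared with every row.
pairs_full <- n_forfuzzy * n_filings

# Assumption: one 8-byte double per pair, expressed in GiB.
gib <- function(n_pairs) n_pairs * 8 / 1024^3

gib_full  <- gib(pairs_full)         # all of forfuzzy against filings
gib_chunk <- gib(100 * n_filings)    # a 100-row chunk against filings

print(c(full = gib_full, chunk = gib_chunk))
```

Under these assumptions the full comparison needs hundreds of GiB for a single vector, which is the same order of magnitude as the reported error, while a 100-row chunk stays below 1 GiB.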

I tried the following:

n <- 100
lst <- split(forfuzzy, cumsum((1:nrow(forfuzzy) - 1) %% n == 0))

df <- as.data.frame(
  lapply(lst, function(df_) {
    stringdist_inner_join(df_, filings, by = "grantee_name", method = "jw", p = 0.1, max_dist = 0.1, distance_col = "distance", nthread = getOption("sd_num_thread"))
  }) %>% bind_rows
)
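To make the chunking step easier to inspect, here is a small sketch of what the split() call produces. The data frame toy is a hypothetical stand-in for forfuzzy, and the ceiling() grouping is just an alternative spelling of the same idea, not something taken from the question.

```r
# Hypothetical stand-in for forfuzzy: 250 rows of made-up names.
toy <- data.frame(grantee_name = paste("ORG", seq_len(250)),
                  stringsAsFactors = FALSE)
n <- 100

# Grouping used in the question: a new group starts at rows 1, 101, 201, ...
grp_cumsum <- cumsum((1:nrow(toy) - 1) %% n == 0)

# Alternative grouping that yields the same chunk numbers.
grp_ceiling <- ceiling(seq_len(nrow(toy)) / n)

lst <- split(toy, grp_cumsum)
chunk_sizes <- unname(sapply(lst, nrow))
print(chunk_sizes)
```

Each element of lst is a data frame of at most n rows, so the last chunk simply holds whatever rows are left over.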

I have also tried the above with mclapply instead of lapply. The same error occurs even on a high-performance cluster with 3 CPUs, each with 480G of memory, using mclapply with the option mc.cores=3. Perhaps a foreach command could help, but I have no idea how to implement it.
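For reference, a foreach version of the chunked join might look roughly like the sketch below. It is only an illustration under several assumptions: the foreach, doParallel, fuzzyjoin and dplyr packages are installed, two worker processes are available, and the tiny forfuzzy and filings tables here are made-up stand-ins for the real files. Each worker receives its own copy of filings, so this does not by itself reduce the memory needed per chunk, and whether it is faster depends on the machine.

```r
library(foreach)
library(doParallel)
library(fuzzyjoin)

# Made-up stand-ins for the real tables (not the data from the question).
forfuzzy <- data.frame(
  grantee_name = c("10 CAN", "100 BLACK MEN IN CHICAGO INC",
                   "SUFFOLK COUNTY VANDERBILT MUSEUM"),
  stringsAsFactors = FALSE
)
filings <- data.frame(
  grantee_name = c("10 CAN", "100 BLACK MEN IN HOUSTON INC",
                   "SUFFOLK COUNTY VANDERBILT"),
  ein = c(654987, 987321, 654321),
  stringsAsFactors = FALSE
)

# Split forfuzzy into chunks of n rows (n = 2 only because the toy data is tiny).
n <- 2
lst <- split(forfuzzy, ceiling(seq_len(nrow(forfuzzy)) / n))

# Start two worker processes and register them as the foreach backend.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# Join each chunk against filings on a worker and stack the results.
matches <- foreach(chunk = lst,
                   .combine = dplyr::bind_rows,
                   .packages = "fuzzyjoin") %dopar% {
  stringdist_inner_join(chunk, filings, by = "grantee_name",
                        method = "jw", p = 0.1, max_dist = 0.1,
                        distance_col = "distance")
}

parallel::stopCluster(cl)
print(matches)
```

Because both tables share the column name grantee_name, the joined result carries grantee_name.x and grantee_name.y alongside ein and distance.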

I have been advised to use the purrr and repurrrsive packages, so I tried the following:

purrr::map(lst, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))

This seems to be working, after I fixed a novice error in the by=grantee_name statement. However, it is taking forever and I am not sure it will finish. A sample of 100 rows from forfuzzy, split with n=10 (so 10 lists of 10 rows each), has been running for 50 minutes with still no results.
