Fuzzy and exact match of two databases
Question
I have two databases. The first has about 70k rows and 3 columns; the second has 790k rows and 2 columns. Both share a common variable, grantee_name. I want to match each row of the first database to one or more rows of the second based on grantee_name. Note that merge will not work because the grantee_name values do not match perfectly: there are different spellings, etc. So I am using the fuzzyjoin package and trying the following:
library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy<-read_dta("/path/forfuzzy.dta")
filings <- read_dta ("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
grantee_name grantee_city grantee_state
<chr> <chr> <chr>
1 (ICS)2 MAINE CHAPTER CLEARWATER FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER MD
4 10 CAN NEWBERRY FL
5 10 THOUSAND WINDOWS LIVERMORE CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO IL
... 7 - 70000 rows to go
> head(filings)
# A tibble: 6 x 2
grantee_name ein
<chr> <dbl>
1 ICS-2 MAINE CHAPTER 123456
2 SUFFOLK COUNTY VANDERBILT 654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN 654987
5 10 THOUSAND MUSKETEERS INC 789123
6 100 BLACK MEN IN HOUSTON INC 987321
rows 7-790000 omitted for brevity
The examples above are clear enough to provide some good matches and some not-so-good ones. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC, but that does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
So, I have tried the following:
df<-as.data.frame(stringdist_inner_join(forfuzzy, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
I am totally new to R. This results in an error: cannot allocate vector of size 375GB (with the big database, of course). A sample of 100 rows from forfuzzy always works. So I thought of iterating over a list of 100 rows at a time.
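For context on the allocation error, here is a rough back-of-the-envelope sketch of why the full join is so large while a 100-row sample fits. It assumes the join compares every name in forfuzzy against every name in filings and holds one 8-byte value per pair; that is an assumption about the internals, not something stated in the thread. The row counts are the approximate ones quoted above, and the variable names are purely illustrative.

```r
# Approximate row counts quoted in the question.
n_forfuzzy <- 70000
n_filings  <- 790000

# Assumption: one 8-byte double per candidate pair of names.
bytes_per_value <- 8

# All-pairs comparison for the full tables.
pairs_full <- n_forfuzzy * n_filings
gib_full   <- pairs_full * bytes_per_value / 1024^3

# Same estimate for a single chunk of 100 rows from forfuzzy.
chunk_rows <- 100
gib_chunk  <- chunk_rows * n_filings * bytes_per_value / 1024^3

cat(sprintf("full join:  %.0f GiB per pairwise vector\n", gib_full))
cat(sprintf("100 rows:   %.2f GiB per pairwise vector\n", gib_chunk))
```

Under these assumptions the full comparison needs on the order of 400 GiB for a single vector, the same order of magnitude as the 375GB in the error, whereas a 100-row chunk stays well under 1 GiB.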
I tried the following:
n=100
lst = split(forfuzzy, cumsum((1:nrow(forfuzzy)-1)%%n==0))
df<-as.data.frame(lapply(lst, function(df_)
{
(stringdist_inner_join(df_, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
}
)%>% bind_rows)
I have also tried the above with mclapply instead of lapply. The same error happens even though I tried a high-performance cluster setting with 3 CPUs, each with 480G of memory, and used mclapply with the option mc.cores=3. Perhaps a foreach command could help, but I have no idea how to implement it.
I have been advised to use the purrr and repurrrsive packages, so I tried the following:
purrr::map(lst, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
This seems to be working, after I fixed a novice error in the by=grantee_name statement. However, it is taking forever and I am not sure it will finish. A sample of 100 rows from forfuzzy, with n=10 (so 10 lists with 10 rows each), has been running for 50 minutes with still no results.
Recommended answer
If you split your uniquegrantees data frame into a list of data frames (with base::split or dplyr::group_split), you can then call purrr::map on the list. (map is pretty much lapply.)
purrr::map(list_of_dfs, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Your result will be a list of data frames, each fuzzy-joined with filings. You can then call bind_rows (or use map_dfr instead of map) to get all the results back into a single data frame.
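The split / map / bind_rows steps can be sketched end to end on toy data. This is a minimal illustration, not a tested solution for the full 70k x 790k case. The tables forfuzzy_toy and filings_toy, the chunk_size value, and the filing row "10 THOUSAND WINDOWS INC" with its placeholder ein are all made up for the example; the other names are taken from the head() output in the question.

```r
library(fuzzyjoin)
library(dplyr)
library(purrr)

# Toy stand-ins for the real tables (hypothetical, for illustration only).
forfuzzy_toy <- tibble(
  grantee_name = c("(ICS)2 MAINE CHAPTER", "10 CAN",
                   "10 THOUSAND WINDOWS", "100 BLACK MEN IN CHICAGO INC")
)
filings_toy <- tibble(
  grantee_name = c("ICS-2 MAINE CHAPTER", "10 CAN",
                   "10 THOUSAND WINDOWS INC",       # invented row
                   "100 BLACK MEN IN HOUSTON INC"),
  ein = c(123456, 654987, 111111, 987321)           # 111111 is a placeholder
)

# Assign consecutive rows to chunks of at most chunk_size rows.
chunk_size <- 2
chunk_id   <- ceiling(seq_len(nrow(forfuzzy_toy)) / chunk_size)
chunks     <- split(forfuzzy_toy, chunk_id)

# Fuzzy-join each chunk against the full filings table, then stack the results.
matches <- chunks %>%
  map(~ stringdist_inner_join(.x, filings_toy,
                              by = "grantee_name",
                              method = "jw", p = 0.1,
                              max_dist = 0.1,
                              distance_col = "distance")) %>%
  bind_rows()

print(matches)
```

Because both tables have a grantee_name column, the joined result carries grantee_name.x and grantee_name.y alongside ein and distance. On the real data, chunk_size is the knob to tune: each chunk is still compared against all of filings, so smaller chunks lower peak memory but do not reduce the total number of comparisons.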
See also: R - Splitting a large dataframe into several smaller dataframes, performing fuzzyjoin on each one and outputting to a single dataframe