R中的模糊外部联接/合并 [英] fuzzy outer join/merge in R
问题描述
我有2个数据集,想要进行模糊连接.
这是两个数据集.
I have 2 datasets and want to do fuzzy join.
Here is the two datasets.
library(data.table)
# data1
dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS", header = T)
# data2
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI", header = T)
两个数据集在State
和type
中具有相同的字符;但是,列NAME
不相同.它们是相似的.
尽管我可以用3或4个宪章减去每个数据上的NAME
列,然后将它们合并,但由于观察到的大量数据,似乎正确的比率可能不高.
Two datasets have the same characters in State
and type
; however, columns NAME
are not the same. They are similar.
Although I can subtract columns NAME
on each data with 3 or 4 charters and then merge them, it seems that the correct ratio may not high due to large observations.
dt1$NameSubstr <- substr(dt1$NAME, 1, 4)
dt2$NameSubstr <- substr(dt2$NAME, 1, 4)
merge(dt1, dt2, by = c("NameSubstr", "State", "type"), all = T)
方法不好.
我检查软件包fuzzyjoin
.但不确定我是否正确.
I check for package fuzzyjoin
. But not sure whether I am correct or not.
library(fuzzyjoin)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
# Results
NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: <NA> <NA> <NA> ADA OHIO OH CI
9: <NA> <NA> <NA> ADAMS MASS MA CI
10: <NA> <NA> <NA> ADAMSVILLE ALA AL CI
此练习中的结果是正确的,请参见下文.但是,如果这两个数据中的任何NAME相同,答案将不正确.
我在这两个数据中创建了一个新观察值.
The result in this exercise is correct, see below. But If any NAME in these two data are the same, the answer will not correct.
I create a new observation in these two data.
dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS
THE SAME AA BB
", header = T)
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI
THE SAME AA BB
", header = T)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: THE SAME AA BB <NA> <NA> <NA>
9: <NA> <NA> <NA> ADA OHIO OH CI
10: <NA> <NA> <NA> ADAMS MASS MA CI
11: <NA> <NA> <NA> ADAMSVILLE ALA AL CI
12: <NA> <NA> <NA> THE SAME AA BB
这是不正确的结果. 有什么建议吗?
This is incorrect result. Any suggestion?
似乎我不能使用fuzzy_full_join
.
推荐答案
这是因为您要求Fuzzy_full_join为您提供不匹配的名称(带有!=),然后声明确实匹配的类型和名称(带有== == ).因此,如果所有三个都匹配,它将不会显示.
It is because you asked fuzzy_full_join to give you NAMES that did not match (with !=) and then state and types that did match (with == ==). So if in the case all three do match, it won't show up.
您可以使用以下命令运行两次:
You could run it twice with:
match_fun = list(`!=`, `==`, `==`))
match_fun = list(`==`, `==`, `==`))
library(data.table); library(fuzzyjoin)
#> Warning: package 'data.table' was built under R version 3.5.2
dt1 <- fread("NAME State type
ABERCOMBIETOWNSHIP ND TS
ABERDEENTOWNSHIP NJ TS
ABERDEENTOWNSHIP SD TS
ABBOTSFORDCITY WI CI
ABERDEENCITY WA CI
ADATOWNSHIP MI TS
ADAMS IL TS
THESAME AA BB
", header = T)
dt2 <- fread("NAME State type
ABERDEENTWPNJ NJ TS
ABERDEENWASH WA CI
ABBOTSFORDWIS WI CI
ADATWPMICH MI TS
ADAOHIO OH CI
ADAMSMASS MA CI
ADAMSVILLEALA AL CI
THESAME AA BB
", header = T)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
#> NAME.x State.x type.x NAME.y State.y type.y
#> 1: ABERDEENTOWNSHIP NJ TS ABERDEENTWPNJ NJ TS
#> 2: ABBOTSFORDCITY WI CI ABBOTSFORDWIS WI CI
#> 3: ABERDEENCITY WA CI ABERDEENWASH WA CI
#> 4: ADATOWNSHIP MI TS ADATWPMICH MI TS
#> 5: ABERCOMBIETOWNSHIP ND TS <NA> <NA> <NA>
#> 6: ABERDEENTOWNSHIP SD TS <NA> <NA> <NA>
#> 7: ADAMS IL TS <NA> <NA> <NA>
#> 8: THESAME AA BB <NA> <NA> <NA>
#> 9: <NA> <NA> <NA> ADAOHIO OH CI
#> 10: <NA> <NA> <NA> ADAMSMASS MA CI
#> 11: <NA> <NA> <NA> ADAMSVILLEALA AL CI
#> 12: <NA> <NA> <NA> THESAME AA BB
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`==`, `==`, `==`))
#> NAME.x State.x type.x NAME.y State.y type.y
#> 1: THESAME AA BB THESAME AA BB
#> 2: ABERCOMBIETOWNSHIP ND TS <NA> <NA> <NA>
#> 3: ABERDEENTOWNSHIP NJ TS <NA> <NA> <NA>
#> 4: ABERDEENTOWNSHIP SD TS <NA> <NA> <NA>
#> 5: ABBOTSFORDCITY WI CI <NA> <NA> <NA>
#> 6: ABERDEENCITY WA CI <NA> <NA> <NA>
#> 7: ADATOWNSHIP MI TS <NA> <NA> <NA>
#> 8: ADAMS IL TS <NA> <NA> <NA>
#> 9: <NA> <NA> <NA> ABERDEENTWPNJ NJ TS
#> 10: <NA> <NA> <NA> ABERDEENWASH WA CI
#> 11: <NA> <NA> <NA> ABBOTSFORDWIS WI CI
#> 12: <NA> <NA> <NA> ADATWPMICH MI TS
#> 13: <NA> <NA> <NA> ADAOHIO OH CI
#> 14: <NA> <NA> <NA> ADAMSMASS MA CI
#> 15: <NA> <NA> <NA> ADAMSVILLEALA AL CI
由 reprex软件包(v0.2.1)于2019-03-17创建
这篇关于R中的模糊外部联接/合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!