R中的模糊外部联接/合并 [英] fuzzy outer join/merge in R

查看:162
本文介绍了R中的模糊外部联接/合并的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有2个数据集,想要进行模糊连接.
这是两个数据集.

I have 2 datasets and want to do fuzzy join.
Here is the two datasets.

library(data.table)
# data1
dt1 <- fread("NAME                State type
              ABERCOMBIE TOWNSHIP   ND  TS
              ABERDEEN TOWNSHIP     NJ  TS
              ABERDEEN TOWNSHIP     SD  TS
              ABBOTSFORD CITY       WI  CI
              ABERDEEN CITY         WA  CI
              ADA TOWNSHIP          MI  TS
              ADAMS                 IL  TS", header = T)
# data2
dt2 <- fread("NAME               State  type
              ABERDEEN TWP N J    NJ    TS
              ABERDEEN WASH       WA    CI
              ABBOTSFORD WIS      WI    CI
              ADA TWP MICH        MI    TS
              ADA OHIO            OH    CI
              ADAMS MASS          MA    CI
              ADAMSVILLE ALA      AL    CI", header = T)

两个数据集在Statetype中具有相同的字符;但是,列NAME不相同.它们是相似的.
尽管我可以用3或4个宪章减去每个数据上的NAME列,然后将它们合并,但由于观察到的大量数据,似乎正确的比率可能不高.

Two datasets have the same characters in State and type; however, columns NAME are not the same. They are similar.
Although I can subtract columns NAME on each data with 3 or 4 charters and then merge them, it seems that the correct ratio may not high due to large observations.

dt1$NameSubstr <- substr(dt1$NAME, 1, 4)
dt2$NameSubstr <- substr(dt2$NAME, 1, 4)
merge(dt1, dt2, by = c("NameSubstr", "State", "type"), all = T)

方法不好.

我检查软件包fuzzyjoin.但不确定我是否正确.

I check for package fuzzyjoin. But not sure whether I am correct or not.

library(fuzzyjoin)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))  

# Results
                 NAME.x State.x type.x           NAME.y State.y type.y
 1:   ABERDEEN TOWNSHIP      NJ     TS ABERDEEN TWP N J      NJ     TS
 2:     ABBOTSFORD CITY      WI     CI   ABBOTSFORD WIS      WI     CI
 3:       ABERDEEN CITY      WA     CI    ABERDEEN WASH      WA     CI
 4:        ADA TOWNSHIP      MI     TS     ADA TWP MICH      MI     TS
 5: ABERCOMBIE TOWNSHIP      ND     TS             <NA>    <NA>   <NA>
 6:   ABERDEEN TOWNSHIP      SD     TS             <NA>    <NA>   <NA>
 7:               ADAMS      IL     TS             <NA>    <NA>   <NA>
 8:                <NA>    <NA>   <NA>         ADA OHIO      OH     CI
 9:                <NA>    <NA>   <NA>       ADAMS MASS      MA     CI
10:                <NA>    <NA>   <NA>   ADAMSVILLE ALA      AL     CI

此练习中的结果是正确的,请参见下文.但是,如果这两个数据中的任何NAME相同,答案将不正确.
我在这两个数据中创建了一个新观察值.

The result in this exercise is correct, see below. But If any NAME in these two data are the same, the answer will not correct.
I create a new observation in these two data.

dt1 <- fread("NAME  State   type
         ABERCOMBIE TOWNSHIP    ND  TS
         ABERDEEN TOWNSHIP  NJ  TS
         ABERDEEN TOWNSHIP  SD  TS
         ABBOTSFORD CITY    WI  CI
         ABERDEEN CITY  WA  CI
         ADA TOWNSHIP   MI  TS
         ADAMS  IL  TS
         THE SAME   AA  BB
         ", header = T)
dt2 <- fread("NAME  State   type
         ABERDEEN TWP N J   NJ  TS
         ABERDEEN WASH  WA  CI
         ABBOTSFORD WIS WI  CI
         ADA TWP MICH   MI  TS
         ADA OHIO   OH  CI
         ADAMS MASS MA  CI
         ADAMSVILLE ALA AL  CI
         THE SAME   AA  BB
         ", header = T)

fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))

                 NAME.x State.x type.x           NAME.y State.y type.y
 1:   ABERDEEN TOWNSHIP      NJ     TS ABERDEEN TWP N J      NJ     TS
 2:     ABBOTSFORD CITY      WI     CI   ABBOTSFORD WIS      WI     CI
 3:       ABERDEEN CITY      WA     CI    ABERDEEN WASH      WA     CI
 4:        ADA TOWNSHIP      MI     TS     ADA TWP MICH      MI     TS
 5: ABERCOMBIE TOWNSHIP      ND     TS             <NA>    <NA>   <NA>
 6:   ABERDEEN TOWNSHIP      SD     TS             <NA>    <NA>   <NA>
 7:               ADAMS      IL     TS             <NA>    <NA>   <NA>
 8:            THE SAME      AA     BB             <NA>    <NA>   <NA>
 9:                <NA>    <NA>   <NA>         ADA OHIO      OH     CI
10:                <NA>    <NA>   <NA>       ADAMS MASS      MA     CI
11:                <NA>    <NA>   <NA>   ADAMSVILLE ALA      AL     CI
12:                <NA>    <NA>   <NA>         THE SAME      AA     BB

这是不正确的结果. 有什么建议吗?

This is incorrect result. Any suggestion?

似乎我不能使用fuzzy_full_join.

推荐答案

这是因为您要求Fuzzy_full_join为您提供不匹配的名称(带有!=),然后声明确实匹配的类型和名称(带有== == ).因此,如果所有三个都匹配,它将不会显示.

It is because you asked fuzzy_full_join to give you NAMES that did not match (with !=) and then state and types that did match (with == ==). So if in the case all three do match, it won't show up.

您可以使用以下命令运行两次:

You could run it twice with:

match_fun = list(`!=`, `==`, `==`))
match_fun = list(`==`, `==`, `==`))

library(data.table); library(fuzzyjoin)
#> Warning: package 'data.table' was built under R version 3.5.2

dt1 <- fread("NAME   State   type
             ABERCOMBIETOWNSHIP    ND  TS
             ABERDEENTOWNSHIP  NJ  TS
             ABERDEENTOWNSHIP  SD  TS
             ABBOTSFORDCITY    WI  CI
             ABERDEENCITY  WA  CI
             ADATOWNSHIP   MI  TS
             ADAMS IL  TS
             THESAME   AA  BB
             ", header = T)
dt2 <- fread("NAME  State   type
             ABERDEENTWPNJ   NJ  TS
             ABERDEENWASH  WA  CI
             ABBOTSFORDWIS WI  CI
             ADATWPMICH   MI  TS
             ADAOHIO   OH  CI
             ADAMSMASS MA  CI
             ADAMSVILLEALA AL  CI
             THESAME   AA  BB
             ", header = T)

fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
#>                 NAME.x State.x type.x        NAME.y State.y type.y
#>  1:   ABERDEENTOWNSHIP      NJ     TS ABERDEENTWPNJ      NJ     TS
#>  2:     ABBOTSFORDCITY      WI     CI ABBOTSFORDWIS      WI     CI
#>  3:       ABERDEENCITY      WA     CI  ABERDEENWASH      WA     CI
#>  4:        ADATOWNSHIP      MI     TS    ADATWPMICH      MI     TS
#>  5: ABERCOMBIETOWNSHIP      ND     TS          <NA>    <NA>   <NA>
#>  6:   ABERDEENTOWNSHIP      SD     TS          <NA>    <NA>   <NA>
#>  7:              ADAMS      IL     TS          <NA>    <NA>   <NA>
#>  8:            THESAME      AA     BB          <NA>    <NA>   <NA>
#>  9:               <NA>    <NA>   <NA>       ADAOHIO      OH     CI
#> 10:               <NA>    <NA>   <NA>     ADAMSMASS      MA     CI
#> 11:               <NA>    <NA>   <NA> ADAMSVILLEALA      AL     CI
#> 12:               <NA>    <NA>   <NA>       THESAME      AA     BB

fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`==`, `==`, `==`))
#>                 NAME.x State.x type.x        NAME.y State.y type.y
#>  1:            THESAME      AA     BB       THESAME      AA     BB
#>  2: ABERCOMBIETOWNSHIP      ND     TS          <NA>    <NA>   <NA>
#>  3:   ABERDEENTOWNSHIP      NJ     TS          <NA>    <NA>   <NA>
#>  4:   ABERDEENTOWNSHIP      SD     TS          <NA>    <NA>   <NA>
#>  5:     ABBOTSFORDCITY      WI     CI          <NA>    <NA>   <NA>
#>  6:       ABERDEENCITY      WA     CI          <NA>    <NA>   <NA>
#>  7:        ADATOWNSHIP      MI     TS          <NA>    <NA>   <NA>
#>  8:              ADAMS      IL     TS          <NA>    <NA>   <NA>
#>  9:               <NA>    <NA>   <NA> ABERDEENTWPNJ      NJ     TS
#> 10:               <NA>    <NA>   <NA>  ABERDEENWASH      WA     CI
#> 11:               <NA>    <NA>   <NA> ABBOTSFORDWIS      WI     CI
#> 12:               <NA>    <NA>   <NA>    ADATWPMICH      MI     TS
#> 13:               <NA>    <NA>   <NA>       ADAOHIO      OH     CI
#> 14:               <NA>    <NA>   <NA>     ADAMSMASS      MA     CI
#> 15:               <NA>    <NA>   <NA> ADAMSVILLEALA      AL     CI

reprex软件包(v0.2.1)于2019-03-17创建

这篇关于R中的模糊外部联接/合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆