使用模糊逻辑连接两个数据集 [英] Joining two datasets using fuzzy logic

查看:124
本文介绍了使用模糊逻辑连接两个数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在两个数据集之间的R中进行模糊逻辑联接:

I’m trying to do a fuzzy logic join in R between two datasets:

  • 第一个数据集具有一个位置的名称和一个名为config
  • 的列
  • 第二个数据集具有一个位置的名称和两个附加属性,在将它们加入第一个数据集之前需要对其进行汇总.
  • first data set has the name of a location and a column called config
  • second data set has the name of a location and two additional attributes that need to be summarized before they are joined to the first data set.

我想使用name列在两个数据集之间进行联接.但是,name列在数据集中可能包含其他字符或前导字符,或者在较大的单词内部包含一个单词.因此,例如,如果我们查看这两个数据集,我希望名称OPAL加入OPALAS,而SAUSALITO Y加入SAUSALITO.

I would like to use the name column to join between the two data sets. However the name column may have additional or leading characters in either data set or have one word contained inside of a larger word. So for example if we looked at these two data sets, I'd like the name OPAL to join to the OPALAS, and SAUSALITO Y to join to SAUSALITO.

Dataset1:    
     Name           Config
     ALTO D         BB
     CONTRA         ST
     EIGHT A        DD
     OPALAS         BB
     SAUSALITO Y    AA
     SOLANO J       ST

Dataset2:    
    Name       Age     Rank
    ALTO D     50      2
    ALTO D     20      6
    CONTRA     10      10
    CONTRA     15      15
    EIGHTH     18      21
    OPAL       19      4
    SAUSALITO  2       12
    SOLANO     34      43

数据集2汇总代码

Data2a <- summaryBy(Age ~ Name,FUN=c(mean), data=Data2,na.rm=TRUE)
Data2b <- summaryBy(Rank ~ Name,FUN=c(sum), data=Data2,na.rm=TRUE)
Data2 <- data.frame(Data2a$Name, Data2a$Age.mean, Data2b$Rank.sum)

Desired Outcome:
    Name        Config  Age   Rank
    ALTO D      BB      35    8
    CONTRA      ST      12.5  25
    EIGHT A     DD      18    21
    OPALAS      BB      19    4
    SAUSALITO Y AA      12    5
    SOLANO J    ST      34    43

推荐答案

我能够使用Fuzzyjoin包将两个数据集结合起来:

I was able to join the two datasets, using the fuzzyjoin package:

library(fuzzyjoin)
stringdist_inner_join(Dataset1, Data2,
     by ="Name", distance_col = NULL)

这篇关于使用模糊逻辑连接两个数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆