进行“模糊"处理而且不模糊,与data.table多对1合并 [英] Doing a "fuzzy" and non-fuzzy, many to 1 merge with data.table

查看:84
本文介绍了进行“模糊"处理而且不模糊,与data.table多对1合并的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我有两个数据库dfAdfB.一个具有单独的观察值,一个具有国家/地区级别的数据(适用于来自同一年和国家/地区的多个观察值).对于这些数据库中的每一个,我都创建了一个名为matchcode的键.此匹配代码是国家/地区代码和年份的组合.

Lets assume I have two databases dfA and dfB. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.

   dfA <- read.table(
  text = "A   B   C   D   E   F   G   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2010   NLD2010
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2010   AUS2010
  4   1   0   1   0   0   1   0   AUS   2006   AUS2006
  5   0   1   0   1   0   1   1   USA   2008   USA2008
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2012   USA2012
  8   1   0   1   0   0   1   0   BLG   2008   BLG2008
  9   0   1   0   1   1   0   1   BEL   2008   BEL2008
  10  1   0   1   0   0   1   0   BEL   2010   BEL2010
  11  0   1   1   1   0   1   0   NLD   2010   NLD2010
  12  1   0   0   0   1   0   1   NLD   2014   NLD2014
  13  0   0   0   1   1   0   0   AUS   2010   AUS2010
  14  1   0   1   0   0   1   0   AUS   2006   AUS2006
  15  0   1   0   1   0   1   1   USA   2008   USA2008
  16  0   0   1   0   0   0   1   USA   2010   USA2010
  17  0   1   0   1   0   0   0   USA   2012   USA2012
  18  1   0   1   0   0   1   0   BLG   2008   BLG2008
  19  0   1   0   1   1   0   1   BEL   2008   BEL2008
  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE
)

   dfB <- read.table(
  text = "A   B   C   D   H   I   J   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2009   NLD2009
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2011   AUS2011
  4   1   0   1   0   0   1   0   AUS   2007   AUS2007
  5   0   1   0   1   0   1   1   USA   2007   USA2007
  6   0   0   1   0   0   0   1   USA   2011   USA2010
  7   0   1   0   1   0   0   0   USA   2013   USA2013
  8   1   0   1   0   0   1   0   BLG   2007   BLG2007
  9   0   1   0   1   1   0   1   BEL   2009   BEL2009
  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)

library(data.table)
setDT(dfA)
setDT(dfB)

大多数情况下,当我合并这些数据集时,我只会做:

Mostly when I merge these datasets I simply do:

dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)

问题在于有时年份不完全匹配.所以我尝试了:

The problem is that sometimes the years do not completely match. So I tried:

dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]

但是这将观测值减少到11.

But this reduces the amount of observations to 11.

# A tibble: 11 x 18
       A     B     C     D     E     F     G iso    year matchcode     K     L     M     N     O     P     Q i.matchcode
   <int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct>     <int> <int> <int> <int> <int> <int> <int> <fct>      
 1     0     1     1     1     0     1     0 NLD    2009 NLD2010       0     1     1     1     0     1     0 NLD2009    
 2     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    
 3     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    
 4     0     0     0     1     1     0     0 AUS    2011 AUS2010       0     0     0     1     1     0     0 AUS2011    
 5     1     0     1     0     0     1     0 AUS    2007 AUS2006       1     0     1     0     0     1     0 AUS2007    
 6     0     1     0     1     0     1     1 USA    2007 USA2008       0     1     0     1     0     1     1 USA2007    
 7     0     0     1     0     0     0     1 USA    2011 USA2010       0     0     1     0     0     0     1 USA2010    
 8     0     1     0     1     0     0     0 USA    2013 USA2012       0     1     0     1     0     0     0 USA2013    
 9     1     0     1     0     0     1     0 BLG    2007 BLG2008       1     0     1     0     0     1     0 BLG2007    
10     0     1     0     1     1     0     1 BEL    2009 BEL2008       0     1     0     1     1     0     1 BEL2009    
11     1     0     1     0     0     1     0 BEL    2012 BEL2010       1     0     1     0     0     1     0 BEL2012   

首选输出如下:

#    A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014
# 2: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011
# 3: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007
# 4: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010
# 5: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013
# 6: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009
# 7: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009
# 8: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007
# 9: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010
#10: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009
#11: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014
#12: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011
#13: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007
#14: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010
#15: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013
#16: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009
#17: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009
#18: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007
#19: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010
#20: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

其他来源:

1.上一次尝试

推荐答案

对于这种连接,使用data.table

代码

library( data.table )

#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)

#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]

#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]

#set column order
setcolorder(result, colorder)

结果

#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
#  1: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009
#  2: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014
#  3: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011
#  4: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007
#  5: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007
#  6: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010
#  7: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010
#  8: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007
#  9: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009
# 10: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009
# 11: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009
# 12: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014
# 13: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011
# 14: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007
# 15: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007
# 16: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010
# 17: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010
# 18: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007
# 19: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009
# 20: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

样本数据

dfA <- fread(
  "A   B   C   D   E   F   G   iso   year   matchcode
  0   1   1   1   0   1   0   NLD   2010   NLD2010
     1   0   0   0   1   0   1   NLD   2014   NLD2014
     0   0   0   1   1   0   0   AUS   2010   AUS2010
     1   0   1   0   0   1   0   AUS   2006   AUS2006
     0   1   0   1   0   1   1   USA   2008   USA2008
     0   0   1   0   0   0   1   USA   2010   USA2010
     0   1   0   1   0   0   0   USA   2012   USA2012
     1   0   1   0   0   1   0   BLG   2008   BLG2008
     0   1   0   1   1   0   1   BEL   2008   BEL2008
    1   0   1   0   0   1   0   BEL   2010   BEL2010
    0   1   1   1   0   1   0   NLD   2010   NLD2010
    1   0   0   0   1   0   1   NLD   2014   NLD2014
    0   0   0   1   1   0   0   AUS   2010   AUS2010
    1   0   1   0   0   1   0   AUS   2006   AUS2006
    0   1   0   1   0   1   1   USA   2008   USA2008
    0   0   1   0   0   0   1   USA   2010   USA2010
    0   1   0   1   0   0   0   USA   2012   USA2012
    1   0   1   0   0   1   0   BLG   2008   BLG2008
    0   1   0   1   1   0   1   BEL   2008   BEL2008
    1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE
)


dfB <- fread(
  "A   B   C   D   H   I   J   iso   year   matchcode
     0   1   1   1   0   1   0   NLD   2009   NLD2009
     1   0   0   0   1   0   1   NLD   2014   NLD2014
     0   0   0   1   1   0   0   AUS   2011   AUS2011
     1   0   1   0   0   1   0   AUS   2007   AUS2007
     0   1   0   1   0   1   1   USA   2007   USA2007
     0   0   1   0   0   0   1   USA   2011   USA2010
     0   1   0   1   0   0   0   USA   2013   USA2013
     1   0   1   0   0   1   0   BLG   2007   BLG2007
     0   1   0   1   1   0   1   BEL   2009   BEL2009
     1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)

这篇关于进行“模糊"处理而且不模糊,与data.table多对1合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆