Merge 2 dataframes by matching dates

Problem description

I have two dataframes:

id      dates
MUM-1  2015-07-10
MUM-1  2015-07-11
MUM-1  2015-07-12
MUM-2  2014-01-14
MUM-2  2014-01-15
MUM-2  2014-01-16
MUM-2  2014-01-17

and:

id      dates      field1  field2
MUM-1  2015-07-10     1       0
MUM-1  2015-07-12     2       1
MUM-2  2014-01-14     4       3
MUM-2  2014-01-17     0       1

The merged data should look like this:

id      dates        field1   field2
MUM-1  2015-07-10      1         0
MUM-1  2015-07-11      na        na
MUM-1  2015-07-12      2         1
MUM-2  2014-01-14      4         3
MUM-2  2014-01-15      na        na
MUM-2  2014-01-16      na        na
MUM-2  2014-01-17      0         1   

Code: merge(x= df1, y= df2, by= 'id', all.x= T)

I am using merge, but because both dataframes are very large, it takes too long to process. Is there a faster alternative to the merge function, maybe in dplyr? Both dataframes have more than 900K rows.

Solution

Instead of using merge with data.table, you can also simply join as follows:

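library(data.table)
setDT(df1)  # convert to data.table by reference
setDT(df2)

# for each row of df1, look up the matching df2 row on both keys
df2[df1, on = c('id','dates')]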

This gives:

> df2[df1, on = c('id','dates')]
      id      dates field1 field2
1: MUM-1 2015-07-10      1      0
2: MUM-1 2015-07-11     NA     NA
3: MUM-1 2015-07-12      2      1
4: MUM-2 2014-01-14      4      3
5: MUM-2 2014-01-15     NA     NA
6: MUM-2 2014-01-16     NA     NA
7: MUM-2 2014-01-17      0      1
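
Note that the join above matches on both id and dates, whereas the merge call in the question uses by = 'id' only. The base R sketch below, built from the question's sample data, illustrates how the row counts differ between the two. The object names by_id and by_both are illustrative, and whether this accounts for the slowness on the asker's real data is an assumption.

```r
# Sample data copied from the question
df1 <- data.frame(
  id    = c("MUM-1", "MUM-1", "MUM-1", "MUM-2", "MUM-2", "MUM-2", "MUM-2"),
  dates = as.Date(c("2015-07-10", "2015-07-11", "2015-07-12",
                    "2014-01-14", "2014-01-15", "2014-01-16", "2014-01-17"))
)
df2 <- data.frame(
  id     = c("MUM-1", "MUM-1", "MUM-2", "MUM-2"),
  dates  = as.Date(c("2015-07-10", "2015-07-12", "2014-01-14", "2014-01-17")),
  field1 = c(1, 2, 4, 0),
  field2 = c(0, 1, 3, 1)
)

# Joining on id alone pairs every df1 row with every df2 row of the same id
by_id <- merge(df1, df2, by = "id", all.x = TRUE)
nrow(by_id)    # 3*2 + 4*2 = 14 rows

# Joining on both keys keeps one row per (id, dates) pair of df1
by_both <- merge(df1, df2, by = c("id", "dates"), all.x = TRUE)
nrow(by_both)  # 7 rows, with NA where df2 has no matching date
```

With many dates per id, the id-only join grows roughly with the product of the per-id row counts, so joining on both columns is worth checking before switching packages.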

Doing this with dplyr:

library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))

As mentioned by @Arun in the comments, a benchmark is not very meaningful on a small dataset with seven rows. So let's create some bigger datasets:

dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
                  field1=sample(c(0,1,2,3,4), size=730, replace = TRUE),
                  field2=sample(c(0,1,2,3,4), size=730, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]

As can be seen, @Arun's approach is slightly faster:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          jaap = dt2[dt1, on = c('id','dates')],
          pavo = merge(dt1,dt2,by="id",allow.cartesian=TRUE),
          dplr = left_join(dt1, dt2, by=c("id","dates")),
          arun = dt1[dt2, c("field1", "field2") := .(field1, field2), on=c("id", "dates")])

  test elapsed relative
4 arun   0.015    1.000
1 jaap   0.016    1.067
3 dplr   0.037    2.467
2 pavo   1.033   68.867
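
The arun entry differs from the other three in that it adds the columns to dt1 by reference instead of building a new table. A minimal sketch of that update join on a few rows from the question is shown below. The name res, the use of copy(), and the explicit i. prefix are choices made for this illustration, not part of the benchmarked code.

```r
library(data.table)

dt1 <- data.table(
  id    = c("MUM-1", "MUM-1", "MUM-1"),
  dates = as.Date(c("2015-07-10", "2015-07-11", "2015-07-12"))
)
dt2 <- data.table(
  id     = c("MUM-1", "MUM-1"),
  dates  = as.Date(c("2015-07-10", "2015-07-12")),
  field1 = c(1, 2),
  field2 = c(0, 1)
)

# Work on a copy so the original dt1 is left unchanged
res <- copy(dt1)

# Update join: rows of res that match dt2 on both keys receive dt2's values.
# The i. prefix refers to columns of the table in the i position (dt2 here).
res[dt2, c("field1", "field2") := .(i.field1, i.field2), on = c("id", "dates")]

res  # the 2015-07-11 row has no match in dt2, so its new columns are NA
```

Because := modifies the table in place, running it directly on dt1 inside a repeated benchmark changes dt1 for every later replication, which is worth keeping in mind when reading the timings.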

For a comparison on a large dataset, see the answer of @Arun.
