Aggregate one data frame by time intervals from another data frame
Question
I'm trying to aggregate two data frames (df1 and df2).
The first contains 3 variables: ID, Date1 and Date2.
df1
ID Date1 Date2
1 2016-03-01 2016-04-01
1 2016-04-01 2016-05-01
2 2016-03-14 2016-04-15
2 2016-04-15 2016-05-17
3 2016-05-01 2016-06-10
3 2016-06-10 2016-07-15
The second also contains 3 variables: ID, Date3 and Value.
df2
ID Date3 Value
1 2016-03-15 5
1 2016-04-04 7
1 2016-04-28 7
2 2016-03-18 3
2 2016-03-27 5
2 2016-04-08 9
2 2016-04-20 2
3 2016-05-05 6
3 2016-05-25 8
3 2016-06-13 3
The idea is to get, for each df1 row, the sum of the df2$Value entries that have the same ID and whose Date3 falls between Date1 and Date2:
ID Date1 Date2 SumValue
1 2016-03-01 2016-04-01 5
1 2016-04-01 2016-05-01 14
2 2016-03-14 2016-04-15 17
2 2016-04-15 2016-05-17 2
3 2016-05-01 2016-06-10 14
3 2016-06-10 2016-07-15 3
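For reference, the rule above can be written as a direct row-by-row computation in base R. This is only a sketch for small inputs, not a scalable solution: it rebuilds the sample data shown above and assumes "between" is inclusive at both ends (the sample data has no Date3 that falls exactly on an interval boundary, so the choice does not affect the result here).

```r
# Sample data from the question, rebuilt so the snippet is self-contained.
df1 <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3),
  Date1 = as.Date(c("2016-03-01", "2016-04-01", "2016-03-14",
                    "2016-04-15", "2016-05-01", "2016-06-10")),
  Date2 = as.Date(c("2016-04-01", "2016-05-01", "2016-04-15",
                    "2016-05-17", "2016-06-10", "2016-07-15"))
)
df2 <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  Date3 = as.Date(c("2016-03-15", "2016-04-04", "2016-04-28", "2016-03-18",
                    "2016-03-27", "2016-04-08", "2016-04-20", "2016-05-05",
                    "2016-05-25", "2016-06-13")),
  Value = c(5, 7, 7, 3, 5, 9, 2, 6, 8, 3)
)

# For each df1 row, sum the df2 values with the same ID whose Date3 lies
# in [Date1, Date2] (inclusive bounds are an assumption, see above).
df1$SumValue <- vapply(seq_len(nrow(df1)), function(i) {
  hit <- df2$ID == df1$ID[i] &
    df2$Date3 >= df1$Date1[i] &
    df2$Date3 <= df1$Date2[i]
  sum(df2$Value[hit])
}, numeric(1))

print(df1)
```

Each df1 row scans all of df2, so the cost grows with nrow(df1) * nrow(df2), which is why this does not suit huge data frames. With inclusive bounds, a Date3 equal to a shared boundary such as 2016-04-01 would be counted in both adjacent intervals.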
I know how to write a loop for this, but the data frames are huge! Does anyone have an efficient solution? I have been exploring data.table, plyr and dplyr but could not find a solution.
Recommended answer
A couple of data.table solutions that should scale well (and a good stop-gap until non-equi joins are implemented):
Doing the comparison in j using by = .EACHI
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
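# For each df2 row (i), join to the df1 rows with the same ID and keep the
# interval(s) containing Date3; then sum Value per interval.
df1[df2,
    {
      idx = Date1 <= i.Date3 & i.Date3 <= Date2
      .(Date1 = Date1[idx],
        Date2 = Date2[idx],
        Date3 = i.Date3,
        Value = i.Value)
    },
    on = c("ID"),
    by = .EACHI][, .(sumValue = sum(Value)), by = .(ID, Date1, Date2)]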
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
foverlaps join (as suggested in the comments)
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df2[, Date4 := Date3]
setkey(df1, ID, Date1, Date2)
# type = "within": the zero-width range [Date3, Date4] must lie inside [Date1, Date2]
foverlaps(df2,
          df1,
          by.x = c("ID", "Date3", "Date4"),
          type = "within")[, .(sumValue = sum(Value)), by = .(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
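The answer above was written as a stop-gap until non-equi joins arrived. As far as I know, data.table 1.9.8 and later support inequality conditions directly in on=, so the same aggregation can be sketched as a single join. This is an illustration under that version assumption, not part of the original answer; the sample data is rebuilt from the question and the final column renaming is my own addition.

```r
library(data.table)

# Sample data from the question, rebuilt so the snippet is self-contained.
df1 <- data.table(
  ID    = c(1, 1, 2, 2, 3, 3),
  Date1 = as.Date(c("2016-03-01", "2016-04-01", "2016-03-14",
                    "2016-04-15", "2016-05-01", "2016-06-10")),
  Date2 = as.Date(c("2016-04-01", "2016-05-01", "2016-04-15",
                    "2016-05-17", "2016-06-10", "2016-07-15"))
)
df2 <- data.table(
  ID    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  Date3 = as.Date(c("2016-03-15", "2016-04-04", "2016-04-28", "2016-03-18",
                    "2016-03-27", "2016-04-08", "2016-04-20", "2016-05-05",
                    "2016-05-25", "2016-06-13")),
  Value = c(5, 7, 7, 3, 5, 9, 2, 6, 8, 3)
)

# Non-equi join: for each df1 row (i), match the df2 rows with the same ID
# and Date1 <= Date3 <= Date2, then sum Value within each match group.
res <- df2[df1,
           .(sumValue = sum(Value)),
           on = .(ID, Date3 >= Date1, Date3 <= Date2),
           by = .EACHI]

# The two join columns come back named after df2's column (Date3) while
# holding df1's bounds, so rename all columns by position.
setnames(res, c("ID", "Date1", "Date2", "sumValue"))
print(res)
```

On the sample data this should give the same six sums as the two solutions above. A df1 interval with no matching df2 rows would come back with sumValue NA rather than being dropped, which differs from the earlier approaches.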