dplyr or data.table to calculate time series aggregations in R
Question
I'm trying to summarize a data.frame which contains date (or time) information.
Suppose this one contains hospitalization records by patient:
df <- data.frame(c(1, 2, 1, 1, 2, 2),
c(as.Date("2013/10/15"), as.Date("2014/10/15"), as.Date("2015/7/16"), as.Date("2016/1/7"), as.Date("2015/12/20"), as.Date("2015/12/25")))
names(df) <- c("patient.id", "hospitalization.date")
df
It looks like this:
> df
patient.id hospitalization.date
1 1 2013-10-15
2 2 2014-10-15
3 1 2015-07-16
4 1 2016-01-07
5 2 2015-12-20
6 2 2015-12-25
For each observation, I need to count the number of hospitalizations occurring in the 365 days before that hospitalization.
In my example it would be the new df$hospitalizations.last.year column.
> df
patient.id hospitalization.date hospitalizations.last.year
1 1 2013-10-15 1
2 2 2014-10-15 1
3 1 2015-07-16 1
4 2 2015-12-20 1
5 2 2015-12-25 2
6 1 2016-01-07 2
7 2 2016-02-10 3
Note that the counter covers all of the patient's records from the previous 365 days, not only those in the current calendar year.
I'm trying to do that using dplyr or data.table because my dataset is huge and performance matters. Is it possible?
Recommended answer
Since version 1.9.8 (on CRAN 25 Nov 2016), data.table offers non-equi joins:
library(data.table)
# coerce to data.table
setDT(df)[
# create helper column
, date_365 := hospitalization.date - 365][
# step 1: non-equi self-join
df, on = c("patient.id", "hospitalization.date>=date_365",
"hospitalization.date<=hospitalization.date")][
# step 2: count hospitalizations.last.year for each patient
, .(hospitalizations.last.year = .N),
by = .(patient.id, hospitalization.date = hospitalization.date.1)]
patient.id hospitalization.date hospitalizations.last.year
1: 1 2013-10-15 1
2: 2 2014-10-15 1
3: 1 2015-07-16 1
4: 2 2015-12-20 1
5: 2 2015-12-25 2
6: 1 2016-01-07 2
7: 2 2016-02-10 3
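As a cross-check on the counting rule, here is a minimal base R sketch that counts, for each row, the same patient's records dated within the 365 days up to and including that row's date. The helper name count_last_year and its window argument are hypothetical, introduced only for illustration. It compares every row against every other row, so it is meant for verifying results on small samples, not for the huge data set the question mentions.

```r
# 7-row sample data, as in the expected result of the question
df <- data.frame(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013-10-15", "2014-10-15", "2015-07-16",
    "2016-01-07", "2015-12-20", "2015-12-25", "2016-02-10")))
df <- df[order(df$hospitalization.date), ]

# hypothetical helper: for each record, count the same patient's records
# dated within [date - window, date], including the record itself
count_last_year <- function(id, date, window = 365) {
  vapply(seq_along(date), function(i) {
    sum(id == id[i] & date >= date[i] - window & date <= date[i])
  }, integer(1))
}

df$hospitalizations.last.year <-
  count_last_year(df$patient.id, df$hospitalization.date)
df
```

Each record counts itself, which is why the first hospitalization of every patient gets a 1, as in the output above.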
Edit: Join and aggregation can be combined in one step:
# coerce to data.table
setDT(df)[
# create helper column
, date_365 := hospitalization.date - 365][
# non-equi self-join
df, on = c("patient.id", "hospitalization.date>=date_365",
"hospitalization.date<=hospitalization.date"),
# count hospitalizations.last.year grouped by join parameters
.(hospitalizations.last.year = .N), by = .EACHI][
# remove duplicate column
, hospitalization.date := NULL][]
The result is the same as above.
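Since the question also mentions dplyr, the same idea can be sketched there. This is not part of the original answer and assumes dplyr 1.1.0 or later, which added join_by() for inequality joins. The helper column date_365 mirrors the one used above, and the result name counts is a hypothetical choice.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

# 7-row sample data, as in the expected result of the question
df <- data.frame(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013-10-15", "2014-10-15", "2015-07-16",
    "2016-01-07", "2015-12-20", "2015-12-25", "2016-02-10")))

counts <- df %>%
  # helper column: start of the 365-day window
  mutate(date_365 = hospitalization.date - 365) %>%
  # non-equi self-join: same patient, other record's date inside the window
  inner_join(df, by = join_by(patient.id,
                              x$date_365 <= y$hospitalization.date,
                              x$hospitalization.date >= y$hospitalization.date)) %>%
  # one output row per original record, with the number of matches
  count(patient.id, hospitalization.date = hospitalization.date.x,
        name = "hospitalizations.last.year") %>%
  arrange(hospitalization.date)
counts
```

How this compares with the data.table version in speed on large data is not something the thread establishes, so it would need benchmarking on the real data set.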
Data
The OP has provided two data sets with 6 and 7 rows, respectively. Here, the data set with 7 rows is used, as it is the one posted as the expected result:
df <- data.frame(
patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16",
"2016/1/7", "2015/12/20", "2015/12/25", "2016/2/10")))
df <- df[order(df$hospitalization.date), ]