dplyr or data.table to calculate time series aggregations in R


Question

I'm trying to summarize a data.frame which contains date (or time) information.

Suppose this one contains hospitalization records by patient:

df <- data.frame(
  patient.id = c(1, 2, 1, 1, 2, 2),
  hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16",
                                   "2016/1/7", "2015/12/20", "2015/12/25"))
)

df looks like this:

> df
      patient.id hospitalization.date
    1          1           2013-10-15
    2          2           2014-10-15
    3          1           2015-07-16
    4          1           2016-01-07
    5          2           2015-12-20
    6          2           2015-12-25

For each observation, I need to count the number of hospitalizations of the same patient occurring in the 365 days up to and including that hospitalization (the expected output below counts the current record too).

In my example it would be the new df$hospitalizations.last.year column:

> df
      patient.id hospitalization.date hospitalizations.last.year
    1          1           2013-10-15                          1
    2          2           2014-10-15                          1
    3          1           2015-07-16                          1
    4          2           2015-12-20                          1
    5          2           2015-12-25                          2
    6          1           2016-01-07                          2
    7          2           2016-02-10                          3

Note that the counter covers all previous records within the last 365 days, not only those in the current calendar year.

I'm trying to do that using dplyr or data.table because my dataset is huge and performance matters. Is it possible?

Recommended answer

Since version 1.9.8 (on CRAN 25 Nov 2016), data.table offers non-equi joins:

library(data.table)
# coerce to data.table
setDT(df)[
  # create helper column
  , date_365 := hospitalization.date - 365][
    # step1: non-equi self-join
    df, on = c("patient.id", "hospitalization.date>=date_365", 
               "hospitalization.date<=hospitalization.date")][
      # step 2: count hospitalizations.last.year for each patient
      , .(hospitalizations.last.year = .N), 
      by = .(patient.id, hospitalization.date = hospitalization.date.1)]



   patient.id hospitalization.date hospitalizations.last.year
1:          1           2013-10-15                          1
2:          2           2014-10-15                          1
3:          1           2015-07-16                          1
4:          2           2015-12-20                          1
5:          2           2015-12-25                          2
6:          1           2016-01-07                          2
7:          2           2016-02-10                          3





Edit: Join and aggregation can be combined in one step:

# coerce to data.table
setDT(df)[
  # create helper column
  , date_365 := hospitalization.date - 365][
    # non-equi self-join
    df, on = c("patient.id", "hospitalization.date>=date_365", 
               "hospitalization.date<=hospitalization.date"), 
    # count hospitalizations.last.year grouped by join parameters
    .(hospitalizations.last.year = .N), by = .EACHI][
      # remove duplicate column
      , hospitalization.date := NULL][]

The result is the same as above.
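To make the counting rule explicit, here is a minimal base R cross-check. It is not part of the original answer and is far too slow for a huge data set, since it scans every row for every row. It only spells out the window the join conditions above appear to implement: the same patient, dates within [date - 365, date], current record included. The helper name `count_last_year` is hypothetical, and the data is the 7-row sample used in this answer.

```r
# 7-row sample data, ordered by date as in the answer
df <- data.frame(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013-10-15", "2014-10-15", "2015-07-16",
                                   "2016-01-07", "2015-12-20", "2015-12-25",
                                   "2016-02-10"))
)
df <- df[order(df$hospitalization.date), ]

# Hypothetical helper: for row i, count the same patient's records dated
# within [date - 365, date]; the current record is included
count_last_year <- function(i, data) {
  d <- data$hospitalization.date[i]
  sum(data$patient.id == data$patient.id[i] &
        data$hospitalization.date >= d - 365 &
        data$hospitalization.date <= d)
}

# Brute-force O(n^2) scan, intended only as a reference for small inputs
df$hospitalizations.last.year <- vapply(
  seq_len(nrow(df)), count_last_year, integer(1), data = df
)
df
```

On the sample data this yields the counts 1, 1, 1, 1, 2, 2, 3 in date order, matching the expected result posted in the question.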

The OP has provided two data sets with 6 and 7 rows, respectively. Here, the data set with 7 rows is used, as it was posted as the expected result:

df <- data.frame(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16", 
                                   "2016/1/7", "2015/12/20", "2015/12/25", "2016/2/10")))
df <- df[order(df$hospitalization.date), ]
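Since the question also asks about dplyr, the same idea can be sketched there. This is not from the original answer: it assumes dplyr 1.1.0 or later, which introduced `join_by()` with inequality conditions, and it has not been benchmarked against the data.table version. The `x$` and `y$` prefixes mark which side of the self-join each column comes from.

```r
library(dplyr)

# Same 7-row sample data as above
df <- data.frame(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013-10-15", "2014-10-15", "2015-07-16",
                                   "2016-01-07", "2015-12-20", "2015-12-25",
                                   "2016-02-10"))
)

result <- df %>%
  # helper column: start of the 365-day look-back window
  mutate(date_365 = hospitalization.date - 365) %>%
  # non-equi self-join: pair each record (x) with the same patient's
  # records (y) that fall inside [date_365, hospitalization.date]
  inner_join(
    df,
    by = join_by(patient.id,
                 x$date_365 <= y$hospitalization.date,
                 x$hospitalization.date >= y$hospitalization.date)
  ) %>%
  # one row per original record, counting its matches
  count(patient.id,
        hospitalization.date = hospitalization.date.x,
        name = "hospitalizations.last.year") %>%
  arrange(hospitalization.date)

result
```

Like the data.table version, this groups by patient and date, so two records of one patient on the same day would collapse into a single output row.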
