Date roll-up in R

Problem description

This is my second post on Stack Overflow, so sorry in advance if it's not formatted optimally. I have a dataset that looks like the following:

    ID  FromDate    ToDate      SiteID  Cost
    1   8/12/2014   8/31/2014   12      245.98
    1   9/1/2014    9/7/2014    12      269.35
    1   10/10/2014  10/17/2014  12      209.98
    1   11/22/2014  11/30/2014  12      309.12
    1   12/1/2014   12/11/2014  12      202.14
    2   8/16/2014   8/21/2014   12      109.35
    2   8/22/2014   8/24/2014   14      44.12
    2   9/25/2014   9/29/2014   12      98.75
    3   9/15/2014   9/30/2014   23      536.27
    3   10/1/2014   10/31/2014  12      529.87
    3   11/1/2014   11/30/2014  12      969.55
    3   12/1/2014   12/12/2014  12      607.35

What I want is:

    ID  FromDate    ToDate      SiteID  Cost
    1   8/12/2014   9/7/2014    12      515.33
    1   10/10/2014  10/17/2014  12      209.98
    1   11/22/2014  12/11/2014  12      511.26
    2   8/16/2014   8/21/2014   12      109.35
    2   8/22/2014   8/24/2014   14      44.12
    2   9/25/2014   9/29/2014   12      98.75
    3   9/15/2014   9/30/2014   23      536.27
    3   10/1/2014   12/12/2014  12      2106.77

As you can see, contiguous date ranges are rolled up into a single row and their costs are summed, per ID and SiteID. To make the rules explicit: if the date ranges are contiguous but the SiteID changes, the row stays separate; if the date ranges are not contiguous, the row also stays separate. How do I do this in R? Also, I have over 100,000 individual IDs, so what is the most efficient way or package to use for this?
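To experiment with the sample table above, it first has to be loaded with the date columns parsed as `Date`, since the roll-up rule depends on day differences. The following base R sketch is one way to do that. The data frame name `df` is an assumption chosen to match the answer code, and the `gap` vector is only an illustration of the "continuation" rule (a gap of exactly 1 day means the row continues the previous one).

```r
# Sample rows from the question, typed in by hand. The name `df` is an assumption.
df <- data.frame(
  ID       = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  FromDate = c("8/12/2014", "9/1/2014", "10/10/2014", "11/22/2014", "12/1/2014",
               "8/16/2014", "8/22/2014", "9/25/2014",
               "9/15/2014", "10/1/2014", "11/1/2014", "12/1/2014"),
  ToDate   = c("8/31/2014", "9/7/2014", "10/17/2014", "11/30/2014", "12/11/2014",
               "8/21/2014", "8/24/2014", "9/29/2014",
               "9/30/2014", "10/31/2014", "11/30/2014", "12/12/2014"),
  SiteID   = c(12, 12, 12, 12, 12, 12, 14, 12, 23, 12, 12, 12),
  Cost     = c(245.98, 269.35, 209.98, 309.12, 202.14,
               109.35, 44.12, 98.75,
               536.27, 529.87, 969.55, 607.35),
  stringsAsFactors = FALSE
)

# Convert month/day/year strings to Date so that subtraction yields day counts.
df$FromDate <- as.Date(df$FromDate, format = "%m/%d/%Y")
df$ToDate   <- as.Date(df$ToDate,   format = "%m/%d/%Y")

# Sort so that each row's predecessor is the previous interval of the same ID.
df <- df[order(df$ID, df$FromDate), ]

# Days between each FromDate and the previous row's ToDate (NA for the first row).
gap <- c(NA, as.numeric(df$FromDate[-1] - df$ToDate[-nrow(df)]))
```

Rows where `gap` equals 1 are candidates for merging with the row before them; whether they actually merge also depends on ID and SiteID staying the same.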

Recommended answer

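This might work, using dplyr:

library(dplyr)

# Assumes FromDate and ToDate are Date vectors and rows are sorted by ID, then FromDate.
df %>% 
  # A new group starts whenever FromDate is not the day after the previous ToDate.
  # The default makes the first row's difference 0, so it always opens group 1.
  mutate(gr = cumsum(FromDate - lag(ToDate, default = first(FromDate)) != 1)) %>% 
  group_by(gr, ID, SiteID) %>% 
  summarise(FromDate = min(FromDate), 
            ToDate   = max(ToDate), 
            cost     = sum(Cost))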


     gr    ID SiteID   FromDate     ToDate    cost
  (int) (int)  (int)     (date)     (date)   (dbl)
1     1     1     12 2014-08-12 2014-09-07  515.33
2     2     1     12 2014-10-10 2014-10-17  209.98
3     3     1     12 2014-11-22 2014-12-11  511.26
4     4     2     12 2014-08-16 2014-08-21  109.35
5     4     2     14 2014-08-22 2014-08-24   44.12
6     5     2     12 2014-09-25 2014-09-29   98.75
7     6     3     23 2014-09-15 2014-09-30  536.27
8     6     3     12 2014-10-01 2014-12-12 2106.77

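Or with data.table:

library(data.table)
setDT(df)  # converts df to a data.table by reference
# Same idea: gr increases each time FromDate is not the day after the previous ToDate.
df[, gr := cumsum(FromDate - shift(ToDate, fill=1) != 1)
   ][, list(FromDate=min(FromDate), ToDate=max(ToDate), cost=sum(Cost)), by=.(gr, ID, SiteID)]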



   gr ID SiteID   FromDate     ToDate    cost
1:  1  1     12 2014-08-12 2014-09-07  515.33
2:  2  1     12 2014-10-10 2014-10-17  209.98
3:  3  1     12 2014-11-22 2014-12-11  511.26
4:  4  2     12 2014-08-16 2014-08-21  109.35
5:  4  2     14 2014-08-22 2014-08-24   44.12
6:  5  2     12 2014-09-25 2014-09-29   98.75
7:  6  3     23 2014-09-15 2014-09-30  536.27
8:  6  3     12 2014-10-01 2014-12-12 2106.77
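One case that may be worth checking: in both versions above, `gr` only tracks date gaps, so if an ID moves from one site to another and back again without any gap, the two stays at the first site share the same `gr`, ID and SiteID and would be merged into one row. If those stays should remain separate, a variant that also starts a new group on every ID or SiteID change could look like the base R sketch below. The helper `roll_up` and the last two rows of the example data are hypothetical and not part of the original answer.

```r
# Hypothetical helper, not part of the original answer.
# Assumes at least one row, Date columns, and ToDate >= FromDate on every row.
roll_up <- function(d) {
  d <- d[order(d$ID, d$FromDate), ]
  n <- nrow(d)
  prev <- seq_len(n - 1)  # index of each row's predecessor
  # TRUE where a row opens a new group: first row, new ID, new SiteID, or a date gap.
  new_group <- c(TRUE,
                 d$ID[-1] != d$ID[prev] |
                 d$SiteID[-1] != d$SiteID[prev] |
                 as.numeric(d$FromDate[-1] - d$ToDate[prev]) != 1)
  gr <- cumsum(new_group)
  # A row closes its group when the next row opens a new one (or it is the last row).
  last_in_group <- c(new_group[-1], TRUE)
  data.frame(
    ID       = d$ID[new_group],
    FromDate = d$FromDate[new_group],      # taken from the first row of each group
    ToDate   = d$ToDate[last_in_group],    # taken from the last row of each group
    SiteID   = d$SiteID[new_group],
    Cost     = as.numeric(tapply(d$Cost, gr, sum)),
    row.names = NULL
  )
}

# ID 2 from the question, plus two made-up rows that return to site 12 without a gap.
d <- data.frame(
  ID       = c(2, 2, 2, 2),
  FromDate = as.Date(c("2014-08-16", "2014-08-22", "2014-08-25", "2014-08-31")),
  ToDate   = as.Date(c("2014-08-21", "2014-08-24", "2014-08-30", "2014-09-05")),
  SiteID   = c(12, 14, 12, 12),
  Cost     = c(109.35, 44.12, 10, 5)
)
res <- roll_up(d)
print(res)
```

Here the first stay at site 12 stays on its own row, while the two made-up rows are merged with each other because they are contiguous and share a site. The helper is fully vectorized, but it has not been benchmarked against the dplyr or data.table versions.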
