R data.table sum of group subset using dates


Problem description

I have a dataset like the following:

library(data.table)
dt1 <- data.table(urn = c(rep("a", 5), rep("b", 4)),
                  amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10),
                  date = as.Date(c("2016-01-01", "2017-01-02", "2017-02-04",
                                   "2017-04-19", "2018-02-11", "2016-02-14",
                                   "2017-05-06", "2017-05-12", "2017-12-12")))
dt1
#    urn amount       date
# 1:   a     10 2016-01-01
# 2:   a     12 2017-01-02
# 3:   a     23 2017-02-04
# 4:   a     15 2017-04-19
# 5:   a     19 2018-02-11
# 6:   b     42 2016-02-14
# 7:   b     11 2017-05-06
# 8:   b      5 2017-05-12
# 9:   b     10 2017-12-12

I am trying to determine the cumulative value for a group over the preceding 12 months. I know I can use shift with data.table to scan backwards or forwards, but the challenge I can't get my head around is knowing how many records to sum when that number changes depending on how many records each urn has.

The type of result I am looking for is:

dt1
#    urn amount       date summed12m
# 1:   a     10 2016-01-01        10
# 2:   a     12 2017-01-02        12
# 3:   a     23 2017-02-04        35
# 4:   a     15 2017-04-19        50
# 5:   a     19 2018-02-11        34
# 6:   b     42 2016-02-14        42
# 7:   b     11 2017-05-06        11
# 8:   b      5 2017-05-12        16
# 9:   b     10 2017-12-12        26

I would prefer a data.table solution because of the volume of my data, but I am open to other options if they are likely to be efficient on a table with about 12M records.

Solution

As an alternative to foverlaps(), this can also be solved by aggregating in a non-equi join:

library(lubridate)
dt1[, summed12m := dt1[.(urn, date, date %m-% months(12)), 
                       on = .(urn = V1, date <= V2, date >= V3), 
                       sum(amount), by = .EACHI]$V1][]

   urn amount       date summed12m
1:   a     10 2016-01-01        10
2:   a     12 2017-01-02        12
3:   a     23 2017-02-04        35
4:   a     15 2017-04-19        50
5:   a     19 2018-02-11        34
6:   b     42 2016-02-14        42
7:   b     11 2017-05-06        11
8:   b      5 2017-05-12        16
9:   b     10 2017-12-12        26

lubridate is used for date arithmetic to avoid mishaps in case one of the dates is February 29.
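To illustrate the February 29 point, here is a small sketch comparing plain period subtraction with %m-%. The date 2020-02-29 is an assumed example value, not taken from the question's data.

```r
library(lubridate)

# Assumed example date: a leap day (not part of the original dataset).
leap_day <- as.Date("2020-02-29")

# Plain period subtraction targets 2019-02-29, which does not exist,
# so lubridate returns NA.
naive <- leap_day - months(12)

# %m-% rolls back to the last valid day of the target month instead.
rolled <- leap_day %m-% months(12)

print(naive)
print(rolled)
```

With the naive form, a single leap-day row would silently produce an NA window boundary, which is the kind of mishap %m-% is meant to avoid.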

The essential part is the non-equi join:

dt1[.(urn, date, date %m-% months(12)), 
    on = .(urn = V1, date <= V2, date >= V3), 
    sum(amount), by = .EACHI]

   urn       date       date V1
1:   a 2016-01-01 2015-01-01 10
2:   a 2017-01-02 2016-01-02 12
3:   a 2017-02-04 2016-02-04 35
4:   a 2017-04-19 2016-04-19 50
5:   a 2018-02-11 2017-02-11 34
6:   b 2016-02-14 2015-02-14 42
7:   b 2017-05-06 2016-05-06 11
8:   b 2017-05-12 2016-05-12 16
9:   b 2017-12-12 2016-12-12 26

The last column of this result, V1, is picked to create the new summed12m column in dt1.
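As a sanity check on what the join computes, the same totals can be reproduced with a slow row-by-row loop. This is only an illustrative sketch for small data (the helper name window_sum is made up here); it is not meant as an alternative for a 12M-row table.

```r
library(data.table)
library(lubridate)

dt1 <- data.table(urn = c(rep("a", 5), rep("b", 4)),
                  amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10),
                  date = as.Date(c("2016-01-01", "2017-01-02", "2017-02-04",
                                   "2017-04-19", "2018-02-11", "2016-02-14",
                                   "2017-05-06", "2017-05-12", "2017-12-12")))

# Hypothetical helper: for row i, sum the amounts of the same urn whose
# date lies in [date - 12 months, date], mirroring the join conditions.
window_sum <- function(i) {
  start <- dt1$date[i] %m-% months(12)
  in_window <- dt1$urn == dt1$urn[i] &
    dt1$date <= dt1$date[i] &
    dt1$date >= start
  sum(dt1$amount[in_window])
}

brute_force <- vapply(seq_len(nrow(dt1)), window_sum, numeric(1))
print(brute_force)
```

The loop makes explicit that each row defines its own window, which is why a fixed-lag approach such as shift does not fit this problem.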

Additional explanation

The OP has asked where V1, V2, and V3 come from.

The expression .(urn, date, date %m-% months(12)) creates a new data.table on the fly (.() is a data.table abbreviation for list()). As no column names have been specified, data.table creates the default column names V1, V2, etc.
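The default naming can be seen in isolation by converting a plain list to a data.table. This is a minimal sketch with made-up values; it uses as.data.table() on a list as a stand-in and does not claim to show the exact internal path data.table takes for the i argument.

```r
library(data.table)

# An unnamed list converted to a data.table gets default column names.
unnamed <- as.data.table(list(1:3, c("x", "y", "z")))
print(names(unnamed))

# Naming the list elements carries the names over to the columns.
named <- as.data.table(list(id = 1:3, label = c("x", "y", "z")))
print(names(named))
```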

Less sloppily, the expression can be rewritten with explicitly named columns:

dt1[.(urn = urn, end = date, start = date %m-% months(12)), 
    on = .(urn, date <= end, date >= start), 
    sum(amount), by = .EACHI]
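For completeness, the named variant can be combined with the := assignment in the same way as the first snippet. This sketch simply puts the two pieces from this answer together on the sample data.

```r
library(data.table)
library(lubridate)

dt1 <- data.table(urn = c(rep("a", 5), rep("b", 4)),
                  amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10),
                  date = as.Date(c("2016-01-01", "2017-01-02", "2017-02-04",
                                   "2017-04-19", "2018-02-11", "2016-02-14",
                                   "2017-05-06", "2017-05-12", "2017-12-12")))

# Self-join: each row of the lookup table defines one window [start, end]
# per urn; by = .EACHI aggregates the matching rows of dt1 for each window.
dt1[, summed12m := dt1[.(urn = urn, end = date, start = date %m-% months(12)),
                       on = .(urn, date <= end, date >= start),
                       sum(amount), by = .EACHI]$V1]

print(dt1)
```

Naming the lookup columns start and end makes the on clause read as the window definition it is, at no extra cost.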
