R data.table sum of group subset using dates
Problem description
I have a dataset like the following:

library(data.table)
dt1 <- data.table(urn = c(rep("a", 5), rep("b", 4)),
                  amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10),
                  date = as.Date(c("2016-01-01", "2017-01-02", "2017-02-04",
                                   "2017-04-19", "2018-02-11", "2016-02-14",
                                   "2017-05-06", "2017-05-12", "2017-12-12")))
dt1
#    urn amount       date
# 1:   a     10 2016-01-01
# 2:   a     12 2017-01-02
# 3:   a     23 2017-02-04
# 4:   a     15 2017-04-19
# 5:   a     19 2018-02-11
# 6:   b     42 2016-02-14
# 7:   b     11 2017-05-06
# 8:   b      5 2017-05-12
# 9:   b     10 2017-12-12

I am trying to determine the cumulative value for a group over the preceding 12 months. I know I can use shift with data.table to scan backwards or forwards. The biggest challenge, which I can't get my head around, is how to know how many records to sum when that number can change based on how many records each urn has.

The type of result I am looking for is:

dt1
#    urn amount       date summed12m
# 1:   a     10 2016-01-01        10
# 2:   a     12 2017-01-02        12
# 3:   a     23 2017-02-04        35
# 4:   a     15 2017-04-19        50
# 5:   a     19 2018-02-11        34
# 6:   b     42 2016-02-14        42
# 7:   b     11 2017-05-06        11
# 8:   b      5 2017-05-12        16
# 9:   b     10 2017-12-12        26

I'm preferably looking for a data.table solution due to the volume of my data, but am open to other options too if they are likely to be efficient over a table with about 12M records.

Solution

As an alternative to foverlaps(), this can also be solved by aggregating in a non-equi join:

library(lubridate)
dt1[, summed12m := dt1[.(urn, date, date %m-% months(12)),
                       on = .(urn = V1, date <= V2, date >= V3),
                       sum(amount), by = .EACHI]$V1][]

   urn amount       date summed12m
1:   a     10 2016-01-01        10
2:   a     12 2017-01-02        12
3:   a     23 2017-02-04        35
4:   a     15 2017-04-19        50
5:   a     19 2018-02-11        34
6:   b     42 2016-02-14        42
7:   b     11 2017-05-06        11
8:   b      5 2017-05-12        16
9:   b     10 2017-12-12        26

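To make it easier to see what the join is summing, the result can be cross-checked against a slow per-row loop. This is only a sketch for small data and is not part of the original answer. The column name check12m is made up for illustration, and the loop would not be suitable for 12M records.

```r
library(data.table)
library(lubridate)

dt1 <- data.table(
  urn = c(rep("a", 5), rep("b", 4)),
  amount = c(10, 12, 23, 15, 19, 42, 11, 5, 10),
  date = as.Date(c("2016-01-01", "2017-01-02", "2017-02-04",
                   "2017-04-19", "2018-02-11", "2016-02-14",
                   "2017-05-06", "2017-05-12", "2017-12-12"))
)

# Non-equi join as given in the answer
dt1[, summed12m := dt1[.(urn, date, date %m-% months(12)),
                       on = .(urn = V1, date <= V2, date >= V3),
                       sum(amount), by = .EACHI]$V1]

# Brute-force check: within each urn, for every row, sum the amounts whose
# date lies in the closed window [date - 12 months, date].
# check12m is a hypothetical column name used only for this comparison.
dt1[, check12m := sapply(seq_len(.N), function(i) {
  window_start <- date[i] %m-% months(12)
  sum(amount[date >= window_start & date <= date[i]])
}), by = urn]

# TRUE if both approaches agree
all.equal(dt1$summed12m, dt1$check12m)
```

Both window boundaries are inclusive here, matching the `<=` and `>=` conditions in the on clause.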
lubridate is used for date arithmetic to avoid mishaps in case one of the dates is February 29.

The essential part is the non-equi join

dt1[.(urn, date, date %m-% months(12)),
    on = .(urn = V1, date <= V2, date >= V3),
    sum(amount), by = .EACHI]

   urn       date       date V1
1:   a 2016-01-01 2015-01-01 10
2:   a 2017-01-02 2016-01-02 12
3:   a 2017-02-04 2016-02-04 35
4:   a 2017-04-19 2016-04-19 50
5:   a 2018-02-11 2017-02-11 34
6:   b 2016-02-14 2015-02-14 42
7:   b 2017-05-06 2016-05-06 11
8:   b 2017-05-12 2016-05-12 16
9:   b 2017-12-12 2016-12-12 26

of which the last column is picked to create the new summed12m column in dt1.

Additional explanation
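As a side note on the date arithmetic mentioned above, the following sketch shows the kind of mishap that %m-% avoids. The date used is made up for illustration and does not come from the question's data.

```r
library(lubridate)

# Hypothetical leap-day date, not part of the question's data
leap_day <- as.Date("2016-02-29")

# Plain period arithmetic lands on the non-existent 2015-02-29 and gives NA
naive <- leap_day - months(12)

# %m-% rolls back to the last valid day of the target month instead
rolled <- leap_day %m-% months(12)

naive   # NA
rolled  # "2015-02-28"
```

An NA window start would make the `date >= V3` condition unusable for that row, which is presumably why the answer prefers %m-%.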
The OP has asked where V1, V2, and V3 come from.

The expression .(urn, date, date %m-% months(12)) creates a new data.table on the fly (.() is a data.table abbreviation for list()). As no column names have been specified, data.table creates default column names V1, V2, etc.

Less sloppily, the expression can be rewritten with explicitly named columns:

dt1[.(urn = urn, end = date, start = date %m-% months(12)),
    on = .(urn, date <= end, date >= start),
    sum(amount), by = .EACHI]
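
The default naming can also be observed outside of a join. The sketch below uses as.data.table() on a plain list as a stand-in for what happens to the .() expression, which is an assumption made for illustration. The values and the names unnamed and named are invented.

```r
library(data.table)

# An unnamed list converted to a data.table gets default column names
unnamed <- as.data.table(list(c("a", "b"), 1:2))
names(unnamed)  # "V1" "V2"

# Naming the list elements carries the names over to the columns
named <- as.data.table(list(urn = c("a", "b"), n = 1:2))
names(named)    # "urn" "n"
```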