基于 R 中的条件的窗口中的累积总和(或运行窗口总和) [英] Cumulative sum in a window (or running window sum) based on a condition in R
问题描述
我正在尝试根据条件计算给定窗口的累积总和.我见过解决方案执行条件累积总和的线程(为数据框中的每一行计算 R 中的条件运行总和)和滚动总和(R 中另一个变量的滚动求和),但我找不到两者在一起.我还在 data.table 没有滚动窗口功能>R data.table 滑动窗口.所以,这个问题对我来说很有挑战性.
I am trying to calculate cumulative sum for a given window based on a condition. I have seen threads where the solution does conditional cumulative sum (Calculate a conditional running sum in R for every row in data frame) and rolling sum (Rolling Sum by Another Variable in R), but I couldn't find the two together. I also saw that data.table
doesn't have a rolling window function at R data.table sliding window. So, this problem is very challenging for me.
此外,Mike Grahan 发布的关于滚动总和的解决方案超出了我的理解.我正在寻找基于 data.table
的方法,主要是为了速度.但是,如果可以理解,我对其他方法持开放态度.
Moreover, the solution posted by Mike Grahan on rolling sum is beyond my comprehension. I am looking for data.table
based method primarily for speed. However, I am open to other methods if they are understandable.
这是我的输入数据:
DFI <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2011,
2012, 2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010),
Customer = c(13575, 13575, 13575, 13575, 13575, 13575, 13575,
13575, 13575, 13575, 13575, 13578, 13578, 13578, 13578, 13578,
13578), Product = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "A", "A", "B", "C", "D", "E"), Rev = c(4,
3, 3, 1, 2, 1, 2, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2)), .Names = c("FY",
"Customer", "Product", "Rev"), row.names = c(NA, 17L), class = "data.frame")
这是我的预期输出:(手动创建;如果出现手动错误,我深表歉意)
Here's my expected output: (Manually created; My apologies if there is a manual error)
DFO <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2012,
2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010), Customer = c(13575,
13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575,
13578, 13578, 13578, 13578, 13578, 13578), Product = c("A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "A", "A", "B", "C", "D",
"E"), Rev = c(4, 3, 3, 1, 2, 3, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2),
cumsum = c(4, 7, 10, 11, 9, 3, 6, 10, 15, 21, 3, 2, 2, 4,
2, 2)), .Names = c("FY", "Customer", "Product", "Rev", "cumsum"
), row.names = c(NA, 16L), class = "data.frame")
关于逻辑的一些评论:
1) 我想在 5 年内找到滚动总和.理想情况下,我希望这个 5 年期间是可变的,即我可以在代码的其他地方指定的东西.通过这种方式,我可以自由地稍后更改窗口以进行分析.
1) I want to find rolling sum in a 5-year period. Ideally, I would like this 5-year period to be variable i.e. something I can specify elsewhere in the code. This way, I have the liberty to vary the window later on for my analysis.
2) Window 的结束基于最大年份(即上面示例中的 FY
).在上面的例子中,DFI
中的最大 FY
是 2016
.因此,对于 2016
中的所有条目,窗口的起始年份将为 2016 - 5 + 1 = 2012
.
2) The end of Window is based on the maximum year (i.e. FY
in example above). In above example, the max FY
in DFI
is 2016
. So, starting year of the window would be 2016 - 5 + 1 = 2012
for all entries in 2016
.
3) 窗口总和(或运行总和)由 Customer
和特定 Product
计算.
3) The window sum (or running sum) is calculated by Customer
and for a specific Product
.
我的尝试:
我想在发布之前尝试一些东西.这是我的代码:
I wanted to try something before posting. Here's my code:
DFI <- data.table::as.data.table(DFI)
#Sort it first
DFI<-DFI[order(Customer,FY),]
#find cumulative sum; remove Rev column; order rows
DFOTest<-DFI[,cumsum := cumsum(Rev),by=.(Customer,Product)][,.SD[which.max(cumsum)],by=.(FY,Customer,Product)][,("Rev"):=NULL][order(Customer,Product,FY)]
此代码计算累积总和,但我无法定义 5 年窗口,然后计算运行总和.我有两个问题:
This code calculates the cumulative sum, but I am unable to define 5-year window and then calculate running sum. I have two questions:
问题 1) 如何计算 5 年的总和?
Question 1) How do I calculate a 5-year running sum?
问题 2) 有人可以解释一下 Mike 在此线程上的方法吗?它似乎很快.但是,我不确定那里发生了什么.我确实看到有人要求提供一些评论,但我不确定它是否不言自明.
Question 2) Can someone please explain Mike's method on this thread ? It seems to be fast. However, I am not really sure what's going on there. I did see that someone requested some commentary, but I am not sure whether it is self-explanatory.
提前致谢.这两天我一直在为这个问题苦苦挣扎.
Thanks in advance. I have been struggling on this problem for two days.
推荐答案
1) rollapply 创建一个 Sum
函数,它接受 FY
和 Rev
作为一个 2 列矩阵(或者如果不是一个矩阵),然后将上一年 k
内这些年的收入相加.然后将 DFI
转换为数据表,对具有相同 Customer/Product/Year 的行求和,并为每个 Customer/Product 组运行 rollapplyr
和 Sum
.
1) rollapply Create a Sum
function which takes FY
and Rev
as a 2 column matrix (or if not makes it one) and then sums the revenues for those years within k
of the last year. Then convert DFI
to a data table, sum rows having the same Customer/Product/Year and run rollapplyr
with Sum
for each Customer/Product group.
library(data.table)
library(zoo)
k <- 5
Sum <- function(x) {
x <- matrix(x,, 2)
FY <- x[, 1]
Rev <- x[, 2]
ok <- FY >= tail(FY, 1) - k + 1
sum(Rev[ok])
}
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := rollapplyr(.SD, k, Sum, by.column = FALSE, partial = TRUE),
by = c("Customer", "Product"), .SDcols = c("FY", "Rev")]
给予:
> DT
Customer Product FY Rev cumsum
1: 13575 A 2011 4 4
2: 13575 A 2012 3 7
3: 13575 A 2013 3 10
4: 13575 A 2015 1 11
5: 13575 A 2016 2 9
6: 13575 B 2011 3 3
7: 13575 B 2012 3 6
8: 13575 B 2013 4 10
9: 13575 B 2014 5 15
10: 13575 B 2015 6 21
11: 13578 A 2010 3 3
12: 13578 A 2016 2 2
13: 13578 B 2013 2 2
14: 13578 C 2014 4 4
15: 13578 D 2015 2 2
16: 13578 E 2010 2 2
2) 仅限数据表
首先汇总具有相同客户/产品/FY 的行,然后按客户/产品分组,对于每个 FY 值,fy
,选择 Rev
值其 FY 值介于 fy-k+1
和 fy
和 sum 之间.
First sum rows that have the same Customer/Product/FY and then, grouping by Customer/Product, for each FY value, fy
, pick out the Rev
values whose FY values are between fy-k+1
and fy
and sum.
library(data.table)
k <- 5
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := sapply(FY, function(fy) sum(Rev[between(FY, fy-k+1, fy)])),
by = c("Customer", "Product")]
给予:
> DT
Customer Product FY Rev cumsum
1: 13575 A 2011 4 4
2: 13575 A 2012 3 7
3: 13575 A 2013 3 10
4: 13575 A 2015 1 11
5: 13575 A 2016 2 9
6: 13575 B 2011 3 3
7: 13575 B 2012 3 6
8: 13575 B 2013 4 10
9: 13575 B 2014 5 15
10: 13575 B 2015 6 21
11: 13578 A 2010 3 3
12: 13578 A 2016 2 2
13: 13578 B 2013 2 2
14: 13578 C 2014 4 4
15: 13578 D 2015 2 2
16: 13578 E 2010 2 2
这篇关于基于 R 中的条件的窗口中的累积总和(或运行窗口总和)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!