滚动窗口功能可处理不规则的时间序列,可以处理重复项 [英] Rolling window function for irregular time series that can handle duplicates
问题描述
我有以下data.frame:
I have the following data.frame:
grp nr yr
1: A 1.0 2009
2: A 2.0 2009
3: A 1.5 2009
4: A 1.0 2010
5: B 3.0 2009
6: B 2.0 2010
7: B NA 2011
8: C 3.0 2014
9: C 3.0 2019
10: C 3.0 2020
11: C 4.0 2021
所需的输出:
grp nr yr nr_roll_period_3
1 A 1.0 2009 NA
2 A 2.0 2009 NA
3 A 1.5 2009 NA
4 A 1.0 2010 NA
5 B 3.0 2009 NA
6 B 2.0 2010 NA
7 B NA 2011 NA
8 C 3.0 2014 NA
9 C 3.0 2019 NA
10 C 3.0 2020 NA
11 C 4.0 2021 3.333333
逻辑:
- 我想计算长度为k的周期的滚动平均值(假设3 ),其中3包括当前的月/年/日(按组)
- 但是,在连续3年/月/日不连续的情况下,这不应计算任何内容
- 同样,只要在此期间内要计算的列中不存在NA,则输出应为NA。
当前,我具有以下功能:
Currently I have this function:
calculate_rolling_window <-
function(dt, date_col, calc_col, id, k) {
require(data.table)
return(setDT(dt)[
, paste(calc_col, "roll_period", k, sep = "_") :=
sapply(get(date_col), function(x) mean(get(calc_col)[between(get(date_col), x - k + 1, x)])),
by = mget(id)])
}
对于常规情况(在日期列中没有重复项),它可以正常工作。但是,如果重复,则失败:
It works fine for the regular cases, where there is no duplicates in the date column. However, with duplicates it fails:
grp nr yr nr_roll_period_3
1: A 1.0 2009 1.500000
2: A 2.0 2009 1.500000
3: A 1.5 2009 1.500000
4: A 1.0 2010 1.375000
5: B 3.0 2009 NA
6: B 2.0 2010 NA
7: B NA 2011 NA
8: C 3.0 2014 NA
9: C 3.0 2019 NA
10: C 3.0 2020 NA
11: C 4.0 2021 3.333333
关于如何处理此问题的任何想法?不需要专门的 data.table
方法。
Any ideas on how to handle this? No need for exclusively data.table
approach.
推荐答案
通过分组为非等额联接来解决,以汇总滚动长度为 k
的滚动窗口,并过滤出 k
连续几年,并进行 update join :
This can be solved by grouping in a non-equi join to aggregate over a rolling window of length k
, filtering for k
consecutive years, and an update join:
library(data.table)
k <- 3L
# group by join parameters of a non-equi join
mDT <- setDT(DT)[.(grp = grp, upper = yr, lower = yr - k),
on = .(grp, yr <= upper, yr > lower),
.(uniqueN(x.yr), mean(nr)), by = .EACHI]
# update join with filtered intermediate result
DT[mDT[V1 == k], on = .(grp, yr), paste0("nr_roll_period_", k) := V2]
DT
返回OP的预期结果:
grp nr yr nr_roll_period
1: A 1.0 2009 NA
2: A 2.0 2009 NA
3: A 1.5 2009 NA
4: A 1.0 2010 NA
5: B 3.0 2009 NA
6: B 2.0 2010 NA
7: B NA 2011 NA
8: C 3.0 2014 NA
9: C 3.0 2019 NA
10: C 3.0 2020 NA
11: C 4.0 2021 3.333333
中间结果 mDT
包含滚动平均值 V2
在<$ c $上c> k 个周期,以及每个周期内唯一/不同年份 V1
的计数。它是由 DT
的非公平联接创建的,带有data.table的数据表包含实时创建的上限和下限通过。(grp = grp,上限= yr,下限= yr-k)
。
The intermediate result mDT
contains the rolling mean V2
over k
periods and the count of unique/distinct years V1
within each period. It is created by a non-equi join of DT
with a data.table containing the upper and lower bounds which is created on-the-fly by .(grp = grp, upper = yr, lower = yr - k)
.
mDT
grp yr yr V1 V2
1: A 2009 2006 1 1.500000
2: A 2009 2006 1 1.500000
3: A 2009 2006 1 1.500000
4: A 2010 2007 2 1.375000
5: B 2009 2006 1 3.000000
6: B 2010 2007 2 2.500000
7: B 2011 2008 3 NA
8: C 2014 2011 1 3.000000
9: C 2019 2016 1 3.000000
10: C 2020 2017 2 3.000000
11: C 2021 2018 3 3.333333
这是针对完全包含 k
distinct 年的行进行过滤的:
This is filtered for rows which contain exactly k
distinct years:
mDT[V1 == k]
grp yr yr V1 V2
1: B 2011 2008 3 NA
2: C 2021 2018 3 3.333333
最后,这与 DT
将新列附加到 DT
。
Finally, this is joined with DT
to append the new column to DT
.
请注意,的意思是( )
如果输入数据中有 NA
,则默认返回 NA
。
Note, that mean()
returns NA
by default if there is an NA
in the input data.
library(data.table)
DT <- fread(text = "rn grp nr yr
1: A 1.0 2009
2: A 2.0 2009
3: A 1.5 2009
4: A 1.0 2010
5: B 3.0 2009
6: B 2.0 2010
7: B NA 2011
8: C 3.0 2014
9: C 3.0 2019
10: C 3.0 2020
11: C 4.0 2021", drop = 1L)
这篇关于滚动窗口功能可处理不规则的时间序列,可以处理重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!