滚动窗口功能可处理不规则的时间序列,可以处理重复项 [英] Rolling window function for irregular time series that can handle duplicates

查看:119
本文介绍了滚动窗口功能可处理不规则的时间序列,可以处理重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下data.frame:

I have the following data.frame:

    grp  nr   yr
 1:   A 1.0 2009
 2:   A 2.0 2009
 3:   A 1.5 2009
 4:   A 1.0 2010
 5:   B 3.0 2009
 6:   B 2.0 2010
 7:   B  NA 2011
 8:   C 3.0 2014
 9:   C 3.0 2019
10:   C 3.0 2020
11:   C 4.0 2021

所需的输出:

   grp  nr   yr nr_roll_period_3
1    A 1.0 2009               NA
2    A 2.0 2009               NA
3    A 1.5 2009               NA
4    A 1.0 2010               NA
5    B 3.0 2009               NA
6    B 2.0 2010               NA
7    B  NA 2011               NA
8    C 3.0 2014               NA
9    C 3.0 2019               NA
10   C 3.0 2020               NA
11   C 4.0 2021         3.333333

逻辑:


  • 我想计算长度为k的周期的滚动平均值(假设3 ),其中3包括当前的月/年/日(按组)

  • 但是,在连续3年/月/日不连续的情况下,这不应计算任何内容

  • 同样,只要在此期间内要计算的列中不存在NA,则输出应为NA。

当前,我具有以下功能:

Currently I have this function:

calculate_rolling_window <-

  function(dt, date_col, calc_col, id, k) {

    require(data.table)

    return(setDT(dt)[
      , paste(calc_col, "roll_period", k, sep = "_") := 
        sapply(get(date_col), function(x) mean(get(calc_col)[between(get(date_col), x - k + 1, x)])),
      by = mget(id)])

  }

对于常规情况(在日期列中没有重复项),它可以正常工作。但是,如果重复,则失败:

It works fine for the regular cases, where there is no duplicates in the date column. However, with duplicates it fails:

    grp  nr   yr nr_roll_period_3
 1:   A 1.0 2009         1.500000
 2:   A 2.0 2009         1.500000
 3:   A 1.5 2009         1.500000
 4:   A 1.0 2010         1.375000
 5:   B 3.0 2009               NA
 6:   B 2.0 2010               NA
 7:   B  NA 2011               NA
 8:   C 3.0 2014               NA
 9:   C 3.0 2019               NA
10:   C 3.0 2020               NA
11:   C 4.0 2021         3.333333

关于如何处理此问题的任何想法?不需要专门的 data.table 方法。

Any ideas on how to handle this? No need for exclusively data.table approach.

推荐答案

通过分组为非等额联接来解决,以汇总滚动长度为 k 的滚动窗口,并过滤出 k 连续几年,并进行 update join

This can be solved by grouping in a non-equi join to aggregate over a rolling window of length k, filtering for k consecutive years, and an update join:

library(data.table)
k <- 3L
# group by join parameters of a non-equi join
mDT <- setDT(DT)[.(grp = grp, upper = yr, lower = yr - k), 
                 on = .(grp, yr <= upper, yr > lower), 
                 .(uniqueN(x.yr), mean(nr)), by = .EACHI]
# update join with filtered intermediate result
DT[mDT[V1 == k], on = .(grp, yr), paste0("nr_roll_period_", k) := V2]
DT

返回OP的预期结果:


    grp  nr   yr nr_roll_period
 1:   A 1.0 2009             NA
 2:   A 2.0 2009             NA
 3:   A 1.5 2009             NA
 4:   A 1.0 2010             NA
 5:   B 3.0 2009             NA
 6:   B 2.0 2010             NA
 7:   B  NA 2011             NA
 8:   C 3.0 2014             NA
 9:   C 3.0 2019             NA
10:   C 3.0 2020             NA
11:   C 4.0 2021       3.333333


中间结果 mDT 包含滚动平均值 V2 在<$ c $上c> k 个周期,以及每个周期内唯一/不同年份 V1 的计数。它是由 DT 非公平联接创建的,带有data.table的数据表包含实时创建的上限和下限通过。(grp = grp,上限= yr,下限= yr-k)

The intermediate result mDT contains the rolling mean V2 over k periods and the count of unique/distinct years V1 within each period. It is created by a non-equi join of DT with a data.table containing the upper and lower bounds which is created on-the-fly by .(grp = grp, upper = yr, lower = yr - k).

mDT



    grp   yr   yr V1       V2
 1:   A 2009 2006  1 1.500000
 2:   A 2009 2006  1 1.500000
 3:   A 2009 2006  1 1.500000
 4:   A 2010 2007  2 1.375000
 5:   B 2009 2006  1 3.000000
 6:   B 2010 2007  2 2.500000
 7:   B 2011 2008  3       NA
 8:   C 2014 2011  1 3.000000
 9:   C 2019 2016  1 3.000000
10:   C 2020 2017  2 3.000000
11:   C 2021 2018  3 3.333333


这是针对完全包含 k distinct 年的行进行过滤的:

This is filtered for rows which contain exactly k distinct years:

mDT[V1 == k]



   grp   yr   yr V1       V2
1:   B 2011 2008  3       NA
2:   C 2021 2018  3 3.333333


最后,这与 DT 将新列附加到 DT

Finally, this is joined with DT to append the new column to DT.

请注意,的意思是( )如果输入数据中有 NA ,则默认返回 NA

Note, that mean() returns NA by default if there is an NA in the input data.

library(data.table)
DT <- fread(text = "rn    grp  nr   yr
 1:   A 1.0 2009
 2:   A 2.0 2009
 3:   A 1.5 2009
 4:   A 1.0 2010
 5:   B 3.0 2009
 6:   B 2.0 2010
 7:   B  NA 2011
 8:   C 3.0 2014
 9:   C 3.0 2019
10:   C 3.0 2020
11:   C 4.0 2021", drop = 1L)

这篇关于滚动窗口功能可处理不规则的时间序列,可以处理重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆