如何通过指定的分组来计算移动平均值并处理资产净值 [英] How to calculate moving average by specified grouping and deal with NAs

查看:124
本文介绍了如何通过指定的分组来计算移动平均值并处理资产净值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个data.table,它需要针对指定的分组(ID1,ID2)在前n天的数据(为简单起见,我们使用n = 2,而不是包括当天)来计算移动平均值.移动平均值应尝试包括每个ID1-ID2对的值的最后2天.我想计算移动平均线以两种方式处理NA: 1.仅在有2个非NA观测值时进行计算,否则平均值应为NA(例如,ID1-ID2内的前2天将始终具有NA). 2.根据最近2天内的所有非NA观测值(na.rm = TRUE吗?)计算移动平均值.

I have a data.table which needs a moving average to be calculated on the previous n days of data (let's use n=2 for simplicity, not incl. current day) for a specified grouping (ID1, ID2). The moving average should attempt to include the last 2 days of values for each ID1-ID2 pair. I would like to calculate moving average to handle NAs two separate ways: 1. Only calculate when there are 2 non-NA observations, otherwise avg should be NA (e.g. first 2 days within an ID1-ID2 will always have NAs). 2. Calculate the moving average based on any non-NA observations within the last 2 days (na.rm=TRUE ?).

我尝试使用zoo软件包及其中的各种功能.我已经确定了以下内容(使用shift()排除了平均值中考虑的星期,以相反的顺序显示日期,以突出显示最初并不总是按日期排序):

I've tried to use the zoo package and various functions within it. I've settled on the following (used shift() to exclude the week considered in the avg, put dates in reverse order to highlight dates are not always ordered initially):

library(zoo)
library(data.table)
DATE = rev(rep(seq(as.Date("2018-01-01"),as.Date("2018-01-04"),"day"),4))
VALUE =seq(1,16,1)
VALUE[16] <- NA
ID1 = rep(c("A","B"),each=8)
ID2 = rep(1:2,2,each=4)
testdata = data.frame (DATE, ID1, ID2, VALUE)
setDT(testdata)[order(DATE), VALUE_AVG := shift(rollapplyr(VALUE, 2, mean, 
na.rm=TRUE,fill = NA)), by = c("ID1", "ID2")]

我似乎很难按多列进行分组. VALUE以NA值开头/结尾的分组似乎也引起了问题.我对在data.table框架内有意义的任何解决方案持开放态度,尤其是frollmean(需要更新我的R + data.table版本).我不知道是否需要结合指定的对齐方式(例如正确")对日期进行不同的排序.

I seem to have trouble grouping by multiple columns. Groupings where VALUE begins/ends with NA values also seem to cause issues. I'm open to any solutions which make sense within a data.table framework, especially frollmean (need to update my versions of R + data.table). I don't know if I need to order the dates differently in conjunction with a specified alignment (e.g. "right").

我希望我的输出看起来像以下内容,但按照ID1-ID2分组的最早日期先后排序:

I would hope my output would look something like the following except ordered by oldest date first per ID1-ID2 grouping:

           DATE ID1 ID2 VALUE VALUE_AVG
 1: 2018-01-04   A   1     1       2.5
 2: 2018-01-03   A   1     2       3.5
 3: 2018-01-02   A   1     3        NA
 4: 2018-01-01   A   1     4        NA
 5: 2018-01-04   A   2     5       6.5
 6: 2018-01-03   A   2     6       7.5
 7: 2018-01-02   A   2     7        NA
 8: 2018-01-01   A   2     8        NA
 9: 2018-01-04   B   1     9      10.5
10: 2018-01-03   B   1    10      11.5
11: 2018-01-02   B   1    11        NA
12: 2018-01-01   B   1    12        NA
13: 2018-01-04   B   2    13      14.5
14: 2018-01-03   B   2    14      15.0
15: 2018-01-02   B   2    15        NA
16: 2018-01-01   B   2    NA        NA

我的代码似乎可以大致实现示例数据的预期结果.但是,当尝试在大型数据集上平均运行4个星期(其中ID1和ID2都是整数)时,出现以下错误:

My code seems to roughly achieve the desired results for the sample data. Nevertheless, when trying to run the same code on large dataset for a 4-week average where ID1 and ID2 are both integers, I get the following error:

Error in seq.default(start.at, NROW(data), by = by) : 
  wrong sign in 'by' argument

对于大多数ID1-ID2组合,我的结果似乎正确,但是在ID1的某些特定情况下,VALUE具有前导NA和尾随NA.我猜这是造成问题的原因,尽管上面的示例中没有.

My results seem right for most ID1-ID2 combinations but there are specific cases of ID1 where VALUE has leading and trailing NAs. I'm guessing this is causing the issue, although it hasn't for the example above.

推荐答案

使用shift会使此操作不必要地复杂. rollapply已经可以自己处理了.在rollapplyr中指定:

Using shift complicates this unnecessarily. rollapply already can handle that itself. In rollapplyr specify:

  • list(-seq(2))的宽度,用于指定它应作用于偏移量-1和-2.

  • a width of list(-seq(2)) to specify that it should act on offsets -1 and -2.

partial = TRUE表示如果前面的行少于2行,则将使用其中的任何行.

partial = TRUE to indicate that if there are fewer than 2 prior rows it will use whatever is there.

fill = NA用NA填充空白单元格

fill = NA to fill empty cells with NA

na.rm = TRUE删除所有NA,仅对剩余单元格执行均值.如果先前的单元格全部为NA,则表示给出NaN.

na.rm = TRUE to remove any NAs and only perform the mean on the remaining cells. If the prior cells are all NA then mean gives NaN.

仅考虑存在2个先前给出NA的非NA的情况,否则请删除partial = TRUEna.rm = TRUE自变量.

To only consider situations where there are 2 prior non-NAs giving NA otherwise remove the partial = TRUE and na.rm = TRUE arguments.

取前2行中非NA的平均值,如果前行较少,则取较少的行.

Take mean of non-NAs in prior 2 rows or fewer rows if fewer prior rows.

testdata <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))
testdata[, VALUE_AVG := 
  rollapplyr(VALUE, list(-seq(2)), mean, fill = NA, partial = TRUE, na.rm = TRUE),
  by = c("ID1", "ID2")]
testdata

给予:

          DATE ID1 ID2 VALUE VALUE_AVG
 1: 2018-01-01   A   1     4        NA
 2: 2018-01-02   A   1     3       4.0
 3: 2018-01-03   A   1     2       3.5
 4: 2018-01-04   A   1     1       2.5
 5: 2018-01-01   A   2     8        NA
 6: 2018-01-02   A   2     7       8.0
 7: 2018-01-03   A   2     6       7.5
 8: 2018-01-04   A   2     5       6.5
 9: 2018-01-01   B   1    12        NA
10: 2018-01-02   B   1    11      12.0
11: 2018-01-03   B   1    10      11.5
12: 2018-01-04   B   1     9      10.5
13: 2018-01-01   B   2    NA        NA
14: 2018-01-02   B   2    15       NaN
15: 2018-01-03   B   2    14      15.0
16: 2018-01-04   B   2    13      14.5

第二种情况

如果前2行中的任何一行为NA,或者前行少于2行,则为NA.

Second case

NA if any of the prior 2 rows are NA or if there are fewer than 2 prior rows.

testdata <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))
testdata[, VALUE_AVG := 
  rollapplyr(VALUE, list(-seq(2)), mean, fill = NA),
  by = c("ID1", "ID2")]
testdata

给予:

          DATE ID1 ID2 VALUE VALUE_AVG
 1: 2018-01-01   A   1     4        NA
 2: 2018-01-02   A   1     3        NA
 3: 2018-01-03   A   1     2       3.5
 4: 2018-01-04   A   1     1       2.5
 5: 2018-01-01   A   2     8        NA
 6: 2018-01-02   A   2     7        NA
 7: 2018-01-03   A   2     6       7.5
 8: 2018-01-04   A   2     5       6.5
 9: 2018-01-01   B   1    12        NA
10: 2018-01-02   B   1    11        NA
11: 2018-01-03   B   1    10      11.5
12: 2018-01-04   B   1     9      10.5
13: 2018-01-01   B   2    NA        NA
14: 2018-01-02   B   2    15        NA
15: 2018-01-03   B   2    14        NA
16: 2018-01-04   B   2    13      14.5

这篇关于如何通过指定的分组来计算移动平均值并处理资产净值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆