How can I roll up lagged time data given conditions in a data.table in R?

Problem description

I'm fairly new to R and have gone through some tutorials. What I'd like to do is find a good method of joining data onto itself based on some conditions.

In this case, I want to pick an arbitrary lag length and create a rolling window. For example, with lag = 1 and window width = 2, I want to roll up, for each month, the 2 months that begin 1 month earlier, if they exist.

If I start with a data table like this:

mytable = data.table(Month = c(6, 5, 4, 6, 5), Year = c(2016, 2016, 2016, 2016, 2016), Company = c('Kellog', 'Kellog', 'General Mills', 'General Mills', 'General Mills'), ProducedCereals = c(6, 3, 12, 5, 7), CommercialsShown = c(12, 15, 4, 20, 19))

Month Year Company   ProducedCereals CommercialsShown
  6   2016  Kellog         6              12
  5   2016  Kellog         3              15
  4   2016  Kellog        12               4
  6   2016  General Mills  5              20
  5   2016  General Mills  7              19

The table with the calculated fields might look like this:

Month Year Company   ProducedCereals CommercialsShown
  6   2016  Kellog        15              19
  5   2016  Kellog        12               4
  4   2016  Kellog        NA              NA
  6   2016  General Mills  7              19
  5   2016  General Mills NA              NA

I've tried rollapply() with a list as the width, but it seems to depend on the data being a regular time series, and mine isn't. The data needs to be grouped by Company, and some rows might be missing. The calculation also needs to take the previous n rows based on the Month and Year fields.

I realize a workaround might be to reshape the data so that the operation is performed for each Company subset, injecting dummy data for months missing in the middle, but I suspect a better way exists.

I tried the following approach, which applies a lag and a rolling window, but without respect to the month, year, and company.

newthing <- lapply(mytable[,c('ProducedCereals'),with=F], function(x) rollapply(x, width=list(2:3),sum,align='left',fill=NA))

Recommended answer

1) Using the data defined in the Note at the end, use rollapply as shown below. nms holds the names of the columns over which to perform the rolling window calculation; it could also be specified as column indexes (i.e. nms <- 4:5). Sum is like sum except that it returns NA, instead of 0, if given a series that is entirely NA; otherwise it performs sum(x, na.rm = TRUE). Note that the NA values appended in roll ensure that the series is not shorter than the window width.

library(data.table)
library(zoo)

k <- 2 # prior two months

Sum <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
roll <- function(x) rollapply(c(x, rep(NA, k)), list(1:k), Sum)
nms <- names(mytable)[4:5]

mytable[, (nms) := lapply(.SD, roll), .SDcols = nms, by = "Company"]

giving:

> mytable
   Month Year       Company ProducedCereals CommercialsShown
1:     6 2016        Kellog              15               19
2:     5 2016        Kellog              12                4
3:     4 2016        Kellog              NA               NA
4:     6 2016 General Mills               7               19
5:     5 2016 General Mills              NA               NA
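To make the mechanics of (1) concrete, here is a minimal sketch on a single column for a single company. It assumes the rows are stored in descending date order, as in mytable, and reuses Sum and k from the answer; the names x, padded and res are introduced here for illustration only.

```r
library(zoo)

k <- 2  # prior two months
Sum <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)

# Base sum() turns an all-NA window into 0 once NAs are removed; Sum() keeps NA.
sum(c(NA, NA), na.rm = TRUE)
Sum(c(NA, NA))

# Kellog's ProducedCereals in stored order: June, May, April
x <- c(6, 3, 12)

# Offsets 1:k pick the next k positions. Because the rows are in descending
# date order, those positions hold the k earlier months.
# Padding with k NAs gives every original row a full window to look at.
padded <- c(x, rep(NA, k))
res <- rollapply(padded, list(1:k), Sum)
res
```

Because the offsets are positional, this only matches calendar months when no month is missing within a company, which is the case that (1a) and (1b) address.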

1a) A comment mentions the situation where there are missing rows and only the most recent two calendar months prior to the current row are to be used, so fewer than 2 rows might be used in any sum.

In this case it is convenient to sort the data first by Company and then by date in ascending order, which implies that we want right alignment rather than left alignment in rollapply.

We pass a zoo object with a yearmon index to rollapply so that Sum2 has a time index it can check in order to subset its input to the desired window. We use a window size of k+1 = 3 and only sum the values in the window whose times lie within the specified bounds. We specify coredata = FALSE in the rollapply call so that both the data and the index, not just the data, are passed to the function.

k <- 2 # prior 2 months

# inputs zoo object x, subsets it to specified window and sums
Sum2 <- function(x) {
  w <- window(x, start = end(x) - k/12, end = end(x) - 1/12)
  if (length(w) == 0 || all(is.na(w))) NA_real_ else sum(w, na.rm = TRUE)
}

nms <- names(mytable)[4:5]

setkey(mytable, Company, Year, Month) # sort

# create zoo object from arguments and run rollapplyr using Sum2
roll2 <- function(x, year, month) {
  z <- zoo(x, as.yearmon(year + (month - 1)/12))
  coredata(rollapplyr(z, k+1, Sum2, coredata = FALSE, partial = TRUE))
}

mytable[, (nms) := lapply(.SD, roll2, Year, Month), .SDcols = nms, by = "Company"]

giving:

> mytable
   Month Year       Company ProducedCereals CommercialsShown
1:     5 2016 General Mills              NA               NA
2:     6 2016 General Mills               7               19
3:     4 2016        Kellog              NA               NA
4:     5 2016        Kellog              12                4
5:     6 2016        Kellog              15               19
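The calendar-window rule that (1a) relies on can also be checked by hand without zoo. The following is a rough base R sketch, not the answer's method: lag_sum is a hypothetical helper introduced here, and the month counter Year * 12 + Month is an assumption used to make "k months earlier" simple integer arithmetic.

```r
# Hypothetical helper: for each row, sum `value` over the k calendar months
# immediately before that row's month. `idx` is a month counter, so a missing
# month simply contributes nothing to the window.
lag_sum <- function(value, idx, k) {
  vapply(seq_along(idx), function(i) {
    w <- value[idx >= idx[i] - k & idx <= idx[i] - 1]
    if (length(w) == 0 || all(is.na(w))) NA_real_ else sum(w, na.rm = TRUE)
  }, numeric(1))
}

# Illustrative series with May and July missing: April, June, August 2016
year  <- c(2016, 2016, 2016)
month <- c(4, 6, 8)
value <- c(12, 6, 9)

out <- lag_sum(value, year * 12 + month, k = 2)
out
```

April has no earlier rows, June sees only April within its two-month window, and August sees only June. A helper like this could in principle be applied per company in the same way roll2 is, by passing Year * 12 + Month as the second argument inside a by = "Company" call, although it compares every row with every other row in its group and so scales quadratically.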

1b) Another approach to missing rows is to convert the data to long form and then to a rectangular form, filling in missing cells with NA. That will work as long as the same month and year is not missing in every company.

k <- 2 # sum over k prior months
m <- melt(mytable, id = 1:3)
dd <- as.data.frame.table(tapply(m$value, m[, 1:4, with = FALSE], c), 
    responseName = "value")
Sum1 <- function(x) {
   x <- head(x, -1)
   if (length(x) == 0 || all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}
setDT(dd)[, value := rollapplyr(value, k+1, Sum1, partial = TRUE), 
     by = .(Company, variable)]
dc <- as.data.table(dcast(... ~ variable, data = dd, value.var = "value"))
setkey(dc, Company, Year, Month)
dc

giving:

   Month Year       Company ProducedCereals CommercialsShown
1:     4 2016 General Mills              NA               NA
2:     5 2016 General Mills              NA               NA
3:     6 2016 General Mills               7               19
4:     4 2016        Kellog              NA               NA
5:     5 2016        Kellog              12                4
6:     6 2016        Kellog              15               19
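The "rectangular form" step in (1b) is what makes positional windows line up with calendar months. As an illustration only, the same idea can be sketched in base R with expand.grid and merge rather than the melt/tapply route used above; the names mydata, grid and full are introduced here and are not part of the answer.

```r
# Same values as mytable in the Note, restricted to one measure for brevity
mydata <- data.frame(
  Month = c(6, 5, 4, 6, 5),
  Year = 2016,
  Company = c("Kellog", "Kellog", "Kellog", "General Mills", "General Mills"),
  ProducedCereals = c(6, 3, 12, 5, 7),
  stringsAsFactors = FALSE
)

# Every Month x Year x Company combination that appears anywhere in the data
grid <- expand.grid(
  Month = sort(unique(mydata$Month)),
  Year = unique(mydata$Year),
  Company = unique(mydata$Company),
  stringsAsFactors = FALSE
)

# A left join onto the grid leaves NA where a company has no row for a month
full <- merge(grid, mydata, by = c("Month", "Year", "Company"), all.x = TRUE)
full <- full[order(full$Company, full$Year, full$Month), ]
full
```

As with (1b), a month that is absent for every company never enters the grid, so the caveat stated above still applies.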

2) Another possibility is to convert mytable to a zoo object z, splitting mytable by Company, and then use rollapply on that. mytable is again as shown in the Note at the end. Sum is from (1).

k <- 2 # prior 2 months

ym <- function(m, y) as.yearmon(paste(m, y), format = "%m %Y")
z <- read.zoo(mytable, index = 1:2, split = 3, FUN = ym) # split on column 3 (Company)

Sum <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
rollapply(z, list(-1:-k), Sum, partial = TRUE, fill = NA) 

giving:

         ProducedCereals.General Mills CommercialsShown.General Mills
Apr 2016                            NA                             NA
May 2016                            NA                             NA
Jun 2016                             7                             19
         ProducedCereals.Kellog CommercialsShown.Kellog
Apr 2016                     NA                      NA
May 2016                     12                       4
Jun 2016                     15                      19
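The negative offsets in (2) can be seen in isolation on a single series. This is a small sketch under the assumption that the series is in ascending month order with no gaps, which is what the split zoo object provides once missing cells are NA; the names z and res here refer to the illustrative series, not to the object built above.

```r
library(zoo)

k <- 2  # prior 2 months
Sum <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)

# Kellog's ProducedCereals in ascending order: April, May, June 2016
z <- zoo(c(12, 3, 6), as.yearmon(c("2016-04", "2016-05", "2016-06")))

# Offsets -1:-k refer to the k positions before the current one.
# partial = TRUE allows windows near the start that have fewer than k values.
res <- rollapply(z, list(-1:-k), Sum, partial = TRUE, fill = NA)
res
```

The June value is the sum of April and May, May only has April available, and April has nothing before it.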

Note: The code in the question does not generate the data displayed in the question, so we used this instead for the data.table mytable:

library(data.table)
mytable <-
structure(list(Month = c(6, 5, 4, 6, 5), Year = c(2016, 2016, 
2016, 2016, 2016), Company = c("Kellog", "Kellog", "Kellog", 
"General Mills", "General Mills"), ProducedCereals = c(6, 3, 
12, 5, 7), CommercialsShown = c(12, 15, 4, 20, 19)), .Names = c("Month", 
"Year", "Company", "ProducedCereals", "CommercialsShown"), row.names = c(NA, 
-5L), class = "data.frame")
mytable <- as.data.table(mytable)
