Improve rolling mean usage in data.table


Question

I'm trying to put together a function that replicates the following:

library(tidyverse)
library(magrittr)
library(data.table)
library(parallel)
library(RcppRoll)

windows <- (1:10)*600

df2 <- setDT(df_1, key=c("Match","Name"))[
  ,by=.(Match, Name), paste0("Period_", 1:10)
  := mclapply((1:10)*600, function(x) roll_mean(Dist, x))][]

It creates a rolling average for each window length assigned to windows. I have a working function that replicates it; however, I have a feeling there's a better way of doing it, as the function version takes almost 30 times longer to process the data:
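To make "a rolling average for each value in windows" concrete, here is a minimal base-R sketch on a toy vector. The helper `rolling_mean_right`, the toy data, and the right-aligned, NA-padded output are assumptions made for illustration; `roll_mean` in the question follows RcppRoll's own defaults for alignment and fill.

```r
# Hypothetical helper (not from the post): right-aligned rolling mean
# computed from cumulative sums, padded with NA at the start.
rolling_mean_right <- function(x, n) {
  if (n > length(x)) return(rep(NA_real_, length(x)))
  cs <- cumsum(c(0, x))
  # Sum of the n values ending at position i is cs[i + 1] - cs[i + 1 - n].
  window_sums <- cs[(n + 1):length(cs)] - cs[1:(length(cs) - n)]
  c(rep(NA_real_, n - 1), window_sums / n)
}

toy_dist    <- c(1, 2, 3, 4, 5, 6)   # stand-in for one group's Dist values
toy_windows <- c(2, 3)               # stand-in for (1:10) * 600

# One output column per window length, named like the question's columns.
toy_means <- lapply(toy_windows, function(n) rolling_mean_right(toy_dist, n))
names(toy_means) <- paste0("Period_", toy_windows)
toy_means <- as.data.frame(toy_means)
print(toy_means)
```

Each window length yields one column of the same length as the input, which is the shape a by-group `:=` assignment needs.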

dt_rolling <- function(df, the.keys, x, y, z, window){
  df <- data.table(df)
  setkeyv(df, the.keys) 
  df[,by=.(x,y), paste0("Period_", window) := mclapply(window, function(a) roll_mean(z, a))][]
}


df2 <- dt_rolling(df_1, the.keys=c('Match', 'Name'), df_1$Match, df_1$Name, df_1$Dist, windows)

The data in question is as follows:

> dput(head(df_1, 5))
structure(list(Match = c("BathH", "BathH", "BathH", "BathH", 
"BathH"), Name = c("Alafoti Faosiliva", "Alafoti Faosiliva", 
"Alafoti Faosiliva", "Alafoti Faosiliva", "Alafoti Faosiliva"
), Dist = c(0, 0, 0, 0, 0), Period_1 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_2 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_3 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_4 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_5 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_6 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_7 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_8 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_9 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_10 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_600 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_1200 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_1800 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_2400 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_3000 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_3600 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_4200 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_4800 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_5400 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), Period_6000 = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_)), sorted = c("Match", "Name"), class = c("data.table", 
"data.frame"), row.names = c(NA, -5L), .internal.selfref = <pointer: 0x10280cae0>)

The data can extend to over 20 million rows, which is why I'm using a data.table approach here, along with investigating turning it into a function.

Edit:

Following @jangorecki's answer below regarding the addition of data.table::frollmean(), I compared frollmean to an Rcpp-based rolling average function using microbenchmark on a dataset with 1,500,000 rows.

Unit: seconds
       expr      min       lq     mean   median       uq      max neval cld
       rcpp 1.056967 1.224827 1.374116 1.304310 1.467108 5.855003  1000  a 
 data.table 1.096122 1.306993 1.466128 1.389878 1.549299 9.287606  1000   b
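A comparison of this kind could be set up roughly as follows. The Rcpp function behind the timings above is not shown in the post, so this sketch swaps in a base-R cumulative-sum helper (the hypothetical `base_roll`) as the second contender. The input size, window length, and `times` value are arbitrary assumptions, and microbenchmark is only called if it is installed.

```r
library(data.table)

# Hypothetical stand-in for the second contender: right-aligned rolling
# mean from cumulative sums (the post's Rcpp function is not shown).
base_roll <- function(x, n) {
  cs <- cumsum(c(0, x))
  c(rep(NA_real_, n - 1),
    (cs[(n + 1):length(cs)] - cs[1:(length(cs) - n)]) / n)
}

set.seed(1)
bench_x <- rnorm(1e5)   # assumed size; the post used 1,500,000 rows
bench_n <- 600L

# Check that both implementations agree before timing them.
dt_result   <- frollmean(bench_x, bench_n)
base_result <- base_roll(bench_x, bench_n)
same_result <- isTRUE(all.equal(dt_result, base_result))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    data.table = frollmean(bench_x, bench_n),
    base       = base_roll(bench_x, bench_n),
    times      = 20L
  ))
}
```

Checking equality first guards against timing two functions that differ in alignment or NA padding.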

Recommended answer

A fast rolling mean has been available in data.table since v1.12.0.
The following query will address your question.

df_1[, paste0("Period_", windows) := frollmean(Dist, windows)]
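Since the question asks for a reusable function, one possible wrapper around frollmean is sketched below. The function name `add_rolling_means` and its arguments are hypothetical. The sketch also assumes the per-group calculation from the question (by Match and Name) is wanted, which the one-liner above does not include.

```r
library(data.table)

# Hypothetical wrapper (name and arguments are illustrative only).
# Columns are passed as character strings rather than as vectors.
add_rolling_means <- function(df, group_cols, value_col, windows) {
  # Work on a copy so the caller's object is not modified by reference.
  dt <- if (is.data.table(df)) copy(df) else as.data.table(df)
  setkeyv(dt, group_cols)
  new_cols <- paste0("Period_", windows)
  # frollmean returns one vector per window length when given several windows.
  dt[, (new_cols) := frollmean(get(value_col), windows), by = group_cols]
  dt[]
}

# Toy data standing in for df_1.
toy <- data.frame(
  Match = rep(c("A", "B"), each = 4),
  Name  = "p1",
  Dist  = c(1, 2, 3, 4, 10, 20, 30, 40)
)
toy_result <- add_rolling_means(toy, c("Match", "Name"), "Dist", c(2, 3))
print(toy_result)
```

Passing column names as strings lets data.table group and subset by reference, unlike passing df_1$Match, df_1$Name, and df_1$Dist as full-length vectors.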
