How to use parallel processing in R to analyze large time series data sets


Problem description

I have a large time series data set that normally takes 4 hours to process sequentially through the 1800 time series. I'm looking for a way to use several cores to reduce this time, because I have a number of these data sets to get through on a regular basis.

The R code I am using for sequential processing is below. There are 4 files containing different data sets, and each file contains over 1800 series. I have been trying to use doParallel to analyze each time series independently and concatenate the results into a single file. Even a CSV file would do.

# load the required packages
library(forecast)  # ets(), forecast()
library(xlsx)      # createWorkbook(), createSheet(), addDataFrame(), saveWorkbook()

# load the data sets
files <- c("3MH Clean", "3MH", "6MH", "12MH")
for (j in 1:4)
{
  cat(paste("\n\n\n Evaluation of", files[j], "- Started at", date(), "\n\n\n"))

  History <- read.csv(paste(files[j], "csv", sep = "."))

  # output forecasts to XLSX
  outwb <- createWorkbook()
  sheet <- createSheet(outwb, sheetName = paste(files[j], "- ETS"))
  # renamed from "Item" so it does not mask the Item column inside subset()
  items <- unique(History$Item)

  for (i in seq_along(items))
  {
    cat(paste("Evaluation of item", items[i], "-", i, "of", length(items), "\n"))
    data  <- subset(History, Item == items[i])
    dates <- unique(data$Date)
    d     <- as.Date(dates, format = "%d/%m/%Y")
    data.ts <- ts(data$Volume, frequency = 12,
                  start = c(as.numeric(format(d[1], "%Y")),
                            as.numeric(format(d[1], "%m"))))
    try(data.ets <- ets(data.ts))
    # use forecast(); the original assigned over the forecast.ets() function
    try(fc <- forecast(data.ets, h = 24))
    ets.df <- data.frame(fc)
    ets.df$Item <- rep(items[i], 24)  # one label per forecast row
    addDataFrame(ets.df, sheet, col.names = FALSE, startRow = 24 * (i - 1) + 2)
  }

  cat(paste("\n\n\n Evaluation of", files[j], "- Completed at", date(), "\n\n\n"))
  saveWorkbook(outwb, paste(files[j], "xlsx", sep = "."))
}
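Since the question mentions doParallel, here is one way the inner loop could be parallelized with foreach, which also works on Windows. This is a sketch, not tested against the real files: it assumes the same `History` data frame and `Item`/`Date`/`Volume` columns as above, and the output file name `forecasts.csv` is made up. Each worker returns a plain data frame, and the combined result is written sequentially afterwards, because the Java-backed xlsx workbook cannot safely be shared across workers.

```r
# Sketch: parallelize per-item forecasting with doParallel/foreach,
# assuming the History data frame and columns from the code above.
library(doParallel)
library(forecast)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

items <- unique(History$Item)
results <- foreach(it = items, .combine = rbind,
                   .packages = "forecast") %dopar% {
  data <- subset(History, Item == it)
  d <- as.Date(unique(data$Date), format = "%d/%m/%Y")
  data.ts <- ts(data$Volume, frequency = 12,
                start = c(as.numeric(format(d[1], "%Y")),
                          as.numeric(format(d[1], "%m"))))
  fc <- data.frame(forecast(ets(data.ts), h = 24))
  fc$Item <- it  # label the 24 forecast rows with their item
  fc
}
stopCluster(cl)

# As noted in the question, even a CSV would do:
write.csv(results, "forecasts.csv", row.names = FALSE)
```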

Recommended answer

I mirror the sentiment in the comments that the process should be vectorized as much as possible. In the absence of a reproducible example, I'll have to make some assumptions about your data. I assume that the series are stacked one on top of the other, with a variable indicating the series. If they are not, and the series are instead in separate columns, you can melt the data frame using reshape2 and then use the code below.

If you are using a Linux or Mac box, then you can use the parallel package and mclapply, provided you manage to vectorize your code a bit more. I'm partial to data.table; if you are unfamiliar with it, this may be a steep climb.

require(forecast)
require(data.table)
require(parallel)
require(zoo)  # for as.yearmon()

# create 10 sample series - not 1800, but you get the idea
data <- data.table(series = rep(1:10, each = 60),
                   date   = seq(as.Date("01/08/2009", format = "%d/%m/%Y"),
                                length.out = 60, by = "1 month"),
                   Volume = rep(rnorm(10), each = 60) + rnorm(600))

# define some functions to get the job done
getModel <- function(s) {
  data[order(date)][series == s][, ets(ts(Volume,
                                          start = as.yearmon(min(date)),
                                          frequency = 12))]
}
getForecast <- function(s, forward = 24) {
  model <- getModel(s)
  fc <- forecast(model, h = forward)
  return(data.frame(fc))
}

# compute the forecasts m at a time, where m is the number of cores
Forecasts <- mclapply(1:10, getForecast, mc.cores = detectCores())

With your list of data frames, you can do something like:

# write sequentially: the xlsx workbook lives in a single Java process,
# so side effects from forked mclapply workers would be lost
for (i in seq_along(Forecasts)) {
  addDataFrame(Forecasts[[i]], sheet, col.names = FALSE, startRow = 24 * (i - 1) + 2)
}
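Alternatively, if a single CSV is enough (as the question allows), the list of forecast data frames can be concatenated with data.table's rbindlist and written out in one go. This is a sketch against the sample data above, where the series are labeled 1:10; the file name `forecasts.csv` is made up, and you would swap in your own item labels.

```r
# Combine the list of forecast data frames into one table and write a CSV.
# The series ids 1:10 come from the sample data above.
out <- rbindlist(lapply(seq_along(Forecasts), function(i) {
  fc <- as.data.table(Forecasts[[i]])
  fc[, series := i]  # tag each forecast block with its series id
  fc
}))
write.csv(out, "forecasts.csv", row.names = FALSE)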
