自适应移动平均线 - R 中的顶级性能 [英] Adaptive moving average - top performance in R

查看:11
本文介绍了自适应移动平均线 - R 中的顶级性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找 R 中滚动/滑动窗口函数方面的一些性能提升.这是一项非常常见的任务,可用于任何有序的观察数据集.我想分享一些我的发现,也许有人可以提供反馈以使其更快.
重要的一点是,我关注的是 align="right" 和自适应滚动窗口的情况,所以 width 是一个向量(与我们的观察向量长度相同).如果我们将 width 作为标量,那么在 zooTTR 包中已经有了非常完善的功能,这些功能很难被击败(4 年后:这比我预期的要容易)因为其中一些甚至使用 Fortran(但使用下面提到的 wapply 仍然可以更快地使用用户定义的 FUN).
RcppRoll 包因其出色的性能而值得一提,但到目前为止还没有一个函数可以回答这个问题.如果有人可以扩展它来回答这个问题,那就太好了.

I am looking for some performance gains in terms of rolling/sliding window functions in R. It is quite common task which can be used in any ordered observations data set. I would like to share some of my findings, maybe somebody would be able to provide feedback to make it even faster.
Important note is that I focus on the case align="right" and adaptive rolling window, so width is a vector (same length as our observation vector). In case if we have width as scalar there are already very well developed functions in zoo and TTR packages which would be very hard to beat (4 years later: it was easier than I expected) as some of them are even using Fortran (but still user-defined FUNs can be faster using mentioned below wapply).
RcppRoll package is worth to mention due to its great performance, but so far there is no function which answers to that question. Would be great if someone could extend it to answer the question.

假设我们有以下数据:

x = c(120,105,118,140,142,141,135,152,154,138,125,132,131,120)
plot(x, type="l")

我们想在具有可变滚动窗口widthx向量上应用滚动函数.

And we want to apply rolling function over x vector with variable rolling window width.

set.seed(1)
width = sample(2:4,length(x),TRUE)

在这种特殊情况下,我们将滚动函数自适应于 c(2,3,4)sample.
我们将应用 mean 函数,预期结果:

In this particular case we would have rolling function adaptive to sample of c(2,3,4).
We will apply mean function, expected results:

r = f(x, width, FUN = mean)
print(r)
##  [1]       NA       NA 114.3333 120.7500 141.0000 135.2500 139.5000
##  [8] 142.6667 147.0000 146.0000 131.5000 128.5000 131.5000 127.6667
plot(x, type="l")
lines(r, col="red")

任何指标都可以用来生成 width 参数作为自适应移动平均线的不同变体,或任何其他函数.

Any indicator can be employed to produce width argument as different variants of adaptive moving averages, or any other function.

寻求最佳表现.

推荐答案

2018 年 12 月更新

自适应滚动功能的高效实现已在data.table 最近 - ?froll 手册.此外,已经确定了使用基础 R 的有效替代解决方案(fastama 下面).不幸的是,Kevin Ushey 的回答没有解决这个问题,因此它不包含在基准测试中.由于比较微秒毫无意义,因此增加了基准规模.

Efficient implementation of adaptive rolling functions has been made in data.table recently - more info in ?froll manual. Additionally an efficient alternative solution using base R has been identified (fastama below). Unfortunately Kevin Ushey's answer does not address the question thus it is not included in benchmark. Scale of benchmark has been increased as it pointless to compare microseconds.

set.seed(108)
x = rnorm(1e6)
width = rep(seq(from = 100, to = 500, by = 5), length.out=length(x))
microbenchmark(
  zoo=rollapplyr(x, width = width, FUN=mean, fill=NA),
  mapply=base_mapply(x, width=width, FUN=mean, na.rm=T),
  wmapply=wmapply(x, width=width, FUN=mean, na.rm=T),
  ama=ama(x, width, na.rm=T),
  fastama=fastama(x, width),
  frollmean=frollmean(x, width, na.rm=T, adaptive=TRUE),
  frollmean_exact=frollmean(x, width, na.rm=T, adaptive=TRUE, algo="exact"),
  times=1L
)
#Unit: milliseconds
#            expr          min           lq         mean       median           uq          max neval
#             zoo 32371.938248 32371.938248 32371.938248 32371.938248 32371.938248 32371.938248     1
#          mapply 13351.726032 13351.726032 13351.726032 13351.726032 13351.726032 13351.726032     1
#         wmapply 15114.774972 15114.774972 15114.774972 15114.774972 15114.774972 15114.774972     1
#             ama  9780.239091  9780.239091  9780.239091  9780.239091  9780.239091  9780.239091     1
#         fastama   351.618042   351.618042   351.618042   351.618042   351.618042   351.618042     1
#       frollmean     7.708054     7.708054     7.708054     7.708054     7.708054     7.708054     1
# frollmean_exact   194.115012   194.115012   194.115012   194.115012   194.115012   194.115012     1

ama = function(x, n, na.rm=FALSE, fill=NA, nf.rm=FALSE) {
  # more or less the same as previous forloopply
  stopifnot((nx<-length(x))==length(n))
  if (nf.rm) x[!is.finite(x)] = NA_real_
  ans = rep(NA_real_, nx)
  for (i in seq_along(x)) {
    ans[i] = if (i >= n[i])
      mean(x[(i-n[i]+1):i], na.rm=na.rm)
    else as.double(fill)
  }
  ans
}
fastama = function(x, n, na.rm, fill=NA) {
  if (!missing(na.rm)) stop("fast adaptive moving average implemented in R does not handle NAs, input having NAs will result in incorrect answer so not even try to compare to it")
  # fast implementation of adaptive moving average in R, in case of NAs incorrect answer
  stopifnot((nx<-length(x))==length(n))
  cs = cumsum(x)
  ans = rep(NA_real_, nx)
  for (i in seq_along(cs)) {
    ans[i] = if (i == n[i])
      cs[i]/n[i]
    else if (i > n[i])
      (cs[i]-cs[i-n[i]])/n[i]
    else as.double(fill)
  }
  ans
}

<小时>

旧答案:

我选择了 4 个不需要 C++ 的可用解决方案,很容易找到或 google.

I chose 4 available solutions which doesn't need to do to C++, quite easy to find or google.

# 1. rollapply
library(zoo)
?rollapplyr
# 2. mapply
base_mapply <- function(x, width, FUN, ...){
  FUN <- match.fun(FUN)
  f <- function(i, width, data){
    if(i < width) return(NA_real_)
    return(FUN(data[(i-(width-1)):i], ...))
  }
  mapply(FUN = f, 
         seq_along(x), width,
         MoreArgs = list(data = x))
}
# 3. wmapply - modified version of wapply found: https://rmazing.wordpress.com/2013/04/23/wapply-a-faster-but-less-functional-rollapply-for-vector-setups/
wmapply <- function(x, width, FUN = NULL, ...){
  FUN <- match.fun(FUN)
  SEQ1 <- 1:length(x)
  SEQ1[SEQ1 <  width] <- NA_integer_
  SEQ2 <- lapply(SEQ1, function(i) if(!is.na(i)) (i - (width[i]-1)):i)
  OUT <- lapply(SEQ2, function(i) if(!is.null(i)) FUN(x[i], ...) else NA_real_)
  return(base:::simplify2array(OUT, higher = TRUE))
}
# 4. forloopply - simple loop solution
forloopply <- function(x, width, FUN = NULL, ...){
  FUN <- match.fun(FUN)
  OUT <- numeric()
  for(i in 1:length(x)) {
    if(i < width[i]) next
    OUT[i] <- FUN(x[(i-(width[i]-1)):i], ...)
  }
  return(OUT)
}

以下是 prod 功能的时间安排.mean 函数可能已经在 rollapplyr 内部进行了优化.所有结果均等.

Below are the timings for prod function. mean function might be already optimized inside rollapplyr. All results equal.

library(microbenchmark)
# 1a. length(x) = 1000, window = 5-20
x <- runif(1000,0.5,1.5)
width <- rep(seq(from = 5, to = 20, by = 5), length(x)/4)
microbenchmark(
  rollapplyr(data = x, width = width, FUN = prod, fill = NA),
  base_mapply(x = x, width = width, FUN = prod, na.rm=T),
  wmapply(x = x, width = width, FUN = prod, na.rm=T),
  forloopply(x = x, width = width, FUN = prod, na.rm=T),
  times=100L
)
Unit: milliseconds
                                                       expr       min        lq    median       uq       max neval
 rollapplyr(data = x, width = width, FUN = prod, fill = NA) 59.690217 60.694364 61.979876 68.55698 153.60445   100
   base_mapply(x = x, width = width, FUN = prod, na.rm = T) 14.372537 14.694266 14.953234 16.00777  99.82199   100
       wmapply(x = x, width = width, FUN = prod, na.rm = T)  9.384938  9.755893  9.872079 10.09932  84.82886   100
    forloopply(x = x, width = width, FUN = prod, na.rm = T) 14.730428 15.062188 15.305059 15.76560 342.44173   100

# 1b. length(x) = 1000, window = 50-200
x <- runif(1000,0.5,1.5)
width <- rep(seq(from = 50, to = 200, by = 50), length(x)/4)
microbenchmark(
  rollapplyr(data = x, width = width, FUN = prod, fill = NA),
  base_mapply(x = x, width = width, FUN = prod, na.rm=T),
  wmapply(x = x, width = width, FUN = prod, na.rm=T),
  forloopply(x = x, width = width, FUN = prod, na.rm=T),
  times=100L
)
Unit: milliseconds
                                                       expr      min       lq   median       uq      max neval
 rollapplyr(data = x, width = width, FUN = prod, fill = NA) 71.99894 74.19434 75.44112 86.44893 281.6237   100
   base_mapply(x = x, width = width, FUN = prod, na.rm = T) 15.67158 16.10320 16.39249 17.20346 103.6211   100
       wmapply(x = x, width = width, FUN = prod, na.rm = T) 10.88882 11.54721 11.75229 12.19790 106.1170   100
    forloopply(x = x, width = width, FUN = prod, na.rm = T) 15.70704 16.06983 16.40393 17.14210 108.5005   100

# 2a. length(x) = 10000, window = 5-20
x <- runif(10000,0.5,1.5)
width <- rep(seq(from = 5, to = 20, by = 5), length(x)/4)
microbenchmark(
  rollapplyr(data = x, width = width, FUN = prod, fill = NA),
  base_mapply(x = x, width = width, FUN = prod, na.rm=T),
  wmapply(x = x, width = width, FUN = prod, na.rm=T),
  forloopply(x = x, width = width, FUN = prod, na.rm=T),
  times=100L
)
Unit: milliseconds
                                                       expr       min       lq   median       uq       max neval
 rollapplyr(data = x, width = width, FUN = prod, fill = NA) 753.87882 781.8789 809.7680 872.8405 1116.7021   100
   base_mapply(x = x, width = width, FUN = prod, na.rm = T) 148.54919 159.9986 231.5387 239.9183  339.7270   100
       wmapply(x = x, width = width, FUN = prod, na.rm = T)  98.42682 105.2641 117.4923 183.4472  245.4577   100
    forloopply(x = x, width = width, FUN = prod, na.rm = T) 533.95641 602.0652 646.7420 672.7483  922.3317   100

# 2b. length(x) = 10000, window = 50-200
x <- runif(10000,0.5,1.5)
width <- rep(seq(from = 50, to = 200, by = 50), length(x)/4)
microbenchmark(
  rollapplyr(data = x, width = width, FUN = prod, fill = NA),
  base_mapply(x = x, width = width, FUN = prod, na.rm=T),
  wmapply(x = x, width = width, FUN = prod, na.rm=T),
  forloopply(x = x, width = width, FUN = prod, na.rm=T),
  times=100L
)
Unit: milliseconds
                                                       expr      min       lq    median        uq       max neval
 rollapplyr(data = x, width = width, FUN = prod, fill = NA) 912.5829 946.2971 1024.7245 1071.5599 1431.5289   100
   base_mapply(x = x, width = width, FUN = prod, na.rm = T) 171.3189 180.6014  260.8817  269.5672  344.4500   100
       wmapply(x = x, width = width, FUN = prod, na.rm = T) 123.1964 131.1663  204.6064  221.1004  484.3636   100
    forloopply(x = x, width = width, FUN = prod, na.rm = T) 561.2993 696.5583  800.9197  959.6298 1273.5350   100

这篇关于自适应移动平均线 - R 中的顶级性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆