的R - 更快的方式计算滚动统计过的可变间隔 [英] R - Faster Way to Calculate Rolling Statistics Over a Variable Interval

查看:325
本文介绍了的R - 更快的方式计算滚动统计过的可变间隔的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我很好奇,如果有任何人能拿出一个(快)的方法来计算的滚动统计在一段时间(窗口)的可变间隔(滚动平均值,中位数,百分等)。

I'm curious if anyone out there can come up with a (faster) way to calculate rolling statistics (rolling mean, median, percentiles, etc.) over a variable interval of time (windowing).

也就是说,假设给出一个随机定时观测(即不是每天或每周的数据,观测只是有一个时间标记,如蜱数据),假设你想看看中央和分散的统计数据,你能够扩大并加强对这些计算这些统计数据的时间间隔。

That is, suppose one is given randomly timed observations (i.e. not daily, or weekly data, observations just have a time stamp, as in ticks data), and suppose you'd like to look at center and dispersion statistics that you are able to widen and tighten the interval of time over which these statistics are calculated.

我做了一个简单的for循环,这是否。但它显然运行速度很慢(其实我觉得我的循环仍超过数据中的一小我成立了以测试其速度运行)。我一直试图得到类似ddply做到这一点 - 这似乎straitforward去进行日常统计运行 - 但我似乎无法工作,我的出路吧。

I made a simple for loop that does this. But it obviously runs very slow (In fact I think my loop is still running over a small sample of data I set up to test its speed). I've been trying to get something like ddply to do this - which seems straitforward to get to run for daily stats - but I can't seem to work my way out of it.

例如:

示例设置:

df <- data.frame(Date = runif(1000,0,30))
df$Price <- I((df$Date)^0.5 * (rnorm(1000,30,4)))
df$Date <- as.Date(df$Date, origin = "1970-01-01")

示例功能(运行与许多意见很慢

SummaryStats <- function(dataframe, interval){
  # Returns daily simple summary stats, 
  # at varying intervals
  # dataframe is the data frame in question, with Date and Price obs
  # interval is the width of time to be treated as a day

  firstDay <- min(dataframe$Date)
  lastDay  <- max(dataframe$Date)
  result <- data.frame(Date = NULL,
                       Average = NULL,  Median = NULL,
                       Count = NULL,
                       Percentile25 = NULL, Percentile75 = NULL)

  for (Day in firstDay:lastDay){

    dataframe.sub = subset(dataframe,
                Date > (Day - (interval/2))
                & Date < (Day + (interval/2)))

    nu = data.frame(Date = Day, 
                    Average = mean(dataframe.sub$Price),
                    Median = median(dataframe.sub$Price),
                    Count = length(dataframe.sub$Price),
                    P25 = quantile(dataframe.sub$Price, 0.25),
                    P75 = quantile(dataframe.sub$Price, 0.75))

    result = rbind(result,nu)

  }

  return(result)

}

您的建议是欢迎!

推荐答案

RCPP 是,如果速度是你主要关心的一个好方法。我将使用滚动平均值统计用例子来解释。

Rcpp is a good approach if speed is your primary concern. I'll use the rolling mean statistic to explain by example.

测试:RCPP与ř

x = sort(runif(25000,0,4*pi))
y = sin(x) + rnorm(length(x),0.5,0.5)
system.time( rollmean_r(x,y,xout=x,width=1.1) )   # ~60 seconds
system.time( rollmean_cpp(x,y,xout=x,width=1.1) ) # ~0.0007 seconds

code为RCPP和R功能

cppFunction('
  NumericVector rollmean_cpp( NumericVector x, NumericVector y, 
                              NumericVector xout, double width) {
    double total=0;
    unsigned int n=x.size(), nout=xout.size(), i, ledge=0, redge=0;
    NumericVector out(nout);

    for( i=0; i<nout; i++ ) {
      while( x[ redge ] - xout[i] <= width && redge<n ) 
        total += y[redge++];
      while( xout[i] - x[ ledge ] > width && ledge<n ) 
        total -= y[ledge++];
      if( ledge==redge ) { out[i]=NAN; total=0; continue; }
      out[i] = total / (redge-ledge);
    }
    return out;
  }')

rollmean_r = function(x,y,xout,width) {
  out = numeric(length(xout))
  for( i in seq_along(xout) ) {
    window = x >= (xout[i]-width) & x <= (xout[i]+width)
    out[i] = .Internal(mean( y[window] ))
  }
  return(out)
}

现在的 rollmean_cpp 的explantion。 X 是数据。 XOUT 是指轧制统计数字是要求点的矢量。 宽度是滚动窗口的宽度* 2。需要注意的是滑动窗口的两端的indeces存储在窗台 R边缘。这些基本上指针到相应的元素 X 。这些的indeces可以是用于调用其他C ++函数(例如,中值等),该采取的载体和开始和结束的indeces作为输入非常有利的。

Now for an explantion of rollmean_cpp. x and y are the data. xout is a vector of points at which the rolling statistic is requested. width is the width*2 of the rolling window. Note that the indeces for the ends of sliding window are stored in ledge and redge. These are essentially pointers to the respective elements in x and y. These indeces could be very beneficial for calling other C++ functions (e.g., median and the like) that take a vector and starting and ending indeces as input.

对于那些谁想要 rollmean_cpp 的用于调试的冗长的版本(长篇):

For those who want a "verbose" version of rollmean_cpp for debugging (lengthy):

cppFunction('
  NumericVector rollmean_cpp( NumericVector x, NumericVector y, 
                              NumericVector xout, double width) {

    double total=0, oldtotal=0;
    unsigned int n=x.size(), nout=xout.size(), i, ledge=0, redge=0;
    NumericVector out(nout);


    for( i=0; i<nout; i++ ) {
      Rcout << "Finding window "<< i << " for x=" << xout[i] << "..." << std::endl;
      total = 0;

      // numbers to push into window
      while( x[ redge ] - xout[i] <= width && redge<n ) {
        Rcout << "Adding (x,y) = (" << x[redge] << "," << y[redge] << ")" ;
        Rcout << "; edges=[" << ledge << "," << redge << "]" << std::endl;
        total += y[redge++];
      }

      // numbers to pop off window
      while( xout[i] - x[ ledge ] > width && ledge<n ) {
        Rcout << "Removing (x,y) = (" << x[ledge] << "," << y[ledge] << ")";
        Rcout << "; edges=[" << ledge+1 << "," << redge-1 << "]" << std::endl;
        total -= y[ledge++];
      }
      if(ledge==n) Rcout << " OVER ";
      if( ledge==redge ) {
       Rcout<<" NO DATA IN INTERVAL " << std::endl << std::endl;
       oldtotal=total=0; out[i]=NAN; continue;}

      Rcout << "For interval [" << xout[i]-width << "," <<
               xout[i]+width << "], all points in interval [" << x[ledge] <<
               ", " << x[redge-1] << "]" << std::endl ;
      Rcout << std::endl;

      out[i] = ( oldtotal + total ) / (redge-ledge);
      oldtotal=total+oldtotal;
    }
    return out;
  }')

x = c(1,2,3,6,90,91)
y = c(9,8,7,5.2,2,1)
xout = c(1,2,2,3,6,6.1,13,90,100)
a = rollmean_cpp(x,y,xout=xout,2)
# Finding window 0 for x=1...
# Adding (x,y) = (1,9); edges=[0,0]
# Adding (x,y) = (2,8); edges=[0,1]
# Adding (x,y) = (3,7); edges=[0,2]
# For interval [-1,3], all points in interval [1, 3]
# 
# Finding window 1 for x=2...
# For interval [0,4], all points in interval [1, 3]
# 
# Finding window 2 for x=2...
# For interval [0,4], all points in interval [1, 3]
# 
# Finding window 3 for x=3...
# For interval [1,5], all points in interval [1, 3]
# 
# Finding window 4 for x=6...
# Adding (x,y) = (6,5.2); edges=[0,3]
# Removing (x,y) = (1,9); edges=[1,3]
# Removing (x,y) = (2,8); edges=[2,3]
# Removing (x,y) = (3,7); edges=[3,3]
# For interval [4,8], all points in interval [6, 6]
# 
# Finding window 5 for x=6.1...
# For interval [4.1,8.1], all points in interval [6, 6]
# 
# Finding window 6 for x=13...
# Removing (x,y) = (6,5.2); edges=[4,3]
# NO DATA IN INTERVAL 
# 
# Finding window 7 for x=90...
# Adding (x,y) = (90,2); edges=[4,4]
# Adding (x,y) = (91,1); edges=[4,5]
# For interval [88,92], all points in interval [90, 91]
# 
# Finding window 8 for x=100...
# Removing (x,y) = (90,2); edges=[5,5]
# Removing (x,y) = (91,1); edges=[6,5]
# OVER  NO DATA IN INTERVAL 

print(a)
# [1] 8.0 8.0 8.0 8.0 5.2 5.2 NaN 1.5 NaN

这篇关于的R - 更快的方式计算滚动统计过的可变间隔的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆