R - 在可变间隔内计算滚动统计的更快方法 [英] R - Faster Way to Calculate Rolling Statistics Over a Variable Interval
问题描述
我很好奇是否有人能想出一种(更快)方法来计算可变时间间隔(窗口)内的滚动统计数据(滚动平均值、中位数、百分位数等).
I'm curious if anyone out there can come up with a (faster) way to calculate rolling statistics (rolling mean, median, percentiles, etc.) over a variable interval of time (windowing).
也就是说,假设给定随机定时观察(即不是每日或每周数据,观察只有一个时间戳,如滴答数据),并假设您想查看中心和离差统计数据能够扩大和缩小计算这些统计数据的时间间隔.
That is, suppose one is given randomly timed observations (i.e. not daily, or weekly data, observations just have a time stamp, as in ticks data), and suppose you'd like to look at center and dispersion statistics that you are able to widen and tighten the interval of time over which these statistics are calculated.
我做了一个简单的 for 循环来做到这一点.但它显然运行得很慢(事实上,我认为我的循环仍在运行我为测试其速度而设置的一小部分数据样本).我一直在尝试使用 ddply 之类的方法来执行此操作 - 这对于运行每日统计数据似乎很严格 - 但我似乎无法摆脱它.
I made a simple for loop that does this. But it obviously runs very slow (In fact I think my loop is still running over a small sample of data I set up to test its speed). I've been trying to get something like ddply to do this - which seems straitforward to get to run for daily stats - but I can't seem to work my way out of it.
示例:
样品设置:
df <- data.frame(Date = runif(1000,0,30))
df$Price <- I((df$Date)^0.5 * (rnorm(1000,30,4)))
df$Date <- as.Date(df$Date, origin = "1970-01-01")
示例函数(运行速度非常慢,有很多观察结果
SummaryStats <- function(dataframe, interval){
# Returns daily simple summary stats,
# at varying intervals
# dataframe is the data frame in question, with Date and Price obs
# interval is the width of time to be treated as a day
firstDay <- min(dataframe$Date)
lastDay <- max(dataframe$Date)
result <- data.frame(Date = NULL,
Average = NULL, Median = NULL,
Count = NULL,
Percentile25 = NULL, Percentile75 = NULL)
for (Day in firstDay:lastDay){
dataframe.sub = subset(dataframe,
Date > (Day - (interval/2))
& Date < (Day + (interval/2)))
nu = data.frame(Date = Day,
Average = mean(dataframe.sub$Price),
Median = median(dataframe.sub$Price),
Count = length(dataframe.sub$Price),
P25 = quantile(dataframe.sub$Price, 0.25),
P75 = quantile(dataframe.sub$Price, 0.75))
result = rbind(result,nu)
}
return(result)
}
欢迎您的建议!
推荐答案
Rcpp如果速度是您的主要关注点,这是一个很好的方法.我将使用滚动均值统计量来举例说明.
Rcpp is a good approach if speed is your primary concern. I'll use the rolling mean statistic to explain by example.
基准:Rcpp 与 R
x = sort(runif(25000,0,4*pi))
y = sin(x) + rnorm(length(x),0.5,0.5)
system.time( rollmean_r(x,y,xout=x,width=1.1) ) # ~60 seconds
system.time( rollmean_cpp(x,y,xout=x,width=1.1) ) # ~0.0007 seconds
Rcpp 和 R 函数的代码
cppFunction('
NumericVector rollmean_cpp( NumericVector x, NumericVector y,
NumericVector xout, double width) {
double total=0;
unsigned int n=x.size(), nout=xout.size(), i, ledge=0, redge=0;
NumericVector out(nout);
for( i=0; i<nout; i++ ) {
while( x[ redge ] - xout[i] <= width && redge<n )
total += y[redge++];
while( xout[i] - x[ ledge ] > width && ledge<n )
total -= y[ledge++];
if( ledge==redge ) { out[i]=NAN; total=0; continue; }
out[i] = total / (redge-ledge);
}
return out;
}')
rollmean_r = function(x,y,xout,width) {
out = numeric(length(xout))
for( i in seq_along(xout) ) {
window = x >= (xout[i]-width) & x <= (xout[i]+width)
out[i] = .Internal(mean( y[window] ))
}
return(out)
}
现在解释rollmean_cpp
.x
和 y
是数据.xout
是请求滚动统计的点的向量.width
是滚动窗口的宽度*2.请注意,滑动窗口末端的 indeces 存储在 ledge
和 rededge
中.这些本质上是指向 x
和 y
中相应元素的指针.这些 indeces 对于调用其他 C++ 函数(例如,median 等)非常有益,这些函数将向量和起始和结束 indeces 作为输入.
Now for an explantion of rollmean_cpp
. x
and y
are the data. xout
is a vector of points at which the rolling statistic is requested. width
is the width*2 of the rolling window. Note that the indeces for the ends of sliding window are stored in ledge
and redge
. These are essentially pointers to the respective elements in x
and y
. These indeces could be very beneficial for calling other C++ functions (e.g., median and the like) that take a vector and starting and ending indeces as input.
对于那些想要详细"版本的 rollmean_cpp
进行调试(冗长)的人:
For those who want a "verbose" version of rollmean_cpp
for debugging (lengthy):
cppFunction('
NumericVector rollmean_cpp( NumericVector x, NumericVector y,
NumericVector xout, double width) {
double total=0, oldtotal=0;
unsigned int n=x.size(), nout=xout.size(), i, ledge=0, redge=0;
NumericVector out(nout);
for( i=0; i<nout; i++ ) {
Rcout << "Finding window "<< i << " for x=" << xout[i] << "..." << std::endl;
total = 0;
// numbers to push into window
while( x[ redge ] - xout[i] <= width && redge<n ) {
Rcout << "Adding (x,y) = (" << x[redge] << "," << y[redge] << ")" ;
Rcout << "; edges=[" << ledge << "," << redge << "]" << std::endl;
total += y[redge++];
}
// numbers to pop off window
while( xout[i] - x[ ledge ] > width && ledge<n ) {
Rcout << "Removing (x,y) = (" << x[ledge] << "," << y[ledge] << ")";
Rcout << "; edges=[" << ledge+1 << "," << redge-1 << "]" << std::endl;
total -= y[ledge++];
}
if(ledge==n) Rcout << " OVER ";
if( ledge==redge ) {
Rcout<<" NO DATA IN INTERVAL " << std::endl << std::endl;
oldtotal=total=0; out[i]=NAN; continue;}
Rcout << "For interval [" << xout[i]-width << "," <<
xout[i]+width << "], all points in interval [" << x[ledge] <<
", " << x[redge-1] << "]" << std::endl ;
Rcout << std::endl;
out[i] = ( oldtotal + total ) / (redge-ledge);
oldtotal=total+oldtotal;
}
return out;
}')
x = c(1,2,3,6,90,91)
y = c(9,8,7,5.2,2,1)
xout = c(1,2,2,3,6,6.1,13,90,100)
a = rollmean_cpp(x,y,xout=xout,2)
# Finding window 0 for x=1...
# Adding (x,y) = (1,9); edges=[0,0]
# Adding (x,y) = (2,8); edges=[0,1]
# Adding (x,y) = (3,7); edges=[0,2]
# For interval [-1,3], all points in interval [1, 3]
#
# Finding window 1 for x=2...
# For interval [0,4], all points in interval [1, 3]
#
# Finding window 2 for x=2...
# For interval [0,4], all points in interval [1, 3]
#
# Finding window 3 for x=3...
# For interval [1,5], all points in interval [1, 3]
#
# Finding window 4 for x=6...
# Adding (x,y) = (6,5.2); edges=[0,3]
# Removing (x,y) = (1,9); edges=[1,3]
# Removing (x,y) = (2,8); edges=[2,3]
# Removing (x,y) = (3,7); edges=[3,3]
# For interval [4,8], all points in interval [6, 6]
#
# Finding window 5 for x=6.1...
# For interval [4.1,8.1], all points in interval [6, 6]
#
# Finding window 6 for x=13...
# Removing (x,y) = (6,5.2); edges=[4,3]
# NO DATA IN INTERVAL
#
# Finding window 7 for x=90...
# Adding (x,y) = (90,2); edges=[4,4]
# Adding (x,y) = (91,1); edges=[4,5]
# For interval [88,92], all points in interval [90, 91]
#
# Finding window 8 for x=100...
# Removing (x,y) = (90,2); edges=[5,5]
# Removing (x,y) = (91,1); edges=[6,5]
# OVER NO DATA IN INTERVAL
print(a)
# [1] 8.0 8.0 8.0 8.0 5.2 5.2 NaN 1.5 NaN
这篇关于R - 在可变间隔内计算滚动统计的更快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!