R: Rolling window function with adjustable window and step-size for irregularly spaced observations

Problem description

Say there is a 2-column data frame with a time or distance column that increases sequentially and an observation column that may have NAs here and there. How can I efficiently use a sliding window function to get some statistic, say a mean, for the observations in a window of duration X (e.g. 5 seconds), slide the window over by Y seconds (e.g. 2.5 seconds), and repeat? The number of observations in a window is based on the time column, so both the number of observations per window and the number of observations by which the window slides may vary. The function should accept any window size up to the number of observations, and a step size.

Here is sample data (a larger sample set is given under the additional information further down):

set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
head(dat)
      time   measure
1 1.914806 1.0222694
2 2.937075 0.3490641
3 3.286140        NA
4 4.830448 0.8112979
5 5.641746 0.8773504
6 6.519096 1.2174924

Desired output for the specific case of a 5 second window, 2.5 second step, first window from -2.5 to 2.5, na.rm=FALSE:

 [1] 1.0222694
 [2]        NA
 [3]        NA
 [4] 1.0126639
 [5] 0.9965048
 [6] 0.9514456
 [7] 1.0518228
 [8]        NA
 [9]        NA
[10]        NA

Explanation: In the desired output, the very first window looks for times between -2.5 and 2.5. One observation of measure falls in this window, and it is not NA, so we get that observation: 1.0222694. The next window is from 0 to 5, and there is an NA in the window, so we get NA. The same holds for the window from 2.5 to 7.5. The next window is from 5 to 10. There are 5 observations in the window and none are NA, so we get the average of those 5 observations (i.e. mean(dat[dat$time > 5 & dat$time < 10, 'measure'])).
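To make the window logic concrete, here is a minimal base R sketch of the computation described above. The helper name roll_mean_ref, its arguments, and the toy data are made up for illustration, and the choice to treat each window as open on the left and closed on the right is an assumption. It loops over windows, so it is meant as a readable reference to check faster code against, not as an efficient solution.

```r
# Hypothetical reference helper (not from the original post).
# Returns the mean of `measure` for each window (s, s + width],
# where s starts at `start` and advances by `step`.
roll_mean_ref <- function(time, measure, width, step, start, na.rm = FALSE) {
  n_win  <- ceiling((max(time) - start) / step)
  starts <- start + step * (seq_len(n_win) - 1)
  vapply(starts, function(s) {
    in_win <- time > s & time <= s + width
    if (!any(in_win)) return(NA_real_)      # empty window
    mean(measure[in_win], na.rm = na.rm)    # NA propagates unless na.rm = TRUE
  }, numeric(1))
}

# Toy data with irregular spacing and one missing value (made-up numbers)
toy <- data.frame(time    = c(1, 2, 3, 6, 7, 8),
                  measure = c(10, NA, 30, 40, 50, 60))

res <- roll_mean_ref(toy$time, toy$measure, width = 5, step = 2.5, start = -2.5)
print(res)
# NA NA 40 50 60
```

The first two windows contain the NA at time 2, so they return NA; the remaining windows average 3, 3 and 1 observations respectively.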

What I tried: Here is what I tried for the specific case of a window where the step size is 1/2 the window duration:

windo <- 5  # duration in seconds of window

# partition into groups depending on which window(s) an observation falls in
# When step size >= window/2 and < window, need two grouping vectors
leaf1 <- round(ceiling(dat$time/(windo/2))+0.5)
leaf2 <- round(ceiling(dat$time/(windo/2))-0.5) 

l1 <- tapply(dat$measure, leaf1, mean)
l2 <- tapply(dat$measure, leaf2, mean)

as.vector(rbind(l2,l1))

Not flexible, not elegant, not efficient. If the step size isn't 1/2 the window size, the approach will not work as is.

Any thoughts on a general solution to this kind of problem? Any solution is acceptable. The faster the better, though I prefer solutions using base R, data.table, Rcpp, and/or parallel computation. In my real data set, there are several million observations contained in a list of data frames (the largest data frame has ~400,000 observations).
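As one possible base R direction, the sketch below avoids looping over windows by combining findInterval() with prefix sums. It is an illustration under stated assumptions, not a tested solution for the real data: the helper name roll_mean_cumsum and the toy data are made up, times are assumed sorted, windows are taken as open on the left and closed on the right, and a window containing any NA returns NA (the na.rm=FALSE behaviour).

```r
# Hypothetical helper (not from the original post): windowed means via
# findInterval() and cumulative sums. Assumes `time` is sorted increasingly.
roll_mean_cumsum <- function(time, measure, width, step, start) {
  stopifnot(!is.unsorted(time))
  n_win  <- ceiling((max(time) - start) / step)
  starts <- start + step * (seq_len(n_win) - 1)

  # For sorted `time`, findInterval(x, time) counts observations with time <= x,
  # so window k holds the observations with indices lo[k] + 1 .. hi[k].
  lo <- findInterval(starts, time)
  hi <- findInterval(starts + width, time)

  # Prefix sums of the values (NA counted as 0) and of the NA indicator
  miss <- is.na(measure)
  cs   <- c(0, cumsum(ifelse(miss, 0, measure)))
  nas  <- c(0, cumsum(miss))

  out <- (cs[hi + 1] - cs[lo + 1]) / (hi - lo)
  # Empty windows and windows containing an NA both yield NA
  out[hi == lo | (nas[hi + 1] - nas[lo + 1]) > 0] <- NA_real_
  out
}

# Toy data with irregular spacing and one missing value (made-up numbers)
toy <- data.frame(time    = c(1, 2, 3, 6, 7, 8),
                  measure = c(10, NA, 30, 40, 50, 60))
res <- roll_mean_cumsum(toy$time, toy$measure, width = 5, step = 2.5, start = -2.5)
print(res)
# NA NA 40 50 60
```

Because each mean is a difference of two running totals, results can differ from a direct mean() by floating-point rounding on long series, so comparisons are better made with all.equal() than with identical().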

Additional info: larger sample set

As per request, here is a larger, more realistic example dataset with many more NAs and the minimum time span (~0.03). To be clear, though, the list of data frames contains small ones like the one above, as well as ones like the following and larger:

set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),] 

# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)

Recommended answer

Here is an attempt with Rcpp. The function assumes that the data is sorted by time. More testing would be advisable, and adjustments could be made.

#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
NumericVector rollAverage(const NumericVector & times, 
                          NumericVector & vals, 
                          double start,
                          const double winlen, 
                          const double winshift) {
  int n = ceil((max(times) - start) / winshift);
  NumericVector winvals;
  NumericVector means(n);
  int ind1(0), ind2(0);
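  for(int i=0; i < n; i++) {
    // the current window spans start to start + winlen
    if (times[0] < (start+winlen)) {
      // move ind1 up towards the first observation after the window start;
      // the bound is checked first so times[ind1+1] is never read past the end
      while((ind1 < (times.size() - 1)) &&
                (times[ind1] <= start) && 
                (times[ind1+1] <= (start+winlen))) {
        ind1++;
      }    

      // move ind2 up to the last observation at or before the window end
      while((ind2 < (times.size() - 1)) && (times[ind2+1] <= (start+winlen))) {
        ind2++;
      }  

      if (times[ind1] >= start) {
        // mean() propagates NA, so any NA in the window gives NA
        winvals = vals[seq(ind1, ind2)];
        means[i] = mean(winvals);
      } else {
        // no observation falls inside the window
        means[i] = NA_REAL;
      }
    } else {
      // the window ends at or before the first observation
      means[i] = NA_REAL;
    }

    start += winshift;    
  }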

   return means;
}

Testing it:

set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_

rollAverage(dat$time, dat$measure, -2.5, 5.0, 2.5)
#[1] 1.0222694        NA        NA 1.0126639 0.9965048 0.9514456 1.0518228        NA        NA        NA
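Since rollAverage assumes the times are already sorted, it may be worth guarding against unsorted input before calling it. The following is a small base R sketch of such a guard; the helper name ensure_time_sorted, its time_col argument, and the toy data are made up for illustration.

```r
# Hypothetical helper (not from the original answer): return `df` ordered by
# its time column, reordering only when the column is not already sorted.
ensure_time_sorted <- function(df, time_col = "time") {
  if (is.unsorted(df[[time_col]], na.rm = TRUE)) {
    df <- df[order(df[[time_col]]), , drop = FALSE]
  }
  df
}

# Toy data in the wrong order (made-up numbers)
unsorted <- data.frame(time = c(3.2, 1.1, 2.4), measure = c(0.9, 1.3, 0.8))
sorted   <- ensure_time_sorted(unsorted)
print(sorted$time)
# 1.1 2.4 3.2

# The sorted columns could then be passed on, once the Rcpp function is compiled:
# rollAverage(sorted$time, sorted$measure, -2.5, 5.0, 2.5)
```

The is.unsorted() check is cheap compared with order(), so data that is already sorted passes through without being reordered.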

With your list of data.frames (using data.table):

set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),] 

# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)

library(data.table)
dat <- lapply(dat, setDT)
for (ind in seq_along(dat)) dat[[ind]][, i := ind]
#possibly there is a way to avoid these copies?

dat <- rbindlist(dat)

system.time(res <- dat[, rollAverage(time, measure, -2.5, 5.0, 2.5), by=i])
#user  system elapsed 
#1.51    0.02    1.54 
print(res)
#           i        V1
#      1:   1 1.0217126
#      2:   1 0.9334415
#      3:   1 0.9609050
#      4:   1 1.0123473
#      5:   1 0.9965922
#     ---              
#6000596: 300 1.1121296
#6000597: 300 0.9984581
#6000598: 300 1.0093060
#6000599: 300        NA
#6000600: 300        NA
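On the question of avoiding the per-element setDT and loop steps, one option that may help is the idcol argument of rbindlist, which adds the list position as a column while binding. The sketch below uses a made-up two-element list and a plain mean as a placeholder statistic; whether it is faster on the real data has not been measured here.

```r
library(data.table)

# Toy stand-in for the list of data frames (made-up numbers)
frames <- list(
  data.frame(time = c(1, 2, 3), measure = c(0.5, NA, 0.7)),
  data.frame(time = c(1, 4),    measure = c(1.1, 1.3))
)

# idcol = "i" adds a column holding each row's position in the input list,
# so no separate loop is needed to tag the rows.
combined <- rbindlist(frames, idcol = "i")

# One result per list element; mean() stands in for the rolling function.
res <- combined[, .(avg = mean(measure)), by = i]
print(res)
```

With an unnamed list the id column holds integer positions; with a named list it would hold the names instead.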
