Sliding time intervals for time series data in R

Question

I am trying to extract interesting statistics from an irregular time series data set, but I am coming up short on finding the right tools for the job. Tools for manipulating regularly sampled time series, or index-based series of any kind, are easy to find, but I'm not having much luck with the problems I'm trying to solve.

First, a reproducible data set:

library(zoo)
set.seed(0)
nSamples    <- 5000
vecDT       <- rexp(nSamples, 3)
vecTimes    <- cumsum(c(0,vecDT))
vecDrift    <- c(0, rnorm(nSamples, mean = 1/nSamples, sd = 0.01))
vecVals     <- cumsum(vecDrift)
vecZ        <- zoo(vecVals, order.by = vecTimes)
rm(vecDT, vecDrift)

Assume the times are in seconds. There are almost 1700 seconds (just shy of 30 minutes) in the vecZ series, and 5001 entries during that time. (NB: I'd try using xts, but xts seems to need date information, and I'd rather not use a particular date when it's not relevant.)

My goals are the following:

  • Identify the indices of the values 3 minutes before and 3 minutes after each point. As the times are continuous, I doubt that any two points are precisely 3 minutes apart. What I'd like to find are the points that are at most 3 minutes prior to, and at least 3 minutes after, the given point, i.e. something like the following (in pseudocode):

backIX(t, vecZ, tDelta)    = min {ix in 1:length(vecZ) : t - time(ix) < tDelta}
forwardIX(t, vecZ, tDelta) = min {ix in 1:length(vecZ) : time(ix) - t > tDelta}

So, for 3 minutes, tDelta = 180. If t = 2500, then the result of forwardIX() would be 3012 (i.e. time(vecZ)[2500] is 860.1462 and time(vecZ)[3012] is 1040.403, or just over 180 seconds later), and the output of backIX() would be 2020 (corresponding to time 680.7162 seconds).

Ideally, I would like to use a function that does not require t, as that is going to require length(vecZ) calls to the function, which ignores the fact that sliding windows of time can be calculated more efficiently.

  • Apply a function to all values in a rolling window of time. I've seen rollapply, which takes a fixed window size (i.e. a fixed number of indices, not a fixed window of time). I can solve this the naive way, with a loop (or foreach ;-)) evaluated per index t, but I wondered whether some simple functions are already implemented, e.g. a function to calculate the mean of all values in a given time frame. Since this can be done efficiently via simple summary statistics that slide over a window, it should be computationally cheaper than a function that accesses all of the data multiple times to calculate each statistic. Some fairly natural functions: mean, min, max, and median.

Even if the window isn't varying by time, the ability to vary the window size would be adequate, and I can find that window size using the result of the question above. However, that still seems to require excess calculations, so being able to specify time-based intervals seems more efficient.
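One way the "simple summary statistics that slide over a window" idea could be sketched for the mean is with prefix sums and findInterval() from base R. This is only an illustration under assumptions: the helper name rolling_time_mean is hypothetical (it is not from zoo or any other package), and the window is taken to be the open interval (t - w, t + w) centred on each point. The prefix-sum trick applies to sums and means only; min, max, and median would need a different approach.

```r
# Hypothetical helper (not from any package): rolling mean over the open
# time window (t - w, t + w) around each point, using prefix sums.
# Assumes `times` is sorted in increasing order and w > 0.
rolling_time_mean <- function(times, vals, w) {
  stopifnot(length(times) == length(vals), !is.unsorted(times), w > 0)
  # First index whose time is strictly greater than t - w
  lo <- findInterval(times - w, times) + 1L
  # Last index whose time is strictly less than t + w
  # (left.open = TRUE makes findInterval count strictly smaller values)
  hi <- findInterval(times + w, times, left.open = TRUE)
  # Prefix sums: sum(vals[lo:hi]) equals cs[hi + 1] - cs[lo]
  cs <- c(0, cumsum(vals))
  (cs[hi + 1L] - cs[lo]) / (hi - lo + 1L)
}

# Same data-generating steps as the reproducible data set in the question
set.seed(0)
nSamples <- 5000
vecTimes <- cumsum(c(0, rexp(nSamples, 3)))
vecVals  <- cumsum(c(0, rnorm(nSamples, mean = 1/nSamples, sd = 0.01)))

rollMean <- rolling_time_mean(vecTimes, vecVals, w = 180)
head(rollMean)
```

Each window is resolved by a binary search rather than a scan, so the whole series costs roughly O(n log n). For very long series, the subtraction of large prefix sums can lose floating-point precision, which is a trade-off of this design.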

Are there packages in R that facilitate such manipulations of data in time-windows, or am I out of luck and I should write my own functions?

Note 1: This question seeks to do something similar, except over disjoint intervals, rather than rolling windows of time, e.g. I could adapt this to do my analysis on every successive 3 minute block, but I don't see a way to adapt this for rolling 3 minute intervals.

Note 2: I've found that switching from a zoo object to a numeric vector (for the times) has significantly sped up the issue of range-finding / window endpoint identification for the first goal. That's still a naive algorithm, but it's worth mentioning that working with zoo objects may not be optimal for the naive approach.
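A non-naive version of the endpoint lookup on a plain numeric vector might look like the following sketch, which relies on base R's findInterval() (a binary search over a sorted vector). The function name window_indices and the choice to return NA when no point lies more than tDelta ahead are assumptions made for illustration; the back and forward columns follow the pseudocode definitions of backIX() and forwardIX(), evaluated for every point at once.

```r
# Hypothetical helper: vectorised window endpoints for all points at once.
# Assumes `times` is a sorted numeric vector (not a zoo object).
window_indices <- function(times, tDelta) {
  stopifnot(!is.unsorted(times))
  n <- length(times)
  # back: smallest index ix with times[t] - times[ix] < tDelta,
  # i.e. the first time strictly greater than times[t] - tDelta
  back <- findInterval(times - tDelta, times) + 1L
  # forward: smallest index ix with times[ix] - times[t] > tDelta,
  # i.e. the first time strictly greater than times[t] + tDelta
  fwd <- findInterval(times + tDelta, times) + 1L
  fwd[fwd > n] <- NA_integer_  # assumption: NA when no such point exists
  data.frame(back = back, forward = fwd)
}

# Times generated as in the question's reproducible data set
set.seed(0)
vecTimes <- cumsum(c(0, rexp(5000, 3)))
ix <- window_indices(vecTimes, tDelta = 180)
ix[2500, ]  # can be compared with the indices quoted in the question
```

Because the lookup is done for the whole vector in one call, there is no need for length(vecZ) separate function calls.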

Recommended answer

As of version 1.9.8 (on CRAN 25 Nov 2016), data.table can aggregate in a non-equi join, which can be used to apply a rolling function over a sliding time window of an irregular time series.

For demonstration and verification, a smaller dataset is used.

library(data.table)   # development version 1.11.9 used

# create small dataset
set.seed(0)
nSamples    <- 10
vecDT       <- rexp(nSamples, 3)
vecTimes    <- cumsum(c(0,vecDT))
vecVals     <- 0:nSamples
vec         <- data.table(vecTimes, vecVals)
vec




      vecTimes vecVals
 1: 0.00000000       0
 2: 0.06134553       1
 3: 0.10991444       2
 4: 0.15651286       3
 5: 0.30186907       4
 6: 1.26685858       5
 7: 1.67671260       6
 8: 1.85660688       7
 9: 2.17546271       8
10: 2.22447804       9
11: 2.68805641      10




# define window size in seconds 
win_sec = 0.3

# aggregate in sliding window by a non-equi join
vec[.(t = vecTimes, upper = vecTimes + win_sec, lower = vecTimes - win_sec), 
    on = .(vecTimes < upper, vecTimes > lower), 
    .(t, .N, sliding_mean = mean(vecVals)), by = .EACHI]




     vecTimes     vecTimes          t N sliding_mean
 1: 0.3000000 -0.300000000 0.00000000 4          1.5
 2: 0.3613455 -0.238654473 0.06134553 5          2.0
 3: 0.4099144 -0.190085564 0.10991444 5          2.0
 4: 0.4565129 -0.143487143 0.15651286 5          2.0
 5: 0.6018691  0.001869065 0.30186907 4          2.5
 6: 1.5668586  0.966858578 1.26685858 1          5.0
 7: 1.9767126  1.376712596 1.67671260 2          6.5
 8: 2.1566069  1.556606875 1.85660688 2          6.5
 9: 2.4754627  1.875462707 2.17546271 2          8.5
10: 2.5244780  1.924478037 2.22447804 2          8.5
11: 2.9880564  2.388056413 2.68805641 1         10.0


The first two columns show the upper and lower bounds of the time interval, respectively; t is the original vecTimes, and N denotes the number of data points included in the calculation of the sliding mean.
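For verification, the sliding mean can be cross-checked with a deliberately naive base R loop over the same open window (t - win_sec, t + win_sec). This is a sketch for checking results on small data, not an efficient method; the helper name naive_window_stat is hypothetical, and the times are copied from the printed table above. The same helper accepts other summary functions such as median or max.

```r
# Hypothetical helper: naive O(n^2) window statistic, for checking only.
# For each time t, FUN is applied to all values whose time lies in the
# open interval (t - w, t + w).
naive_window_stat <- function(times, vals, w, FUN = mean) {
  vals <- as.numeric(vals)
  vapply(times,
         function(t) FUN(vals[times > t - w & times < t + w]),
         numeric(1))
}

# Times copied from the printed vec table above
vecTimes <- c(0.00000000, 0.06134553, 0.10991444, 0.15651286, 0.30186907,
              1.26685858, 1.67671260, 1.85660688, 2.17546271, 2.22447804,
              2.68805641)
vecVals <- 0:10
win_sec <- 0.3

check <- data.frame(
  t              = vecTimes,
  sliding_mean   = naive_window_stat(vecTimes, vecVals, win_sec, mean),
  sliding_median = naive_window_stat(vecTimes, vecVals, win_sec, median),
  sliding_max    = naive_window_stat(vecTimes, vecVals, win_sec, max)
)
check
```

The sliding_mean column here should agree with the sliding_mean column of the non-equi join result.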
