Handling Consecutive Missing Values in Time-Series Data


Problem description

I have time-series data as shown below.

2015-04-26 23:00:00  5704.27388916015661380
2015-04-27 00:00:00  4470.30868326822928793
2015-04-27 01:00:00  4552.57241617838553793
2015-04-27 02:00:00  4570.22250032825650123
2015-04-27 03:00:00  NA
2015-04-27 04:00:00  NA
2015-04-27 05:00:00  NA
2015-04-27 06:00:00 12697.37724086216439900
2015-04-27 07:00:00  5538.71119009653739340
2015-04-27 08:00:00    81.95060647328695325
2015-04-27 09:00:00  8550.65816895300667966
2015-04-27 10:00:00  2925.76573206583680076

How should I handle consecutive NA values? When there is only a single NA, I replace it with the average of the two values on either side of it. Are there any standard approaches for dealing with runs of consecutive missing values?

Recommended answer

The zoo package has several functions for dealing with NA values. One of the following might suit your needs:

  • na.locf: Last observation carried forward (LOCF). Setting fromLast = TRUE gives next observation carried backward (NOCB) instead.
  • na.aggregate: Replaces the NAs with an aggregated value. The default aggregation function is the mean, but you can specify other functions as well. See ?na.aggregate for more information.
  • na.approx: Replaces the NAs with linearly interpolated values.
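The options named in the list can be tried on a small vector. The following is a minimal sketch, not part of the original answer: the vector `x` is made up for illustration, and the `maxgap` argument (a zoo parameter that limits how many consecutive NAs are filled) is an addition that the answer itself does not mention.

```r
library(zoo)

# Hypothetical toy series with a run of three consecutive NAs
x <- c(1, 2, NA, NA, NA, 6, 8)

# Next observation carried backward: the gap takes the value after it (6)
nocb <- na.locf(x, fromLast = TRUE)

# Aggregate with the median of the observed values instead of the mean
med <- na.aggregate(x, FUN = median)

# Only interpolate gaps of at most two consecutive NAs; the longer
# three-NA gap here is left as NA (na.rm = FALSE keeps the length)
capped <- na.approx(x, maxgap = 2, na.rm = FALSE)

print(nocb)
print(med)
print(capped)
```

Capping the gap length is one way to avoid filling long stretches with values that the surrounding data cannot really support; whether that is appropriate depends on the series.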

You can compare the outcomes to see what these functions do:

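library(zoo)
# na.rm = FALSE keeps the result the same length as the input,
# even if the series starts or ends with NA
df$V.loc <- na.locf(df$V2, na.rm = FALSE)    # last observation carried forward
df$V.agg <- na.aggregate(df$V2)              # overall mean of the observed values
df$V.app <- na.approx(df$V2, na.rm = FALSE)  # linear interpolation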

This results in:

> df
                    V1          V2       V.loc       V.agg       V.app
1  2015-04-26 23:00:00  5704.27389  5704.27389  5704.27389  5704.27389
2  2015-04-27 00:00:00  4470.30868  4470.30868  4470.30868  4470.30868
3  2015-04-27 01:00:00  4552.57242  4552.57242  4552.57242  4552.57242
4  2015-04-27 02:00:00  4570.22250  4570.22250  4570.22250  4570.22250
5  2015-04-27 03:00:00          NA  4570.22250  5454.64894  6602.01119
6  2015-04-27 04:00:00          NA  4570.22250  5454.64894  8633.79987
7  2015-04-27 05:00:00          NA  4570.22250  5454.64894 10665.58856
8  2015-04-27 06:00:00 12697.37724 12697.37724 12697.37724 12697.37724
9  2015-04-27 07:00:00  5538.71119  5538.71119  5538.71119  5538.71119
10 2015-04-27 08:00:00    81.95061    81.95061    81.95061    81.95061
11 2015-04-27 09:00:00  8550.65817  8550.65817  8550.65817  8550.65817
12 2015-04-27 10:00:00  2925.76573  2925.76573  2925.76573  2925.76573
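To see where the interpolated values come from, the gap can be measured and filled with base R alone. This is a rough sketch under stated assumptions: the vector `v` is a shortened, rounded copy of the values around the gap, and `approx()` from base R stands in for `na.approx` here rather than reproducing its implementation.

```r
# Rounded values around the gap in the question (illustrative copy)
v <- c(4552.57, 4570.22, NA, NA, NA, 12697.38, 5538.71)

# Length of each run of consecutive NAs
runs <- rle(is.na(v))
gap_lengths <- runs$lengths[runs$values]

# Linear interpolation over the positions of the observed values
idx <- seq_along(v)
obs <- !is.na(v)
filled <- approx(x = idx[obs], y = v[obs], xout = idx)$y

# The first filled value lies one quarter of the way from 4570.22
# to 12697.38, because a three-NA gap spans four equal steps
first_step <- 4570.22 + (12697.38 - 4570.22) / 4

print(gap_lengths)
print(round(filled, 2))
print(round(first_step, 2))
```

The same arithmetic explains the V.app column: each missing hour moves a quarter of the distance between the last observed value before the gap and the first one after it.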


Data used:

df <- structure(list(V1 = structure(c(1430082000, 1430085600, 1430089200, 1430092800, 1430096400, 1430100000, 1430103600, 1430107200, 1430110800, 1430114400, 1430118000, 1430121600), class = c("POSIXct", "POSIXt"), tzone = ""), V2 = c(5704.27388916016, 4470.30868326823, 4552.57241617839, 4570.22250032826, NA, NA, NA, 12697.3772408622, 5538.71119009654, 81.950606473287, 8550.65816895301, 2925.76573206584)), .Names = c("V1", "V2"), row.names = c(NA, -12L), class = "data.frame")


Addition:

The imputeTS and forecast packages provide further time-series functions for dealing with NAs, including some more advanced methods.

For example:

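library("imputeTS")
# Note: imputeTS 3.0 and later use underscore names (na_ma, na_kalman,
# na_interpolation); older releases spelled them na.ma, na.kalman,
# na.interpolation.

# Moving average imputation
na_ma(df$V2)

# Imputation via Kalman smoothing on structural time series models
na_kalman(df$V2)

# Interpolation with a choice of method (linear, spline, stine)
na_interpolation(df$V2)

library("forecast")

# Linear interpolation for non-seasonal series; for seasonal series,
# interpolation is applied after a seasonal decomposition
na.interp(df$V2)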
