Fill in time series gaps with both LOCF and NOCB methods but acknowledge breaks in time series


Question

There are edits at the end of this post.

I have a large dataset of daily dietary records for a population of individuals. Data are missing at random for each individual. Here is an example for one individual (I will eventually generalize the solution to the whole population):

> str(final_daily)
'data.frame':   387 obs. of  10 variables:
 $ Date             : chr  "2014-08-13" "2014-08-14" "2014-08-15" "2014-08-16" ...
 $ MEID.1           : Factor w/ 97 levels "","1","1.1","1.1a",..: NA NA NA 17 24 NA NA NA NA NA ...
 $ MEID.2           : Factor w/ 184 levels "1","100","100.1",..: NA NA NA 143 48 NA NA NA NA NA ...
 $ MEID.3           : Factor w/ 180 levels "100","100.1",..: NA NA NA 24 134 NA NA NA NA NA ...
 $ MEID.4           : Factor w/ 42 levels "173","173a","173b",..: NA NA NA 17 1 NA NA NA NA NA ...
 $ MEID.5           : Factor w/ 3 levels "d1","s1","s2": NA NA NA 2 3 NA NA NA NA NA ...
 $ MEID.6           : Factor w/ 1 level "s2": NA NA NA NA NA NA NA NA NA NA ...
 $ DAYT             : int  NA NA NA 1 8 NA NA NA NA NA ...
 $ DATT             : int  NA NA NA 1 1 NA NA NA NA NA ...
 $ Reason.For.Change: chr  "0" "0" "0" "0" ...

I am aware of implementations that can fill in missing data, such as last observation carried forward (LOCF) and next observation carried backward (NOCB). Importantly, a gap in the data can be as short as a single date or as long as several months.

I would like to create an imputation method that uses LOCF for the first half of a missing time period and NOCB for the second half. This matters most for large gaps (I don't want dietary intake on February 28 to stand in for August 1 when a record for August 2 is available). Can anyone suggest a possible solution?

Importantly, I also have a column (Reason.For.Change) which should constrain the imputation, as in "Filling in missing (blanks) in a data table, per category - backwards and forwards". When Reason.For.Change has a value >0, the imputation should recognize this. In other words, a Reason.For.Change value >0 marks a "different" time series within an individual, starting on the day where Reason.For.Change is >0, and these time series must be imputed separately.

Essentially, this column creates two conditions. First, when a record is not available on the date prior to a date where Reason.For.Change is >0, only LOCF can be used. Second, since a record of diet intake is not available on the same date that Reason.For.Change is >0, only NOCB can be used there. (This second case is analogous to the example in "Filling in missing (blanks) in a data table, per category - backwards and forwards", where patients are missing 'doctor' on their first visit.)

I would appreciate any suggestions or guidance on accomplishing the following, which I have summarized:

  1. Imputation method for time series gaps that uses LOCF for the first 50% of the gap and NOCB for the last 50%
  2. Imputation method as in 1) that acknowledges breaks in the time series, denoted by values >0 on a date, and allows LOCF up to the 'break date' and NOCB filling back to and including the break date

After thinking some more, the implementations in "R -- Carry last observation forward n times" and "Fill NA in a time series only to a limited number" seem to offer a step toward addressing 1) in my question. However, I would like to generalize their use of LOCF n times to LOCF for length(missing data)/2 ...

After thinking even more, I have added a new column to my dataframe, GAP_Days, which counts the days within each missing time period (gap). Here is str() on the data after the new column was added.

> str(final_daily_intake2)
'data.frame':   387 obs. of  11 variables:
 $ Date             : chr  "2014-08-13" "2014-08-14" "2014-08-15" "2014-08-16" ...
 $ MEID.1           : chr  NA NA NA "14" ...
 $ MEID.2           : Factor w/ 184 levels "1","100","100.1",..: NA NA NA 143 48 NA NA NA NA NA ...
 $ MEID.3           : Factor w/ 180 levels "100","100.1",..: NA NA NA 24 134 NA NA NA NA NA ...
 $ MEID.4           : Factor w/ 42 levels "173","173a","173b",..: NA NA NA 17 1 NA NA NA NA NA ...
 $ MEID.5           : Factor w/ 3 levels "d1","s1","s2": NA NA NA 2 3 NA NA NA NA NA ...
 $ MEID.6           : Factor w/ 1 level "s2": NA NA NA NA NA NA NA NA NA NA ...
 $ DAYT             : int  NA NA NA 1 8 NA NA NA NA NA ...
 $ DATT             : int  NA NA NA 1 1 NA NA NA NA NA ...
 $ Reason.For.Change: chr  "0" "0" "0" "0" ...
 $ GAP_Days         : chr  "1" "2" "3" "NA" ...
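
The post does not show how GAP_Days was built. As a rough sketch only, a counter of this kind could be derived in base R with rle(); the function name gap_day_index and the toy vector are assumptions made for illustration, and the result is an integer vector rather than the character column shown above.

```r
# Hypothetical helper (not from the original post): number the days
# inside each run of missing values, leaving observed days as NA.
gap_day_index <- function(x) {
  miss <- is.na(x)
  runs <- rle(miss)
  # sequence() restarts the count at 1 for every run (missing or not)
  idx <- sequence(runs$lengths)
  # keep the counter only where the value is actually missing
  idx[!miss] <- NA_integer_
  idx
}

# Toy stand-in for one intake column such as DAYT
dayt <- c(NA, NA, NA, 1, 8, NA, NA, 3)
gap_days <- gap_day_index(dayt)
print(gap_days)
```

The total length of each gap, which is what the half-and-half split needs, can be read from the same rle() result via runs$lengths.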

I was thinking that this could be used to determine the number of days, n, on which to use LOCF in each gap period. For example, in the first missing time period there are 3 days missing (hence 1, 2, 3 in the str() output for GAP_Days). Since this is an odd number of days, I would like LOCF to use the result of round(3 * 0.5), which is 2, as its input. In a longer gap, for example one where GAP_Days reaches 30, LOCF would use the result of round(30 * 0.5), so that LOCF would be applied for 15 days.
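
One detail worth checking in this arithmetic: R's round() rounds halves to the even digit, so it gives 2 for a 3-day gap but also 2 for a 5-day gap. If the intent is that LOCF always takes the extra day of an odd-length gap, ceiling() may be the closer match. The gap lengths below are made-up values for comparison only.

```r
# Made-up gap lengths, chosen to show how the two functions differ
gap_lengths <- c(1, 3, 5, 30)

# round() goes to the even digit at exact halves: 0.5 -> 0, 1.5 -> 2, 2.5 -> 2
n_locf_round <- round(gap_lengths * 0.5)

# ceiling() always gives LOCF the extra day of an odd-length gap
n_locf_ceiling <- ceiling(gap_lengths * 0.5)

print(rbind(gap_lengths, n_locf_round, n_locf_ceiling))
```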

I think this approach can be used to go over the dataframe once with LOCF, and then a second time with NOCB. (Although I still haven't addressed the need to acknowledge breaks in the time series denoted by Reason.For.Change.) Many thanks.
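
A minimal sketch of the half-and-half idea for a single vector, written in base R. The function name fill_gap_halves is hypothetical, and two choices here are assumptions rather than anything stated in the post: ceiling() decides how many days go to LOCF, and gaps at the very start or end of the vector fall back to NOCB only or LOCF only. It works in one pass over the gaps instead of two passes over the data.

```r
# Hypothetical helper: LOCF for the first half of each gap, NOCB for the rest.
fill_gap_halves <- function(x) {
  n <- length(x)
  miss <- is.na(x)
  # nothing to fill, or nothing to fill from
  if (!any(miss) || all(miss)) return(x)

  runs <- rle(miss)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L
  out <- x

  for (i in which(runs$values)) {          # loop over the runs of NA only
    s <- starts[i]; e <- ends[i]; len <- runs$lengths[i]
    has_prev <- s > 1L                     # an observation exists before the gap
    has_next <- e < n                      # an observation exists after the gap
    # number of days filled by LOCF (assumption: odd gaps give LOCF the extra day)
    n_locf <- if (!has_prev) 0 else if (!has_next) len else ceiling(len / 2)
    if (n_locf > 0)   out[s:(s + n_locf - 1)] <- x[s - 1L]  # carry forward
    if (n_locf < len) out[(s + n_locf):e]     <- x[e + 1L]  # carry backward
  }
  out
}

x <- c(NA, 5, NA, NA, NA, 9, NA, NA, 4, NA)
filled <- fill_gap_halves(x)
print(filled)
```

In the toy vector, the 3-day gap between 5 and 9 gets two days of LOCF and one of NOCB, and the 2-day gap between 9 and 4 is split evenly.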

Recommended answer

Since the text is very long, I'll restate the questions:

  1. Imputation method for time series gaps that uses LOCF for the first 50% of the gap and NOCB for the last 50%
  2. Imputation method as in 1) that acknowledges breaks in the time series, denoted by values >0 on a date, and allows LOCF up to the 'break date' and NOCB filling back to and including the break date

As far as I know, there is no R package that directly lets you do either of these tasks.

To 1): Quite a few packages contain a LOCF option:

  • imputeTS::na.locf()
  • zoo::na.locf()
  • xts::na.locf()
  • spacetime::na.locf()

Your idea for imputation does make sense, but none of these packages has an option for the behavior you want. What you can do, for example with zoo, is set the maxgap parameter: runs of more than maxgap NAs are then left unfilled, which means you can (and must) treat them separately afterwards. You would have to program the requested behavior yourself.
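
As a small illustration of the maxgap behavior, assuming the zoo package is installed: the toy vector and the choice of maxgap = 2 are made up for the example. The fromLast argument turns the same function into NOCB.

```r
library(zoo)  # assumes zoo is installed

# Toy series with one short gap (1 NA) and one long gap (4 NAs)
x <- c(1, NA, 2, NA, NA, NA, NA, 6)

# LOCF only across gaps of at most 2 missing values; longer runs stay NA.
# na.rm = FALSE keeps the vector at its original length.
short_filled <- na.locf(x, maxgap = 2, na.rm = FALSE)

# NOCB is the same function run from the other end of the series
nocb <- na.locf(x, fromLast = TRUE, na.rm = FALSE)

print(short_filled)
print(nocb)
```

The long run that maxgap leaves untouched is the part that would then need custom handling, such as the half-and-half split asked about in the question.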

Another idea would be to use the more advanced functions in these packages, which make use of both sides of an NA gap.

An example would be imputeTS::na.ma(), which imputes the values with a moving average (you can set the window size).

There are also even more advanced functions, such as:

  • imputeTS::na.kalman()
  • imputeTS::na.interpolation()
  • forecast::na.interp()
  • zoo::na.StructTS()

These can also take into account seasonal behavior (weekday patterns), trend, and other things. The drawback, of course, is that they are not as easy to reason about as simple algorithms like LOCF or a moving average.

To 2): There is no ready-made function for this either. It would also have to be coded by hand.
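
One way such hand-written code might look, as a sketch under stated assumptions rather than a tested solution for the full dataset: each Reason.For.Change value >0 is taken to start a new segment, and gaps are filled within segments only, so LOCF stops the day before a break and NOCB fills back to and including the break date. The names fill_gap_halves and fill_by_segment, the toy data, and the use of ceiling() for odd-length gaps are all hypothetical choices.

```r
# Hypothetical helper: LOCF for the first half of each gap, NOCB for the rest.
# Gaps at the start of a vector get NOCB only; gaps at the end get LOCF only.
fill_gap_halves <- function(x) {
  n <- length(x)
  miss <- is.na(x)
  if (!any(miss) || all(miss)) return(x)
  runs <- rle(miss)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L
  for (i in which(runs$values)) {
    s <- starts[i]; e <- ends[i]; len <- runs$lengths[i]
    n_locf <- if (s == 1L) 0 else if (e == n) len else ceiling(len / 2)
    if (n_locf > 0)   x[s:(s + n_locf - 1)] <- x[s - 1L]
    if (n_locf < len) x[(s + n_locf):e]     <- x[e + 1L]
  }
  x
}

# Hypothetical wrapper: fill each segment separately.
# Assumes `reason` has no missing values and can be converted to numbers.
fill_by_segment <- function(x, reason) {
  # the segment id increases by one on every day where reason > 0
  segment <- cumsum(as.numeric(reason) > 0)
  ave(x, segment, FUN = fill_gap_halves)
}

# Toy data: a break on day 4, with no intake recorded on the break date itself
daily <- data.frame(
  DAYT = c(3, NA, NA, NA, NA, 7, NA, 2),
  Reason.For.Change = c("0", "0", "0", "1", "0", "0", "0", "0"),
  stringsAsFactors = FALSE
)
daily$DAYT_filled <- fill_by_segment(daily$DAYT, daily$Reason.For.Change)
print(daily)
```

In the toy data, days 2 and 3 are filled from day 1 by LOCF only, because the break on day 4 ends that segment, while days 4 and 5 are filled backward from day 6. The same call could be repeated for each intake column and each individual.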
