Length of Trend - Panel Data


Problem description


I have a well-balanced panel data set which contains NA observations. I will be using LOCF, and would like to know how many consecutive NAs are in each panel before carrying observations forward. LOCF is a procedure whereby missing values can be "filled in" using the "last observation carried forward". This can make sense in some time-series applications; perhaps we have weather data in 5-minute increments: a good guess at the value of a missing observation might be an observation made 5 minutes earlier.

Obviously, it makes more sense to carry an observation forward one hour within one panel than it does to carry that same observation forward to the next year in the same panel.
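
To make the idea concrete, here is a minimal base R sketch of LOCF with a cap on how long a gap may be filled. The helper name `locf_maxgap` and its `maxgap` parameter are hypothetical, chosen only for illustration; this is not the zoo implementation and may differ from it in edge cases.

```r
# Hypothetical helper (not part of zoo or data.table): carry the last
# observation forward, but leave NA runs longer than `maxgap` untouched.
locf_maxgap <- function(x, maxgap = Inf) {
  seen <- cumsum(!is.na(x))                # number of non-NA values seen so far
  filled <- c(NA, x[!is.na(x)])[seen + 1]  # plain LOCF; leading NAs stay NA
  runs <- rle(is.na(x))
  too_long <- rep(runs$values & runs$lengths > maxgap, runs$lengths)
  filled[too_long] <- NA                   # undo the fill inside long gaps
  filled
}

x <- c(1, NA, 2, NA, NA, NA, 3)
locf_maxgap(x)              # every gap is filled
locf_maxgap(x, maxgap = 2)  # the run of three NAs is left as NA
```

Knowing the longest NA run per panel is what tells you whether such a cap would ever bite.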

I am aware that you can set a "maxgap" argument using zoo::na.locf; however, I want to get a better feel for my data. Please see a simple example:

library(data.table)
set.seed(12345)

### Create a "panel" data set: 10 ids, each observed on the same 10 days
data <- data.table(id = rep(1:10, each = 10),
                   date = rep(seq(as.POSIXct('2012-01-01'),
                                  as.POSIXct('2012-01-10'),
                                  by = '1 day'),
                              times = 10),
                   x = runif(100))

### Randomly assign NA's to our "x" variable
na <- sample(1:100, size = 52)
data[na, x := NA]

### Calculate the max number of consecutive NA's by group...this is what I want:
### ID       Consecutive NA's
  #  1       1
  #  2       3
  #  3       3
  #  4       3
  #  5       4
  #  6       5
  #  ...
  #  10      2

### Count the total number of NA's by group...this is as far as I get:
data[is.na(x), .N, by = id]

All solutions are welcome, but data.table solutions are highly preferred; the data file is large.

Solution

This will do it:

data[, max(with(rle(is.na(x)), lengths[values])), by = id]

I just ran rle to find all consecutive NA's and picked the max length.
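
To see what `rle` contributes here, it may help to run the same expression on a single small vector. The vector below is made up for illustration, and `max_na_run` is a hypothetical helper name, not something from the original answer; it adds a guard for groups that contain no NA at all, where `max()` of an empty vector would otherwise return `-Inf` with a warning.

```r
x <- c(0.4, NA, NA, 0.7, NA, NA, NA, 0.1)

runs <- rle(is.na(x))
runs$lengths                    # 1 2 1 3 1
runs$values                     # FALSE TRUE FALSE TRUE FALSE
runs$lengths[runs$values]       # 2 3 : lengths of the NA runs only
max(runs$lengths[runs$values])  # 3

# Hypothetical helper: same idea, but returns 0L when x has no NAs.
max_na_run <- function(x) {
  runs <- rle(is.na(x))
  max(0L, runs$lengths[runs$values])
}

max_na_run(x)           # 3
max_na_run(c(1, 2, 3))  # 0
```

Assuming the `data` table from the question, the helper could be applied per panel as `data[, .(max_na = max_na_run(x)), by = id]`.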


Here's a rather convoluted answer to the comment question of recovering the date ranges for the above max:

data[, {
         tmp <- rle(is.na(x))
         tmp$lengths[!tmp$values] <- 0   # zero out the non-NA runs
         n <- which.max(tmp$lengths)     # index in the rle of the longest NA run

         tmp <- rle(is.na(x))                    # get back to the unmodified rle
         start <- sum(tmp$lengths[0:(n-1)]) + 1  # first row of that run
         end   <- sum(tmp$lengths[1:n])          # last row of that run

         list(date[start], date[end], max(tmp$lengths[tmp$values]))
       }, by = id]
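
The same start and end positions can also be read off a cumulative sum of the run lengths, which avoids computing `rle` twice. The following is a sketch on a plain vector rather than a drop-in replacement for the answer above; `longest_na_run` is a hypothetical helper name, and it assumes `x` has at least one element.

```r
# Hypothetical helper: first and last position of the longest NA run in x.
# Assumes x has at least one element.
longest_na_run <- function(x) {
  runs   <- rle(is.na(x))
  ends   <- cumsum(runs$lengths)        # last position of every run
  starts <- ends - runs$lengths + 1L    # first position of every run
  na_len <- runs$lengths * runs$values  # run lengths, non-NA runs zeroed
  n <- which.max(na_len)                # first of the longest NA runs
  if (na_len[n] == 0L) {                # x contains no NA at all
    return(list(start = NA_integer_, end = NA_integer_, length = 0L))
  }
  list(start = starts[n], end = ends[n], length = na_len[n])
}

x <- c(0.4, NA, NA, 0.7, NA, NA, NA, 0.1)
longest_na_run(x)  # start 5, end 7, length 3
```

Within a grouped data.table call, the returned positions could then be used to index `date`, for example `data[, {r <- longest_na_run(x); list(date[r$start], date[r$end], r$length)}, by = id]`.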
