按组有条件NA填充 [英] Conditional NA filling by group

查看:80
本文介绍了按组有条件NA填充的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

修改
该问题最初是要求输入data.table的.带有任何包装的解决方案都会很有趣.

edit
The question was originally asked for data.table. A solution with any package would be interesting.

我对一个更普遍的问题的特定变化有些困惑.我有与data.table一起使用的面板数据,我想使用group.data.table的功能来填写一些缺少的值.不幸的是,它们不是数字的,因此我不能简单地插值,而只能根据条件进行填充.是否有可能在data.tables中执行一种条件na.locf?

I am a little stuck with a particular variation of a more general problem. I have panel data that I am using with data.table and I would like to fill in some missing values using the group by functionality of data.table. Unfortunately they are not numeric, so I can't simply interpolate, but they should only be filled in based on a condition. Is it possible to perform a kind of conditional na.locf in data.tables?

基本上,我只想填写NA,前提是在NA之后的下一个观察值是先前的观察值,尽管更普遍的问题是如何有条件地填写NA.

Essentially I only want to fill in the NAs if after the NAs the next observation is the previous ones, though the more general question is how to conditionally fill in NAs.

例如,在以下数据中,我想按每个id组填写associatedid变量.因此,id==1year==2003将作为ABC123填写,因为它在NA之前和之后的值,但对于相同的id则不是2000. id== 2不会更改,因为下一个值与NA之前的值不同. id==3将填写2003年和2004年的内容.

For example, in the following data I would like to fill in the associatedid variable by each id group. So id==1 , year==2003 would fill in as ABC123 because its the value before and after the NA, but not 2000 for the same id. id== 2 would not be changed because the next value is not the same as the one prior to the NAs. id==3 would fill in for 2003 and 2004.

mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L), associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 1L, 1L, NA, NA, 1L), .Label = c("ABC123", "DEF456"), class = "factor")), class = "data.frame", row.names = c(NA, -18L))

mydf
#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003         <NA>
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003         <NA>
#> 17  3 2004         <NA>
#> 18  3 2005       ABC123

dt = data.table(mydf, key = c("id"))

所需的输出

#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003       ABC123
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003       ABC123
#> 17  3 2004       ABC123
#> 18  3 2005       ABC123

推荐答案

这都是关于编写经过修改的na.locf函数的.之后,您可以将其插入到data.table中,就像其他任何函数一样.

This is all about writing a modified na.locf function. After that you can plug it into data.table like any other function.

new.locf <- function(x){
  # might want to think about the end of this loop
  # this works here but you might need to add another case
  # if there are NA's as the last value.
  #
  # anyway, loop through observations in a vector, x.
  for(i in 2:(length(x)-1)){
    nextval = i
    # find the next, non-NA value
    # again, not tested but might break if there isn't one?
    while(nextval <= length(x)-1 & is.na(x[nextval])){
      nextval = nextval + 1
    }
    # if the current value is not NA, great!
    if(!is.na(x[i])){
      x[i] <- x[i]
    }else{
      # if the current value is NA, and the last value is a value
      # (should given the nature of this loop), and
      # the next value, as calculated above, is the same as the last
      # value, then give us that value. 
      if(is.na(x[i]) & !is.na(x[i-1]) & x[i-1] == x[nextval]){
        x[i] <- x[nextval]
      }else{
        # finally, return NA if neither of these conditions hold
        x[i] <- NA
      }
    }
  }
  # return the new vector
  return(x) 
}

一旦有了该功能,便可以照常使用data.table:

Once we have that function, we can use data.table as usual:

dt2 <- dt[,list(year = year,
                # when I read your data in, associatedid read as factor
                associatedid = new.locf(as.character(associatedid))
                ),
          by = "id"
          ]

这将返回:

> dt2
    id year associatedid
 1:  1 2000           NA
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000           NA
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003           NA
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000           NA
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003       ABC123
17:  3 2004       ABC123
18:  3 2005       ABC123

据我所知,这正是您所寻找的.

which is what you are looking for as best I understand it.

我在new.locf定义中提供了一些套期保值,因此您可能仍需要考虑一下,但这应该可以帮助您入门.

I provided some hedging in the new.locf definition so you still might have a little thinking to do but this should get you started.

这篇关于按组有条件NA填充的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆