Conditional NA filling by group
Question
Edit: The question was originally asked for data.table. A solution with any package would be interesting.
I am a little stuck on a particular variation of a more general problem. I have panel data that I am using with data.table, and I would like to fill in some missing values using data.table's group-by functionality. Unfortunately the values are not numeric, so I can't simply interpolate; they should only be filled in when a condition holds. Is it possible to perform a kind of conditional na.locf in data.table?
Essentially, I only want to fill in a run of NAs if the next non-NA observation after it is the same as the last observation before it, though the more general question is how to fill in NAs conditionally.
For example, in the following data I would like to fill in the associatedid variable within each id group. So id==1, year==2003 would be filled in as ABC123 because that is the value both before and after the NA, but year==2000 for the same id would not be filled. id==2 would not be changed because the next value is not the same as the one prior to the NA. id==3 would be filled in for 2003 and 2004.
mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L), associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 1L, 1L, NA, NA, 1L), .Label = c("ABC123", "DEF456"), class = "factor")), class = "data.frame", row.names = c(NA, -18L))
mydf
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 <NA>
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 <NA>
#> 17 3 2004 <NA>
#> 18 3 2005 ABC123
library(data.table)
dt = data.table(mydf, key = c("id"))
Desired output
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
Recommended answer
This is all about writing a modified na.locf function. After that you can plug it into data.table like any other function.
new.locf <- function(x){
  n <- length(x)
  # edge cases (very short vectors, NAs at the end) are guarded below,
  # but you should still check the behaviour against your own data.
  # with fewer than three values there is no interior element to fill
  if(n < 3) return(x)
  # loop through the interior observations of the vector x;
  # the first and last values are never filled.
  for(i in 2:(n-1)){
    # only NA values need any work
    if(is.na(x[i])){
      # find the next non-NA value after position i
      # (stops at the last element if there isn't one)
      nextval <- i
      while(nextval < n && is.na(x[nextval])){
        nextval <- nextval + 1
      }
      # fill only if the previous value is known, a later non-NA value
      # exists, and the two are the same; otherwise leave the NA in place.
      # earlier fills carry forward, so a whole run of NAs gets filled.
      if(!is.na(x[i-1]) && !is.na(x[nextval]) && x[i-1] == x[nextval]){
        x[i] <- x[nextval]
      }
    }
  }
  # return the new vector
  return(x)
}
Once we have that function, we can use data.table as usual:
dt2 <- dt[, list(year = year,
                 # when I read your data in, associatedid read as factor
                 associatedid = new.locf(as.character(associatedid))
                 ),
          by = "id"
          ]
This returns:
> dt2
id year associatedid
1: 1 2000 NA
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 NA
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 NA
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 NA
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
which is what you are looking for as best I understand it.
I provided some hedging in the new.locf definition so you still might have a little thinking to do but this should get you started.
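Since the question is open to solutions with any package, here is a loop-free variant as a rough sketch in base R. It is not part of the original answer: the helper name fill_if_surrounded and the small example data frame are made up for illustration. The idea is to carry the last non-NA value forward and the next non-NA value backward, then fill only the positions where the two agree.

```r
# Hypothetical helper (an assumption, not from the original answer):
# fill an NA only when the nearest non-NA values on both sides are equal.
fill_if_surrounded <- function(x) {
  # last observation carried forward, base R only
  locf <- function(v) {
    # index of the most recent non-NA position so far (0 if none yet)
    idx <- cummax(seq_along(v) * !is.na(v))
    c(NA, v)[idx + 1]
  }
  fwd <- locf(x)            # nearest non-NA value at or before each position
  bwd <- rev(locf(rev(x)))  # nearest non-NA value at or after each position
  # positions where both neighbours exist and agree
  agree <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
  x[agree] <- fwd[agree]
  x
}

# Illustrative data with the same shape as the question's example
df <- data.frame(
  id = rep(1:3, each = 6),
  year = rep(2000:2005, times = 3),
  associatedid = c(NA, "ABC123", "ABC123", NA, "ABC123", "ABC123",
                   NA, "ABC123", "ABC123", NA, "DEF456", "DEF456",
                   NA, "ABC123", "ABC123", NA, NA, "ABC123"),
  stringsAsFactors = FALSE
)

# Apply within each id so values never leak across groups
df$filled <- ave(df$associatedid, df$id, FUN = fill_if_surrounded)
```

The grouping matters: without it, the NA in the first row of id 2 would sit between two ABC123 values from different ids and be filled by mistake. With data.table the same helper should plug in as dt[, associatedid := fill_if_surrounded(as.character(associatedid)), by = id].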