在data.table中按组填写缺少的值 [英] Fill in missing values by group in data.table
问题描述
如果希望根据组内的先前/后期非NA观察填充变量的缺失值,则data.table命令为
If one wants to fill in missing values of a variable based on previous/posterior non NA observation within a group, the data.table command is
setkey(DT,id,date)
DT[, value_filled_in := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]]
这是一个耻辱,因为 roll
是一个非常快速和强大的选项(esp相比,应用一个函数如 zoo :: na.locf
每组内)
which is quite complex. It's a shame since roll
is a very fast and powerful option (esp compared with applying a function such as zoo::na.locf
within each group)
我可以写一个便利函数来填充缺失值
I can write a convenience function to fill in missing values
fill_na <- function(x , by = NULL, roll =TRUE , rollends= if (roll=="nearest") c(TRUE,TRUE)
else if (roll>=0) c(FALSE,TRUE)
else c(TRUE,FALSE)){
id <- seq_along(x)
if (is.null(by)){
DT <- data.table("x" = x, "id" = id, key = "id")
return(DT[!is.na(x)][DT[, list(id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
} else{
DT <- data.table("x" = x, "by" = by, "id" = id, key = c("by", "id"))
return(DT[!is.na(x)][DT[, list(by, id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
}
}
$ b b
然后写
And then write
setkey(DT,id, date)
DT[, value_filled_in := fill_na(value, by = id)]
写入
setkey(DT,id, date)
DT[, value_filled_in := fill_na(value), by = id]
但是,这需要大量的时间运行。对于最终用户,学习使用 by
选项调用 fill_na
很麻烦,不应与 data.table
通过
一起使用。有没有一个优雅的解决方案?
However, this takes a huge amount of time to run. And, for the end-user, it is cumbersome to learn that fill_na
should be called with the by
option, and should not be used with data.table
by
. Is there an elegant solution around this?
一些速度测试
N <- 2e6
set.seed(1)
DT <- data.table(
date = sample(10, N, TRUE),
id = sample(1e5, N, TRUE),
value = sample(c(NA,1:5), N, TRUE),
value2 = sample(c(NA,1:5), N, TRUE)
)
setkey(DT,id,date)
DT<- unique(DT)
system.time(DT[, filled0 := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]])
#> user system elapsed
#> 0.086 0.006 0.105
system.time(DT[, filled1 := zoo::na.locf.default(value, na.rm = FALSE), by = id])
#> user system elapsed
#> 5.235 0.016 5.274
# (lower speed and no built in option like roll=integer or roll=nearest, rollend, etc)
system.time(DT[, filled2 := fill_na(value, by = id)])
#> user system elapsed
#> 0.194 0.019 0.221
system.time(DT[, filled3 := fill_na(value), by = id])
#> user system elapsed
#> 237.256 0.913 238.405
为什么我不使用 na.locf.default
?即使速度差异不是很重要,同样的问题出现在其他种类的data.table命令(依靠by中的变量的合并) - 它是一个耻辱,系统地忽略它们,以获得一个更容易的语法。
Why don't I just use na.locf.default
? Even though the speed difference is not really important, the same issue arises for other kinds of data.table commands (those that rely on a merge by the variable in "by") - it's a shame to systematically ignore them in order to get an easier syntax. I also really like all the roll options.
推荐答案
这里有一个更快,更紧凑的方式+):
Here's a slightly faster and more compact way of doing it (version 1.9.3+):
DT[, filled4 := DT[!is.na(value)][DT, value, roll = T]]
这篇关于在data.table中按组填写缺少的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!