在data.table中按组填写缺失值 [英] Fill in missing values by group in data.table

查看:20
本文介绍了在data.table中按组填写缺失值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果要根据组内的先前/后验非 NA 观察来填充变量的缺失值,则 data.table 命令是

If one wants to fill in missing values of a variable based on previous/posterior non NA observation within a group, the data.table command is

setkey(DT,id,date)
DT[, value_filled_in := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]]

这是相当复杂的.很遗憾,因为 roll 是一个非常快速且强大的选项(尤其是与在每个组中应用诸如 zoo::na.locf 之类的函数相比)

which is quite complex. It's a shame since roll is a very fast and powerful option (esp compared with applying a function such as zoo::na.locf within each group)

我可以写一个方便的函数来填补缺失值

I can write a convenience function to fill in missing values

   fill_na <-  function(x , by = NULL, roll =TRUE , rollends= if (roll=="nearest") c(TRUE,TRUE)
             else if (roll>=0) c(FALSE,TRUE)
             else c(TRUE,FALSE)){
    id <- seq_along(x)
    if (is.null(by)){
      DT <- data.table("x" = x, "id" = id, key = "id") 
      return(DT[!is.na(x)][DT[, list(id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])

    } else{
      DT <- data.table("x" = x, "by" = by, "id" = id, key = c("by", "id")) 
      return(DT[!is.na(x)][DT[, list(by, id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
    }
  }

然后写

setkey(DT,id, date)
DT[, value_filled_in := fill_na(value, by = id)]

这不是很令人满意,因为一个人想写

This is not really satisfying since one would like to write

setkey(DT,id, date)
DT[, value_filled_in := fill_na(value), by = id]

但是,这需要大量时间来运行.而且,对于最终用户来说,知道应该使用 by 选项调用 fill_na 并且不应该使用 data.table .有没有一个优雅的解决方案?

However, this takes a huge amount of time to run. And, for the end-user, it is cumbersome to learn that fill_na should be called with the by option, and should not be used with data.table by. Is there an elegant solution around this?

一些速度测试

N <- 2e6
set.seed(1)
DT <- data.table(
         date = sample(10, N, TRUE),
           id = sample(1e5, N, TRUE),   
        value = sample(c(NA,1:5), N, TRUE),
       value2 = sample(c(NA,1:5), N, TRUE)                   
      )
setkey(DT,id,date)
DT<- unique(DT)

system.time(DT[, filled0 := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]])
#> user  system elapsed 
#>  0.086   0.006   0.105 
system.time(DT[, filled1 := zoo::na.locf.default(value, na.rm = FALSE), by = id])
#> user  system elapsed 
#> 5.235   0.016   5.274 
# (lower speed and no built in option like roll=integer or roll=nearest, rollend, etc)
system.time(DT[, filled2 := fill_na(value, by = id)])
#>   user  system elapsed 
#>  0.194   0.019   0.221 
system.time(DT[, filled3 := fill_na(value), by = id])
#>    user  system elapsed 
#> 237.256   0.913 238.405 

我为什么不直接使用 na.locf.default ?尽管速度差异并不重要,但对于其他类型的 data.table 命令(那些依赖于by"中的变量合并的命令)也会出现同样的问题 - 系统地忽略它们以获得更简单的语法.我也很喜欢所有的滚动选项.

Why don't I just use na.locf.default ? Even though the speed difference is not really important, the same issue arises for other kinds of data.table commands (those that rely on a merge by the variable in "by") - it's a shame to systematically ignore them in order to get an easier syntax. I also really like all the roll options.

推荐答案

现在有一个原生的 data.table 方法来填充缺失值(从 1.12.4).

There is now a native data.table way of filling missing values (as of 1.12.4).

这个问题催生了一个 github 问题,该问题最近在创建时关闭nafillsetnafill 函数.您现在可以使用

This question spawned a github issue which was recently closed with the creation of functions nafill and setnafill. You can now use

DT[, value_filled_in := nafill(value, type = "locf")]

也可以用一个常数值或返回的下一个观察值填充 NA.

It is also possible to fill NA with a constant value or next observation carried back.

问题中方法的一个区别是这些函数目前仅适用于 NA 而不是 NaNis.naTRUE for NaN - 这是 计划中的 将在下一个版本中通过一个额外的参数进行修复.

One difference to the approach in the question is that these functions currently only work on NA not NaN whereas is.na is TRUE for NaN - this is planned to be fixed in the next release through an extra argument.

我没有参与该项目,但我看到虽然 github issue 链接在这里,但没有其他链接,所以我代表未来的访问者回答.

I have no involvement with the project but I saw that although the github issue links here, there was no link the other way so I'm answering on behalf of future visitors.

更新:默认情况下 NaN 现在被视为与 NA 相同.

Update: By default NaN is now treated same as NA.

这篇关于在data.table中按组填写缺失值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆