data.table: replace NA with the mean for multiple columns, by id


Question

If I have the following data.table:

dat <- data.table("id"=c(1,1,1,1,2,2,2,2), "var1"=c(NA,1,2,2,1,1,2,2),
              "var2"=c(4,4,4,4,5,5,NA,4), "var3"=c(4,4,4,NA,5,5,5,4))
   id var1 var2 var3
1:  1   NA    4    4
2:  1    1    4    4
3:  1    2    4    4
4:  1    2    4   NA
5:  2    1    5    5
6:  2    1    5    5
7:  2    2   NA    5
8:  2    2    4    4

How can I replace the missing values with the mean of each column within each id? In my actual data I have many variables, and I only want to replace the NAs in some of them. How can this be done in a general way so that, for example, var3 is left untouched and only var1 and var2 are filled in?

tomean=c("var1", "var2")

I tried something like this, but it does not work:

dat[, (tomean) := mean(tomean, na.rm=TRUE), by=id, .SDcols = tomean]


Recommended answer

Since we only have the column names as strings, we can use get() to evaluate each name as the column it refers to, and lapply() to perform the operation over multiple columns.
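As a small illustration of why the name has to be evaluated first, the sketch below uses a toy table (the names toy and col_name are made up for this example, not taken from the question). Calling mean() on the character vector itself averages strings and returns NA, while get() looks the name up as a column inside j.

```r
library(data.table)

# Toy data for illustration only (hypothetical, not the question's dat)
toy <- data.table(id = c(1, 1, 2, 2), var1 = c(NA, 1, 2, 4))
col_name <- "var1"

# mean() of a character vector is NA (R also emits a warning, silenced here)
res_names <- suppressWarnings(mean(col_name, na.rm = TRUE))

# get() resolves the string to the column, so mean() sees the numbers per group
res_cols <- toy[, mean(get(col_name), na.rm = TRUE), by = id]
```

This is most likely what goes wrong in the attempt from the question: mean(tomean, ...) is handed the names rather than the columns.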

## determine the column names that contain NA values
nm <- names(dat)[colSums(is.na(dat)) != 0]
## replace with the mean - by 'id'
dat[, (nm) := lapply(nm, function(x) {
    x <- get(x)
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
}), by = id]

This gives the updated dat:

   id     var1     var2 var3
1:  1 1.666667 4.000000    4
2:  1 1.000000 4.000000    4
3:  1 2.000000 4.000000    4
4:  1 2.000000 4.000000    4
5:  2 1.000000 5.000000    5
6:  2 1.000000 5.000000    5
7:  2 2.000000 4.666667    5
8:  2 2.000000 4.000000    4

Update: With your updated question, to avoid running this over every column that contains NA, don't use nm. Just use your own vector tomean.

tomean <- c("var1", "var2")
dat[, (tomean) := lapply(tomean, function(x) {
    x <- get(x)
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
}), by = id]

which gives:

   id     var1     var2 var3
1:  1 1.666667 4.000000    4
2:  1 1.000000 4.000000    4
3:  1 2.000000 4.000000    4
4:  1 2.000000 4.000000   NA
5:  2 1.000000 5.000000    5
6:  2 1.000000 5.000000    5
7:  2 2.000000 4.666667    5
8:  2 2.000000 4.000000    4
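For comparison, here is a sketch of the same imputation written with .SD and .SDcols instead of get(). This is an alternative formulation, not the answer's own code. It rebuilds the question's dat so that it runs on its own, and the helper name impute_mean is made up for this example.

```r
library(data.table)

# Rebuild the question's data so the example is self-contained
dat <- data.table(id   = c(1, 1, 1, 1, 2, 2, 2, 2),
                  var1 = c(NA, 1, 2, 2, 1, 1, 2, 2),
                  var2 = c(4, 4, 4, 4, 5, 5, NA, 4),
                  var3 = c(4, 4, 4, NA, 5, 5, 5, 4))
tomean <- c("var1", "var2")

# Hypothetical helper: fill the NAs of one vector with that vector's mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# .SD holds only the columns listed in .SDcols, split by id,
# so var3 is never touched and keeps its NA
dat[, (tomean) := lapply(.SD, impute_mean), by = id, .SDcols = tomean]
```

Note that if every value of a column is NA within some id, mean(x, na.rm = TRUE) returns NaN for that group, so those cells end up as NaN rather than a number.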
