data.table: replace NA with mean for multiple columns and by id
Question
If I have the following data.table:
library(data.table)
dat <- data.table("id"=c(1,1,1,1,2,2,2,2), "var1"=c(NA,1,2,2,1,1,2,2),
"var2"=c(4,4,4,4,5,5,NA,4), "var3"=c(4,4,4,NA,5,5,5,4))
id var1 var2 var3
1: 1 NA 4 4
2: 1 1 4 4
3: 1 2 4 4
4: 1 2 4 NA
5: 2 1 5 5
6: 2 1 5 5
7: 2 2 NA 5
8: 2 2 4 4
How can I replace the missing values with the mean of each column within id? In my actual data I have many variables, and I only want to fill in the missing values for some of them. How can this be done in a general way so that, for example, var3 is left untouched and only var1 and var2 are replaced?
tomean=c("var1", "var2")
I tried something like this, but I haven't found a solution:
dat[, (tomean) := mean(tomean, na.rm=TRUE), by=id, .SDcols = tomean]
Recommended answer
To evaluate a column when all we have is its name as a string, we can use get(). We also need lapply() to perform the operation over multiple columns.
## determine the column names that contain NA values
nm <- names(dat)[colSums(is.na(dat)) != 0]
## replace with the mean - by 'id'
dat[, (nm) := lapply(nm, function(x) {
x <- get(x)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
}), by = id]
This gives the updated dat:
id var1 var2 var3
1: 1 1.666667 4.000000 4
2: 1 1.000000 4.000000 4
3: 1 2.000000 4.000000 4
4: 1 2.000000 4.000000 4
5: 2 1.000000 5.000000 5
6: 2 1.000000 5.000000 5
7: 2 2.000000 4.666667 5
8: 2 2.000000 4.000000 4
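As a side note on why get() is needed here, the following minimal sketch uses a small hypothetical table (dt and col are illustrative names, not part of the question). Passing the column name itself to mean() operates on a character string, which is what the attempt in the question does; get() looks the column up by name instead.

```r
library(data.table)

# Hypothetical one-column table, for illustration only
dt <- data.table(a = c(1, NA, 3))
col <- "a"

# mean() of the string "a" is not numeric: it returns NA with a warning
wrong <- suppressWarnings(mean(col, na.rm = TRUE))

# get() resolves the string to the column inside j
res <- dt[, mean(get(col), na.rm = TRUE)]
```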
Update: Given your updated question, to avoid running this over every column that contains NA, don't use nm. Just use your own vector tomean.
tomean <- c("var1", "var2")
dat[, (tomean) := lapply(tomean, function(x) {
x <- get(x)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
}), by = id]
This gives:
id var1 var2 var3
1: 1 1.666667 4.000000 4
2: 1 1.000000 4.000000 4
3: 1 2.000000 4.000000 4
4: 1 2.000000 4.000000 NA
5: 2 1.000000 5.000000 5
6: 2 1.000000 5.000000 5
7: 2 2.000000 4.666667 5
8: 2 2.000000 4.000000 4
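An alternative sketch, closer to the attempt in the question, passes the columns through .SDcols and applies a helper to .SD. The helper name impute_mean is an assumption made for illustration; it is not part of data.table. This assumes the target columns are already of type double, as they are here; integer columns may need converting first, because the group mean is generally not an integer.

```r
library(data.table)

dat <- data.table(id   = c(1, 1, 1, 1, 2, 2, 2, 2),
                  var1 = c(NA, 1, 2, 2, 1, 1, 2, 2),
                  var2 = c(4, 4, 4, 4, 5, 5, NA, 4),
                  var3 = c(4, 4, 4, NA, 5, 5, 5, 4))

tomean <- c("var1", "var2")

# Hypothetical helper: fill the NAs of a vector with the mean of its non-missing values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# .SD holds only the columns named in .SDcols, split by id
dat[, (tomean) := lapply(.SD, impute_mean), by = id, .SDcols = tomean]
```

Because var3 is not listed in tomean, its NA is left in place.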