Replacing all missing values in R data.table with a value
Question
If you have an R data.table that has missing values, how do you replace all of them with, say, the value 0? E.g.
aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,4,4))
bb = data.table(V1=3:6,X=letters[1:4])
setkey(aa,V1)
setkey(bb,V1)
tt = bb[aa]
V1 X V2
1: 1 NA 1
2: 2 NA 2
3: 3 a 2
4: 4 b 3
5: 5 c 3
6: 6 d 3
7: 7 NA 4
8: 8 NA 4
9: 9 NA 4
10: 10 NA 4
Is there any way to do this in one line? If it were just a matrix, you could just do:
tt[is.na(tt)] = 0
Recommended answer
is.na (being a primitive) has very little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.
Using <- to assign will result in a copy of all the columns, and this is not the idiomatic way of using data.table.
First I'll illustrate how to do it, and then show how slow the copying approach can get on huge data:
for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
You'll get a warning here that 0 is being coerced to character to match the type of column X (a character column). You can ignore it.
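If you would rather avoid the coercion warning than ignore it, one option is to pick the fill value per column type. The following is only a sketch under the assumption that "0" is an acceptable fill for character columns; the variable names na_rows and fill are illustrative and not part of the original answer.

```r
library(data.table)

# Rebuild the example table from the question
aa <- data.table(V1 = 1:10, V2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4))
bb <- data.table(V1 = 3:6, X = letters[1:4])
setkey(aa, V1)
setkey(bb, V1)
tt <- bb[aa]

for (j in seq_along(tt)) {
  # Row indices of the missing values in column j
  na_rows <- which(is.na(tt[[j]]))
  # Assumption: character columns get "0", every other column gets 0
  fill <- if (is.character(tt[[j]])) "0" else 0
  # Update by reference; no copy of the table is made
  set(tt, i = na_rows, j = j, value = fill)
}

print(tt)
```

Column X stays a character column, with "0" in the rows that were NA, and the numeric columns are left with their original type.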
# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
for (i in seq_along(tt))
set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
# user system elapsed
# 0.284 0.083 0.386
# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
# user system elapsed
# 4.110 0.976 5.187
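For repeated use, the loop can be wrapped in a small function. This is a sketch only: replace_na_by_ref is a hypothetical name, not a data.table function, and the table here is smaller than the benchmark above so that it runs quickly. Comparing address(tt) before and after is one way to check that the table object itself was not replaced by a copy.

```r
library(data.table)

# Hypothetical helper: replace NA with `value` in every column, by reference
replace_na_by_ref <- function(dt, value = 0) {
  for (j in seq_along(dt)) {
    set(dt, i = which(is.na(dt[[j]])), j = j, value = value)
  }
  invisible(dt)
}

set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e5 * 3, TRUE), ncol = 3))

before <- address(tt)        # memory address of the table before the update
replace_na_by_ref(tt)

any(sapply(tt, anyNA))           # checks whether any column still holds NA
identical(before, address(tt))   # checks whether tt is still the same object
```

Because set modifies its argument in place, the function does not need to return the table for the change to take effect; it returns it invisibly only so that calls can be chained.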