NA in data.table
Question
I have a data.table that contains some groups. I operate on each group; some groups return numbers and others return NA. For some reason data.table has trouble putting everything back together. Is this a bug, or am I misunderstanding something? Here is an example:
library(data.table)
dtb <- data.table(a = 1:10)
f <- function(x) {if (x == 9) {return(NA)} else {return(x)}}
dtb[, f(a), by = a]
Error in `[.data.table`(dtb, , f(a), by = a) :
columns of j don't evaluate to consistent types for each group: result for group 9 has column 1 type 'logical' but expecting type 'integer'
My understanding was that NA is compatible with numbers in R, since clearly we can have a data.table that has NA values. I realize I can return NULL and that will work fine, but the issue is with NA.
Recommended answer
From ?NA:
NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language.
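The distinction quoted above can be checked directly in base R. The following is a small illustrative sketch (it is not part of the original answer) showing that a bare NA is logical while the typed constants carry their own type:

```r
# Base R only: the plain NA constant is logical; the typed constants are not.
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# Inside a vector, NA is coerced silently, so c(1L, NA) stays integer.
typeof(c(1L, NA))      # "integer"

# Returned on its own (as in the question's f), NA keeps its logical type.
identical(NA, NA_integer_)              # FALSE
identical(as.integer(NA), NA_integer_)  # TRUE
```

This is why a column can hold NA values without trouble, while a group result that consists only of a bare NA is reported as 'logical'.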
You will have to specify the correct type of NA for your function to work.
You can coerce within the function to match the type of x (note that we need any() for this to work when a subset has more than one row):
f <- function(x) {if (any(x == 9)) {return(as(NA, class(x)))} else {return(x)}}
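As a quick check of the coercion idea, here is a minimal base R sketch, written out over several lines, that inspects the type f returns. It assumes x is an atomic vector with a single class, such as integer; the explicit library(methods) call is only there so that as() is available when the script is run non-interactively.

```r
library(methods)  # provides as(); attached by default in interactive sessions

# Return an NA of the same type as the input, so every group agrees on type.
f <- function(x) {
  if (any(x == 9)) {
    as(NA, class(x))  # e.g. an integer NA when x is integer
  } else {
    x
  }
}

typeof(f(9L))    # "integer"
is.na(f(9L))     # TRUE
typeof(f(1:3))   # "integer"
f(1:3)           # 1 2 3 (unchanged, no 9 present)
```

With every group now returning an integer, the grouped call from the question should no longer hit the inconsistent-type error.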
A more data.table-ish approach
It might make more data.table sense to use set (or :=) to set / replace by reference.
set(dtb, i = which(dtb[,a]==9), j = 'a', value=NA_integer_)
Or := with a vector scan for a == 9:
dtb[a == 9, a := NA_integer_]
Or := with a binary search:
setkeyv(dtb, 'a')
dtb[J(9), a := NA_integer_]
Useful to note
If you use the := or set approaches, you don't appear to need to specify the NA type.
The following both work:
dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
dtb[a==9,a := NA]
dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
set(dtb, which(dtb[,a] == 9), 'a', NA)
An untyped NA combined with the J(9) form, by contrast, gives a very useful error message that tells you the reason and the solution:
Error in `[.data.table`(DTc, J(9), `:=`(a, NA)) : Type of RHS ('logical') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)
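Following the hint in that message, one way to coerce the RHS yourself is sketched below. It assumes the data.table package is installed and reuses the small dtb table from the question; as.integer(NA) is equivalent to writing NA_integer_.

```r
library(data.table)

dtb <- data.table(a = 1:10)
setkeyv(dtb, "a")

# Coerce the RHS explicitly so its type matches the integer column a.
dtb[J(9L), a := as.integer(NA)]

typeof(dtb$a)      # "integer"
sum(is.na(dtb$a))  # 1
```

Being explicit about the type keeps the assignment working regardless of whether a given data.table version coerces a bare NA for you.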
Which is fastest
With a reasonably large data set, where a is replaced in situ:
library(data.table)
set.seed(1)
n <- 1e+07
DT <- data.table(a = sample(15, n, TRUE))
setkeyv(DT, "a")
DTa <- copy(DT)
DTb <- copy(DT)
DTc <- copy(DT)
DTd <- copy(DT)
DTe <- copy(DT)
f <- function(x) {
if (any(x == 9)) {
return(as(NA, class(x)))
} else {
return(x)
}
}
system.time({DT[a == 9, `:=`(a, NA_integer_)]})
## user system elapsed
## 0.95 0.24 1.20
system.time({DTa[a == 9, `:=`(a, NA)]})
## user system elapsed
## 0.74 0.17 1.00
system.time({DTb[J(9), `:=`(a, NA_integer_)]})
## user system elapsed
## 0.02 0.00 0.02
system.time({set(DTc, which(DTc[, a] == 9), j = "a", value = NA)})
## user system elapsed
## 0.49 0.22 0.67
system.time({set(DTd, which(DTd[, a] == 9), j = "a", value = NA_integer_)})
## user system elapsed
## 0.54 0.06 0.58
system.time({DTe[, `:=`(a, f(a)), by = a]})
## user system elapsed
## 0.53 0.12 0.66
# They are all the same!
all(identical(DT, DTa), identical(DT, DTb), identical(DT, DTc), identical(DT,
DTd), identical(DT, DTe))
## [1] TRUE
Unsurprisingly the binary search approach is the fastest