NA in data.table


Question

I have a data.table that contains some groups. I operate on each group and some groups return numbers, others return NA. For some reason data.table has trouble putting everything back together. Is this a bug or am I misunderstanding? Here is an example:

library(data.table)

dtb <- data.table(a = 1:10)
# Returns a plain (logical) NA for 9 and the input otherwise
f <- function(x) {if (x == 9) {return(NA)} else {return(x)}}
dtb[, f(a), by = a]

Error in `[.data.table`(dtb, , f(a), by = a) : 
  columns of j don't evaluate to consistent types for each group: result for group 9 has column 1 type 'logical' but expecting type 'integer'


My understanding was that NA is compatible with numbers in R since clearly we can have a data.table that has NA values. I realize I can return NULL and that will work fine but the issue is with NA.

Answer

From ?NA:

NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language.
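As a quick illustration of the quoted help text, here is a minimal base R sketch. The variable names (na_types, mixed, x) are made up for this example; it only shows that a plain NA is logical while the typed constants and coerced values carry their own type.

```r
library(methods)  # provides as(); attached by default in most R sessions

# A plain NA is logical; the typed constants carry their own type.
na_types <- c(
  plain     = typeof(NA),
  integer   = typeof(NA_integer_),
  real      = typeof(NA_real_),
  character = typeof(NA_character_)
)
print(na_types)

# Inside c(), the logical NA is coerced to the type of its neighbours,
# which is why an integer column can hold missing values.
mixed <- c(1L, NA, 3L)
print(typeof(mixed))

# as() coerces NA to the class of an existing vector.
x <- 1:3
print(typeof(as(NA, class(x))))
```

The by-group error in the question comes from the first case: the function returns a bare logical NA for one group and an integer for the others, with no surrounding vector to coerce it.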

For your function to work, it has to return an NA of the correct type.

You can coerce within the function to match the type of x. Note that we need any() for this to work when a subset has more than one row:

f <- function(x) {if (any(x == 9)) {return(as(NA, class(x)))} else {return(x)}}
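To check that the typed NA resolves the grouping problem, here is a small sketch that reruns the example from the question with the coercing version of f. It assumes the data.table package is installed; res is a name introduced only for this illustration, and the approach has only been checked here for an integer column.

```r
library(data.table)
library(methods)  # provides as(); attached by default in most R sessions

# Return a missing value of the same class as the input
# whenever the group contains a 9; otherwise return the input.
f <- function(x) {
  if (any(x == 9)) as(NA, class(x)) else x
}

dtb <- data.table(a = 1:10)

# Every group now yields an integer, so the per-group results can be combined.
res <- dtb[, f(a), by = a]   # the unnamed result column is called V1

print(res[a %in% 8:10])
print(typeof(res$V1))
```

Returning NA_integer_ directly would work just as well for this table; as(NA, class(x)) merely avoids hard-coding the type.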



A more data.table-ish approach

It might make more data.table sense to use set (or :=) to set / replace by reference.

set(dtb, i = which(dtb[,a]==9), j = 'a', value=NA_integer_)

Or := within [, using a vector scan for a == 9:

dtb[a == 9, a := NA_integer_]

Or := with a binary search:

setkeyv(dtb, 'a')
dtb[J(9), a := NA_integer_] 



Useful to note

If you use the := or set approaches, you don't appear to need to specify the NA type

Both of the following will work:

dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
dtb[a==9,a := NA]

dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
set(dtb, which(dtb[,a] == 9), 'a', NA)



However, combining a binary search with an untyped NA, as in DTc[J(9), a := NA], fails with a very useful error message that tells you both the reason and the solution:


Error in `[.data.table`(DTc, J(9), `:=`(a, NA)) : 
  Type of RHS ('logical') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)






Which is fastest?

A benchmark on a reasonably large data.table, replacing values of a in place:

library(data.table)

set.seed(1)
n <- 1e+07
DT <- data.table(a = sample(15, n, TRUE))
setkeyv(DT, "a")
DTa <- copy(DT)
DTb <- copy(DT)
DTc <- copy(DT)
DTd <- copy(DT)
DTe <- copy(DT)

f <- function(x) {
    if (any(x == 9)) {
        return(as(NA, class(x)))
    } else {
        return(x)
    }
}

system.time({DT[a == 9, `:=`(a, NA_integer_)]})
##    user  system elapsed 
##    0.95    0.24    1.20 
system.time({DTa[a == 9, `:=`(a, NA)]})
##    user  system elapsed 
##    0.74    0.17    1.00 
system.time({DTb[J(9), `:=`(a, NA_integer_)]})
##    user  system elapsed 
##    0.02    0.00    0.02 
system.time({set(DTc, which(DTc[, a] == 9), j = "a", value = NA)})
##    user  system elapsed 
##    0.49    0.22    0.67 
system.time({set(DTd, which(DTd[, a] == 9), j = "a", value = NA_integer_)})
##    user  system elapsed 
##    0.54    0.06    0.58 
system.time({DTe[, `:=`(a, f(a)), by = a]})
##    user  system elapsed 
##    0.53    0.12    0.66 
# They are all the same!
all(identical(DT, DTa), identical(DT, DTb), identical(DT, DTc), identical(DT, 
    DTd), identical(DT, DTe))
## [1] TRUE

Unsurprisingly, the binary search approach is the fastest.
