data.table subsetting by NaN doesn't work
Problem description
I have a column in a data table with NaN values. Something like:

    my.dt <- data.table(x = c(NaN, NaN, NaN, .1, .2, .2, .3), y = c(2, 4, 6, 8, 10, 12, 14))
    setkey(my.dt, x)
I can use the J() function to find all instances where the x column is equal to .2:

    > my.dt[J(.2)]
         x  y
    1: 0.2 10
    2: 0.2 12
But if I try to do the same thing with NaN, it doesn't work.

    > my.dt[J(NaN)]
         x  y
    1: NaN NA
I would expect:

         x y
    1: NaN 2
    2: NaN 4
    3: NaN 6
What gives? I can't find anything in the data.table documentation to explain why this is happening (although it may just be that I don't know what to look for). Is there any way to get what I want? Ultimately, I'd like to replace all of the NaN values with zero, using something like:

    my.dt[J(NaN), x := 0]
Solution

Update: This was fixed a while back, in v1.9.2. From NEWS:

NA, NaN, +Inf and -Inf are now considered distinct values, may be in keys, can be joined to and can be grouped. data.table defines: NA < NaN < -Inf. Thanks to Martin Liberts for the suggestions, #4684, #4815 and #4883.

    require(data.table) ## 1.9.2+
    my.dt[J(NaN)]
    #      x y
    # 1: NaN 2
    # 2: NaN 4
    # 3: NaN 6
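As a rough sketch of what the NEWS entry implies (assuming data.table 1.9.2 or later; the table `dt` and its values are made up for illustration and are not from the original post), a keyed table should sort NA first, then NaN, then -Inf, and a keyed join on NaN can be combined with := to do the replacement asked about in the question:

```r
library(data.table)  # assumes v1.9.2 or later

# Illustrative table (hypothetical, not from the original post)
dt <- data.table(x = c(Inf, 1, NaN, -Inf, NA), y = 1:5)
setkey(dt, x)

# Key order should follow NA < NaN < -Inf < finite values < +Inf
key.order <- dt$x

# Keyed join on NaN: matches the NaN row only, not the NA row
nan.rows <- dt[J(NaN)]

# Replace NaN with zero by reference through the same join
dt[J(NaN), x := 0]
```

Because x is the key column, updating it this way drops the key, so call setkey() again if later joins depend on it.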
This issue is part design choice, part bug. There are several questions on SO and a few emails on the listserv exploring NAs in a data.table key. The main idea is outlined in the FAQ: NAs are treated as FALSE.
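To make the "NAs are treated as FALSE" point concrete, here is a small sketch comparing a logical subset on a data.frame with the same subset on a data.table. The objects `df` and `dt` are made up for illustration.

```r
library(data.table)

# Hypothetical data: one NA in the column being tested
df <- data.frame(x = c(1, 2, NA))
dt <- as.data.table(df)

# Base R keeps a row of NAs where the condition evaluates to NA
base.result <- df[df$x != 1, , drop = FALSE]

# data.table treats NA in a logical i as FALSE, so that row is dropped
dt.result <- dt[x != 1]
```

So a condition like x != 1 silently excludes NA rows in data.table; add an explicit is.na(x) term if they should be kept.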
Please feel free to chime in on the conversation on the mailing list. There was a conversation started by @Arun:
http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-td4669097.html
Also, you can read more in the answers and comments to any of the following questions on SO:
subsetting a data.table using !=<some non-NA> excludes NA too
NA in `i` expression of data.table (possible bug)
DT[!(x == .)] and DT[x != .] treat NA in x inconsistently
In the meantime, your best bet is to use is.na.

While it is slower than a radix search, it is still faster than most vector searches in R, and certainly much, much faster than any fancy workarounds:

    library(microbenchmark)
    microbenchmark(my.dt[.(1)], my.dt[is.na(ID)], my.dt[ID==1], my.dt[!!!(ID)])
    # Unit: milliseconds
    #                expr    median
    #         my.dt[.(1)]  1.309948
    #    my.dt[is.na(ID)]  3.444689   <~~ Not bad
    #      my.dt[ID == 1]  4.005093
    #  my.dt[!(!(!(ID)))] 10.038134

    ### using the following for my.dt
    my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
    setnames(my.dt, 1, "ID")
    my.dt[sample(1e5, 1e3), ID := NA]
    setkey(my.dt, ID)
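Applied to the table from the question, the vector-scan approach might look like the sketch below. It uses base R's is.nan() rather than is.na(), on the assumption that only NaN (and not NA) should be replaced; is.na() returns TRUE for both. The name `nan.rows` is introduced here only for illustration.

```r
library(data.table)

# Table from the question, left unkeyed since a vector scan needs no key
my.dt <- data.table(x = c(NaN, NaN, NaN, .1, .2, .2, .3),
                    y = c(2, 4, 6, 8, 10, 12, 14))

# Select the rows where x is NaN
nan.rows <- my.dt[is.nan(x)]

# Replace NaN with zero by reference
my.dt[is.nan(x), x := 0]
```

This works on versions before the v1.9.2 fix as well, since it does not rely on NaN being matched in a key.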