data.table subsetting by NaN doesn't work


Problem description


I have a column in a data table with NaN values. Something like:

my.dt <- data.table(x = c(NaN, NaN, NaN, .1, .2, .2, .3), y = c(2, 4, 6, 8, 10, 12, 14))
setkey(my.dt, x)

I can use the J() function to find all rows where the x column is equal to .2:

> my.dt[J(.2)]

     x  y
1: 0.2 10
2: 0.2 12

But if I try to do the same thing with NaN it doesn't work.

> my.dt[J(NaN)]

     x  y
1: NaN NA

I would expect:

     x  y
1: NaN  2
2: NaN  4
3: NaN  6

What gives? I can't find anything in the data.table documentation to explain why this is happening (although it may just be that I don't know what to look for). Is there any way to get what I want? Ultimately, I'd like to replace all of the NaN values with zero, using something like my.dt[J(NaN), x := 0]

Solution

Update: This has been fixed a while back, in v1.9.2. From NEWS:

NA, NaN, +Inf and -Inf are now considered distinct values, may be in keys, can be joined to and can be grouped. data.table defines: NA < NaN < -Inf. Thanks to Martin Liberts for the suggestions, #4684, #4815 and #4883.

require(data.table) ## 1.9.2+
my.dt[J(NaN)]
#      x  y
# 1: NaN  2
# 2: NaN  4
# 3: NaN  6
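
As a rough illustration of the NEWS entry quoted above, the sketch below keys a small table on a column that contains every special value and then joins on NaN. The table `dt` and its `id` column are hypothetical names made up for this example, and it assumes data.table 1.9.2 or later is installed.

```r
library(data.table)  # assumes data.table 1.9.2 or later

# Hypothetical table: one key column holding each special value once.
dt <- data.table(x = c(Inf, 1, NaN, -Inf, NA), id = 1:5)
setkey(dt, x)

# Keyed order should follow NA < NaN < -Inf < finite values < +Inf.
print(dt)

# A keyed join on NaN should match the NaN row only, not the NA row.
nan_rows <- dt[J(NaN)]
print(nan_rows)
```

Because NA and NaN are treated as distinct key values, the join is expected to pick up only the row whose `id` is 3.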


This issue is part design choice, part bug. There are several questions on SO and a few emails on the listserv exploring NAs in data.table keys.

The main idea, outlined in the FAQ, is that NAs are treated as FALSE.

Please feel free to chime in on the conversation in the mailing list. There was a conversation started by @Arun:

http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-td4669097.html

Also, you can read more in the answers and comments to any of the following questions on SO:

subsetting a data.table using !=<some non-NA> excludes NA too
NA in `i` expression of data.table (possible bug)
DT[!(x == .)] and DT[x != .] treat NA in x inconsistently


In the meantime, your best bet is to use is.na.
While it is slower than a radix search, it is still faster than most vector searches in R, and certainly much faster than any fancy workaround:

### build my.dt first
library(data.table)
library(microbenchmark)
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]   # set 1,000 randomly chosen IDs to NA
setkey(my.dt, ID)

### then compare the keyed lookup with the vector scans
microbenchmark(my.dt[.(1)], my.dt[is.na(ID)], my.dt[ID==1], my.dt[!!!(ID)])
# Unit: milliseconds
#                expr    median
#         my.dt[.(1)]  1.309948
#    my.dt[is.na(ID)]  3.444689   <~~ Not bad
#      my.dt[ID == 1]  4.005093
#  my.dt[!(!(!(ID)))] 10.038134
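
For the original goal of replacing NaN with zero, one detail of the is.na advice is worth spelling out: in base R, is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN. The sketch below is a minimal example under that assumption, not the answer author's code. The table `nan.dt` is a hypothetical variant of the question's table with one genuine NA added for contrast.

```r
library(data.table)

# Hypothetical variant of the question's table, with one real NA added.
nan.dt <- data.table(x = c(NaN, NaN, NaN, NA, .2, .2, .3),
                     y = c(2, 4, 6, 8, 10, 12, 14))

# is.na() counts NA and NaN together; is.nan() counts NaN only.
n_na  <- nan.dt[is.na(x), .N]   # 4
n_nan <- nan.dt[is.nan(x), .N]  # 3

# Replace only the NaN values with zero, by reference; the NA is left alone.
nan.dt[is.nan(x), x := 0]
print(nan.dt)
```

Using is.nan() in `i` avoids the keyed join entirely, so it does not depend on how a given data.table version handles NaN in keys.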
