Select NA in a data.table in R

Published: 2017/3/12 10:01:19 · Tags: r, select, data.table, missing-data, na


Question

How do I select all the rows that have a missing value in the primary key of a data.table?

DT = data.table(x=rep(c("a","b",NA),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)

Selecting a particular value is easy:

DT["a",]

Selecting the missing values seems to require a vector scan; binary search cannot be used. Am I correct?

DT[NA,]        # does not work
DT[is.na(x),]  # does work

Recommended answer

Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice this may not really matter much:

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)  

benchmark(DT["a",],
          DT[is.na(x),],
          replications=20)
#             test replications elapsed relative user.self sys.self user.child
# 1      DT["a", ]           20    9.18    1.000      7.31     1.83         NA
# 2 DT[is.na(x), ]           20   10.55    1.149      8.69     1.85         NA

 === 
Addition from Matthew (won't fit in a comment):

The data above has 3 very large groups, though. So the speed advantage of binary search is dominated here by the time needed to create the large subset (1/3 of the data is copied).

benchmark(DT["a",],  # repeat select of large subset on my netbook
    DT[is.na(x),],
    replications=3)
#           test replications elapsed relative user.self sys.self
#      DT["a", ]            3   2.406    1.000     2.357    0.044
# DT[is.na(x), ]            3   3.876    1.611     3.812    0.056

benchmark(DT["a",which=TRUE],   # isolate search time
    DT[is.na(x),which=TRUE],
    replications=3)
#                       test replications elapsed relative user.self sys.self
#      DT["a", which = TRUE]            3   0.492    1.000     0.492    0.000
# DT[is.na(x), which = TRUE]            3   2.941    5.978     2.932    0.004
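
To make the role of `which = TRUE` in the benchmark above concrete, here is a minimal sketch on the small table from the question. It only illustrates that `which = TRUE` returns matching row numbers instead of copying the rows, which is why it isolates the search cost. The names `idx_a` and `idx_na` are introduced here purely for illustration.

```r
library(data.table)

# Small table from the question, keyed on x
DT <- data.table(x = rep(c("a", "b", NA), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)
setkey(DT, x)

# which = TRUE returns row numbers rather than the subset itself,
# so no rows are copied and only the search is performed
idx_a  <- DT["a", which = TRUE]       # keyed lookup (binary search)
idx_na <- DT[is.na(x), which = TRUE]  # vector scan over x

# The row numbers can then be used to materialise the subsets
DT[idx_a]
DT[idx_na]
```

On a table this small the two approaches are indistinguishable in speed; the difference only shows at the scale used in the benchmarks.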

As the size of the subset returned decreases (e.g. by adding more groups), the difference becomes apparent. Vector scans on a single column aren't too bad, but on 2 or more columns performance degrades quickly.
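
A rough way to try this yourself is sketched below, assuming a hypothetical table `DT2` with many small groups and a two-column key (the table, its size, and the looked-up values are illustrative, not from the original answer). No timings are quoted because they depend on the machine and on the data.table version; recent versions may optimise some `==` filters automatically, which can narrow the gap.

```r
library(data.table)

set.seed(1)
n <- 1e5

# Hypothetical data: 27 values of x (including NA) crossed with 100 values of y
DT2 <- data.table(x = sample(c(letters, NA), n, replace = TRUE),
                  y = sample(1:100, n, replace = TRUE),
                  v = seq_len(n))
setkey(DT2, x, y)

# Keyed lookup on both key columns (binary search)
keyed <- DT2[.("a", 3L)]

# Equivalent filter written as a vector scan over two columns
scanned <- DT2[x == "a" & y == 3L]

# Both approaches select the same rows
identical(keyed$v, scanned$v)

# Compare timings locally; results will vary
system.time(for (i in 1:100) DT2[.("a", 3L)])
system.time(for (i in 1:100) DT2[x == "a" & y == 3L])
```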

Maybe NAs should be joinable too. I seem to remember a gotcha with that, though. Here's some history linked from FR#1043, "Allow or disallow NA in keys?". It mentions there that NA_integer_ is internally a negative integer. That trips up radix/counting sort (iirc), resulting in setkey going slower. But it's on the list to revisit.
