Subsetting a data.table using != <some non-NA> excludes NA too
Question
I have a data.table with a column that has NAs. I want to drop rows where that column takes a particular value (which happens to be ""). However, my first attempt led me to lose the rows with NAs as well:
> a = c(1,"",NA)
> x <- data.table(a);x
a
1: 1
2:
3: NA
> y <- x[a!=""];y
a
1: 1
After looking at ?`!=`, I found a one-liner that works, but it's a pain:
> z <- x[!sapply(a,function(x)identical(x,""))]; z
a
1: 1
2: NA
Is there a better way to do this? Also, I see no good way of extending this to exclude multiple non-NA values. Here's a bad way:
> drop_these <- function(these,where){
+ argh <- !sapply(where,
+ function(x)unlist(lapply(as.list(these),function(this)identical(x,this)))
+ )
+ if (is.matrix(argh)){argh <- apply(argh,2,all)}
+ return(argh)
+ }
> x[drop_these("",a)]
a
1: 1
2: NA
> x[drop_these(c(1,""),a)]
a
1: NA
I looked at ?J and tried things out with a data.frame, which seems to work differently, keeping NAs when subsetting:
> w <- data.frame(a,stringsAsFactors=F); w
a
1 1
2
3 <NA>
> d <- w[a!="",,drop=F]; d
a
1 1
NA <NA>
Recommended answer
To provide a solution to your question: you should use %in%. It gives you back a logical vector that never contains NA.
a %in% ""
# [1] FALSE TRUE FALSE
x[!a %in% ""]
# a
# 1: 1
# 2: NA
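As a minimal base R sketch of why this works, and of how it extends to excluding several values at once (the names keep_one, drop_vals and keep_many are made up for illustration): != propagates NA, while %in% is built on match() and returns FALSE for an unmatched NA.

```r
# Same vector as in the question; c() coerces everything to character.
a <- c(1, "", NA)          # "1", "", NA

# Comparison operators propagate NA.
neq <- a != ""             # TRUE FALSE NA

# %in% never returns NA: an NA that is not in the table is simply "not matched".
keep_one <- !a %in% ""     # parsed as !(a %in% ""): TRUE FALSE TRUE

# Excluding several non-NA values only needs a longer right-hand side.
drop_vals <- c("1", "")    # hypothetical set of values to drop
keep_many <- !a %in% drop_vals   # FALSE FALSE TRUE

a[keep_one]                # "1" NA
a[keep_many]               # NA
```

Either logical vector can be used in i, for example x[!a %in% drop_vals], in place of the drop_these() helper from the question.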
To find out why this happens with data.table (as opposed to data.frame):
If you look at the data.table source code in the file data.table.R, under the function "[.data.table", there's a set of if-statements that check the i argument. One of them is:
if (!missing(i)) {
# Part (1)
isub = substitute(i)
# Part (2)
if (is.call(isub) && isub[[1L]] == as.name("!")) {
notjoin = TRUE
if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
nomatch = 0L
isub = isub[[2L]]
}
.....
# "isub" is being evaluated using "eval" to result in a logical vector
# Part 3
if (is.logical(i)) {
# see DT[NA] thread re recycling of NA logical
if (identical(i,NA)) i = NA_integer_
# avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
else i[is.na(i)] = FALSE
}
....
}
To explain the discrepancy, I've pasted the important piece of code here and marked it in 3 parts.
First, x[a != ""] (your first attempt). Part 1 produces an object of class call. The condition of the if statement in part 2 is FALSE, because the function being called is `!=`, not `!`. Following that, the call is evaluated to give c(TRUE, FALSE, NA). Then part 3 is executed, so NA is replaced with FALSE (the last line of the is.logical block), and the NA row is dropped along with the "" row.
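The three parts can be imitated in base R to make this concrete. This is only a sketch based on the excerpt quoted above: emulate_i is a hypothetical helper written for illustration, not a data.table function, and it ignores everything else that "[.data.table" does.

```r
a <- c(1, "", NA)   # "1", "", NA

# Hypothetical helper mirroring parts 1-3 of the quoted excerpt.
emulate_i <- function(i) {
  isub <- substitute(i)                                    # part 1: unevaluated call
  notjoin <- is.call(isub) && isub[[1L]] == as.name("!")   # part 2: leading "!"?
  if (notjoin) isub <- isub[[2L]]                          # strip the "!"
  val <- eval(isub, parent.frame())                        # evaluate in the caller
  if (is.logical(val)) val[is.na(val)] <- FALSE            # part 3: NA -> FALSE
  list(notjoin = notjoin, i = val)
}

res <- emulate_i(a != "")
res$notjoin   # FALSE: the call's function is `!=`, not `!`
res$i         # TRUE FALSE FALSE: the NA has become FALSE, so row 3 is dropped
```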
Second, x[!(a == "")]. Part 1 returns a call once again. But this time the condition in part 2 evaluates to TRUE and therefore sets:
1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)
That is where the magic happens. The negation has been removed for now. And remember, this is still an object of class call, so it gets evaluated (using eval) to a logical vector again. So (a == "") evaluates to c(FALSE, TRUE, NA).
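The stripping step can be reproduced on a quoted expression in base R. This is a small sketch of the idea only; the variable names inner and val are illustrative.

```r
a <- c(1, "", NA)            # "1", "", NA

isub <- quote(!(a == ""))    # what substitute(i) would capture for x[!(a == "")]

# The first element of a call is the function being called.
is_not <- is.call(isub) && isub[[1L]] == as.name("!")   # TRUE

# The second element is the argument of `!`: the parenthesised comparison.
inner <- isub[[2L]]          # (a == ""), still an unevaluated call

# Only now is it evaluated; the comparison still propagates NA.
val <- eval(inner)           # FALSE TRUE NA
```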
Now, this is checked for is.logical in part 3, so NA gets replaced with FALSE and the vector becomes c(FALSE, TRUE, FALSE). At some point later, which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2), seq_len(nrow(x))[-2] = c(1,3) is returned. So x[!(a=="")] basically returns x[c(1,3)], which is the desired result. Here's the relevant code snippet:
if (notjoin) {
if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
irows = irows[irows!=0L]
# WHERE MAGIC HAPPENS (returns c(1,3))
i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL # NULL meaning all rows i.e. seq_len(nrow(x))
# Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
# column when irows contains negatives.
}
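Putting the two paths side by side in base R shows where they diverge. This sketch only imitates the row-index arithmetic from the excerpts above, with n standing in for nrow(x); it does not call data.table, and the variable names are illustrative.

```r
n <- 3L                              # stands in for nrow(x)

# Path 1: x[a != ""], no not-join.
i_direct <- c(TRUE, FALSE, NA)       # result of a != ""
i_direct[is.na(i_direct)] <- FALSE   # part 3: NA -> FALSE
rows_direct <- which(i_direct)       # 1: the NA row is lost

# Path 2: x[!(a == "")], not-join.
i_inner <- c(FALSE, TRUE, NA)        # result of (a == "") after "!" is stripped
i_inner[is.na(i_inner)] <- FALSE     # part 3: NA -> FALSE
irows <- which(i_inner)              # 2: rows to exclude
rows_notjoin <- if (length(irows)) seq_len(n)[-irows] else seq_len(n)   # 1 3
```

The NA is turned into FALSE before the negation is applied in the second path, which is why the NA row survives there but not in the first.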
Given that, I think there are some inconsistencies in the syntax. If I manage to find time to formulate the problem, I'll write a post soon.