Subsetting a data.table using !=<some non-NA> excludes NA too


Question

I have a data.table with a column that contains NAs. I want to drop the rows where that column takes a particular value (which happens to be ""). However, my first attempt led me to lose the rows with NAs as well:

> a = c(1,"",NA)
> x <- data.table(a);x
    a
1:  1
2:   
3: NA
> y <- x[a!=""];y
   a
1: 1



After looking at ?`!=`, I found a one-liner that works, but it's a pain:

> z <- x[!sapply(a,function(x)identical(x,""))]; z
    a
1:  1
2: NA



Is there a better way to do this? Also, I see no good way of extending this to exclude multiple non-NA values. Here's a bad way:

>     drop_these <- function(these,where){
+         argh <- !sapply(where,
+             function(x)unlist(lapply(as.list(these),function(this)identical(x,this)))
+         )
+         if (is.matrix(argh)){argh <- apply(argh,2,all)}
+         return(argh)
+     }
>     x[drop_these("",a)]
    a
1:  1
2: NA
>     x[drop_these(c(1,""),a)]
    a
1: NA

I looked at ?J and tried things out with a data.frame, which seems to work differently, keeping NAs when subsetting:

> w <- data.frame(a,stringsAsFactors=F); w
     a
1    1
2     
3 <NA>
> d <- w[a!="",,drop=F]; d
      a
1     1
NA <NA>
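A side note on the data.frame output above: the row labelled NA is not the original third row. Indexing with a logical NA produces a row filled with NA, which only looks like the original because column a was NA there anyway. The sketch below uses a hypothetical second column b (not part of the question) to make this visible, and shows that wrapping the condition in which() drops the NA position in base R.

```r
# Hypothetical two-column frame; column `b` is added only for illustration.
w2 <- data.frame(a = c("1", "", NA), b = 1:3, stringsAsFactors = FALSE)

# The logical index is c(TRUE, FALSE, NA); the NA position yields an all-NA row,
# so b is NA in that row rather than 3.
d2 <- w2[w2$a != "", , drop = FALSE]
print(d2)

# which() keeps only the positions that are TRUE, so the NA position is dropped.
d3 <- w2[which(w2$a != ""), , drop = FALSE]
print(d3)
```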


Recommended answer

To provide a solution to your question:

You should use %in%. It gives you back a logical vector and, unlike == or !=, never returns NA.

a %in% ""
# [1] FALSE  TRUE FALSE

x[!a %in% ""]
#     a
# 1:  1
# 2: NA
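The same idea appears to cover the "multiple values" case from the question, since %in% accepts a vector on its right-hand side. This is a minimal base R sketch on the plain vector; drop_values and keep are illustrative names, and the data.table form would presumably be x[!a %in% drop_values].

```r
a <- c("1", "", NA)

# Values to exclude; illustrative, taken from the question's c(1, "") example
# (which R coerces to character).
drop_values <- c("1", "")

# %in% returns FALSE (not NA) for the NA element, so negating it keeps NA.
keep <- !(a %in% drop_values)
print(keep)

kept <- a[keep]
print(kept)
```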






To find out why this happens in data.table (as opposed to data.frame):

If you look at the data.table source code in the file data.table.R, under the function "[.data.table", there is a set of if-statements that check the i argument. One of them is:

if (!missing(i)) {
    # Part (1)
    isub = substitute(i)

    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }

    .....
    # "isub" is being evaluated using "eval" to result in a logical vector

    # Part 3
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_  
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE  
    }
    ....
}

To explain the discrepancy, I've pasted the important piece of code here and marked it in three parts.

First, consider x[a != ""]. Part 1 produces an object of class call. The second condition of the if-statement in Part 2 is FALSE, because the outermost function of the call is != rather than !. The call is then evaluated to give c(TRUE, FALSE, NA), and Part 3 is executed, so NA is replaced by FALSE (the last line of the is.logical block). Only row 1 is therefore kept.
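The path taken by x[a != ""] can be imitated in base R. This is only a sketch of the logic described above, not the data.table internals; the variable names i and irows are borrowed from the quoted source for readability.

```r
a <- c("1", "", NA)

# Evaluating the call a != "" gives NA for the NA element.
i <- a != ""
print(i)

# Mirrors Part 3: NA entries of the logical vector are treated as FALSE.
i[is.na(i)] <- FALSE

# Only the TRUE positions survive, so the NA row is dropped along with "".
irows <- which(i)
print(irows)
```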

Now consider x[!(a == "")]. Part 1 returns a call once again. But this time Part 2 evaluates to TRUE and therefore sets:

1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)

That is where the magic happens. The negation has been removed for now. Remember that this is still an object of class call, so it gets evaluated (using eval) to a logical vector again: (a == "") evaluates to c(FALSE, TRUE, NA).

Now this is checked with is.logical in Part 3, so NA is replaced by FALSE and the vector becomes c(FALSE, TRUE, FALSE). At some later point, which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from Part 2), seq_len(nrow(x))[-2] = c(1,3) is returned. So x[!(a == "")] basically returns x[c(1,3)], which is the desired result. Here's the relevant code snippet:

if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}
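The not-join path can be imitated in base R in the same spirit. Again, this is a sketch of the steps described above under the assumption that they run in this order; cond and kept_rows are illustrative names, not identifiers from the data.table source.

```r
a <- c("1", "", NA)

# With the leading "!" stripped, the remaining call a == "" is evaluated.
cond <- a == ""
print(cond)

# Mirrors Part 3: NA is treated as FALSE, i.e. "not a match".
cond[is.na(cond)] <- FALSE

# Rows that matched the condition.
irows <- which(cond)

# Mirrors the notjoin block: keep every row except the matches.
# If nothing matched, all rows are kept.
kept_rows <- if (length(irows)) seq_along(a)[-irows] else seq_along(a)
print(kept_rows)
```

Because the NA is turned into FALSE before the negation is applied, the NA row ends up among the kept rows, which is the opposite of what happens for a != "".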

Given that, I think there are some inconsistencies in the syntax. If I manage to find the time to formulate the problem, I'll write a post soon.
