使用 !=<some non-NA> 对 data.table 进行子集化也不包括 NA [英] subsetting a data.table using !=<some non-NA> excludes NA too

查看:9
本文介绍了使用 !=<some non-NA> 对 data.table 进行子集化也不包括 NA的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

I have a data.table with a column that has NAs. I want to drop rows where that column takes a particular value (which happens to be ""). However, my first attempt lead me to lose rows with NAs as well:

> a = c(1,"",NA)
> x <- data.table(a);x
    a
1:  1
2:   
3: NA
> y <- x[a!=""];y
   a
1: 1

After looking at ?`!=`, I found a one liner that works, but it's a pain:

> z <- x[!sapply(a,function(x)identical(x,""))]; z
    a
1:  1
2: NA

I'm wondering if there's a better way to do this? Also, I see no good way of extending this to excluding multiple non-NA values. Here's a bad way:

>     drop_these <- function(these,where){
+         argh <- !sapply(where,
+             function(x)unlist(lapply(as.list(these),function(this)identical(x,this)))
+         )
+         if (is.matrix(argh)){argh <- apply(argh,2,all)}
+         return(argh)
+     }
>     x[drop_these("",a)]
    a
1:  1
2: NA
>     x[drop_these(c(1,""),a)]
    a
1: NA

I looked at ?J and tried things out with a data.frame, which seems to work differently, keeping NAs when subsetting:

> w <- data.frame(a,stringsAsFactors=F); w
     a
1    1
2     
3 <NA>
> d <- w[a!="",,drop=F]; d
      a
1     1
NA <NA>

解决方案

To provide a solution to your question:

You should use %in%. It gives you back a logical vector.

a %in% ""
# [1] FALSE  TRUE FALSE

x[!a %in% ""]
#     a
# 1:  1
# 2: NA


To find out why this is happening in data.table:

(as opposted to data.frame)

If you look at the data.table source code on the file data.table.R under the function "[.data.table", there's a set of if-statements that check for i argument. One of them is:

if (!missing(i)) {
    # Part (1)
    isub = substitute(i)

    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }

    .....
    # "isub" is being evaluated using "eval" to result in a logical vector

    # Part 3
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_  
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE  
    }
    ....
}

To explain the discrepancy, I've pasted the important piece of code here. And I've also marked them into 3 parts.

First, why dt[a != ""] doesn't work as expected (by the OP)?

First, part 1 evaluates to an object of class call. The second part of the if statement in part 2 returns FALSE. Following that, the call is "evaluated" to give c(TRUE, FALSE, NA) . Then part 3 is executed. So, NA is replaced to FALSE (the last line of the logical loop).

why does x[!(a== "")] work as expected (by the OP)?

part 1 returns a call once again. But, part 2 evaluates to TRUE and therefore sets:

1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)

That is where the magic happened. The negation has been removed for now. And remember, this is still an object of class call. So this gets evaluated (using eval) to logical again. So, (a=="") evaluates to c(FALSE, TRUE, NA).

Now, this is checked for is.logical in part 3. So, here, NA gets replaced to FALSE. It therefore becomes, c(FALSE, TRUE, FALSE). At some point later, a which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2) seq_len(nrow(x))[-2] = c(1,3) is returned. so, x[!(a=="")] basically returns x[c(1,3)] which is the desired result. Here's the relevant code snippet:

if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}

Given that, I think there are some inconsistencies with the syntax.. And if I manage to get time to formulate the problem, then I'll write a post soon.

这篇关于使用 !=&lt;some non-NA&gt; 对 data.table 进行子集化也不包括 NA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆