Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter

Question

I have a dataset df and I would like to remove all rows for which variable y does not have the value a. Variable y also contains some NAs:

df <- data.frame(x=1:3, y=c('a', NA, 'c'))

I can achieve this using R's indexing syntax like this:

df[df$y!='a',]

  x    y
  2 <NA>
  3    c

Note this returns both the NA and the value c - which is what I want.

However, when I try the same thing using subset or dplyr::filter, the NA gets stripped out:

subset(df, y!='a')

  x    y
  3    c

dplyr::filter(df, y!='a')

  x    y
  3    c

Why do subset and dplyr::filter work like this? It seems illogical to me: an NA is not the same as a, so why strip out the NA when I specify that I want all rows except those where variable y equals a?

And is there some way to change the behaviour of these functions, other than explicitly asking for NAs to get returned, i.e.

subset(df, y!='a' | is.na(y))

Thanks

Recommended answer

Your example of the "expected" behavior doesn't actually return what you display in your question. I get:

> df[df$y != 'a',]
    x    y
NA NA <NA>
3   3    c

This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns:

> df$y != 'a'
[1] FALSE    NA  TRUE

So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
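The same effect can be seen without a data frame. This is only a minimal sketch; the names na_cmp and picked are made up for illustration:

```r
# Comparing an unknown value with anything gives an unknown result.
na_cmp <- NA != "a"

# An NA inside a logical index yields an NA element at that position,
# rather than keeping or dropping the corresponding value.
picked <- (1:3)[c(FALSE, NA, TRUE)]

print(na_cmp)
print(picked)
```

With a data frame, that NA element in the index becomes a whole row of NAs.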

Many people dislike this behavior, but it is what it is.

subset and dplyr::filter make a different default choice which is to simply drop the NA rows, which arguably is accurate-ish.
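For base R, the documentation of subset says that missing values in the condition are taken as false. A rough equivalent with plain indexing might look like the sketch below (the names keep, manual and via_subset are made up for illustration, and dplyr is left out so the snippet needs no extra packages):

```r
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# Logical condition; its second element is NA.
keep <- df$y != "a"

# Treat NA in the condition as FALSE, so the NA row is dropped.
manual <- df[keep & !is.na(keep), ]

# subset() applied to the same condition.
via_subset <- subset(df, y != "a")

print(manual)
print(via_subset)
```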

But really, the lesson here is that if your data has NAs, you need to code defensively around them at all points, either by using conditions like is.na(df$y) | df$y != 'a' or, as mentioned in the other answer, by using %in%, which is based on match.
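Both defensive forms can be sketched on the example data as follows (keep_or, keep_in and res are illustrative names only):

```r
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# Option 1: explicitly keep rows where y is NA.
keep_or <- is.na(df$y) | df$y != "a"

# Option 2: %in% never returns NA, so negating it keeps the NA row.
keep_in <- !(df$y %in% "a")

# Either index keeps rows 2 and 3, with no all-NA row introduced.
res <- df[keep_in, ]
print(res)
```

Because neither logical vector contains NA, the same condition should also keep the NA row when passed to subset or dplyr::filter.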
