Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter

Problem description

I have a dataset df and I would like to remove all rows for which variable y does not have the value a. Variable y also contains some NAs:

df <- data.frame(x=1:3, y=c('a', NA, 'c'))

I can achieve this using R's indexing syntax like this:

df[df$y!='a',]

  x    y
  2 <NA>
  3    c

Note this returns both the NA and the value c - which is what I want.

However, when I try the same thing using subset or dplyr::filter, the NA gets stripped out:

subset(df, y!='a')

  x    y
  3    c

dplyr::filter(df, y!='a')
  x    y
  3    c

Why do subset and dplyr::filter work like this? It seems illogical to me: an NA is not the same as a, so why strip out the NA when I specify that I want all rows except those where variable y equals a?

And is there some way to change the behaviour of these functions, other than explicitly asking for NAs to get returned, i.e.

subset(df, y!='a' | is.na(y))

Thanks

Recommended answer

Your example of the "expected" behavior doesn't actually return what you display in your question. I get:

> df[df$y != 'a',]
    x    y
NA NA <NA>
3   3    c

This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns,

> df$y != 'a'
[1] FALSE    NA  TRUE

So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
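As a rough illustration of that point, the sketch below reuses the df from the question and compares plain logical indexing with indexing through which(), which keeps only the positions that are TRUE. The names keep, with_na_row and only_true are made up for this example.

```r
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# The comparison is "unknown" for the row where y is missing.
keep <- df$y != "a"

# An NA in a logical index yields a row of NAs instead of dropping the row.
with_na_row <- df[keep, ]

# which() returns only the positions that are TRUE, so the NA is discarded.
only_true <- df[which(keep), ]

print(with_na_row)
print(only_true)
```

Wrapping the condition in which() is one way to get the subset-like result from [ without changing the condition itself.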

Many people dislike this behavior, but it is what it is.

subset and dplyr::filter make a different default choice, which is to simply drop the rows where the condition is NA. That is arguably accurate-ish.

But really, the lesson here is that if your data has NAs, that just means you need to code defensively around that at all points, either by using conditions like is.na(df$y) | df$y != 'a', or as mentioned in the other answer by using %in% which is based on match.
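A minimal sketch of both defensive styles, again on the df from the question; the result names (kept_explicit, kept_in, kept_subset) are only illustrative:

```r
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# Option 1: state explicitly that rows with a missing y should be kept.
kept_explicit <- df[is.na(df$y) | df$y != "a", ]

# Option 2: %in% never returns NA (NA %in% "a" is FALSE),
# so negating it keeps the row where y is missing.
kept_in <- df[!(df$y %in% "a"), ]

# The same explicit condition also works inside subset().
kept_subset <- subset(df, is.na(y) | y != "a")

print(kept_explicit)
```

All three keep the rows with x equal to 2 and 3, that is, the NA row and the c row, without producing a row made up entirely of NAs.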

From ?base::Extract:

When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA

From ?base::subset:

missing values are taken as false [...] For ordinary vectors, the result is simply x[subset & !is.na(subset)]
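The equivalence quoted from the subset documentation can be checked on a small vector. This is only a sketch; the vector x and the names cond, via_subset, via_bracket and plain_bracket are invented for the example.

```r
x <- c(5, NA, 12, 3)

# The condition is NA wherever x itself is missing.
cond <- x > 4

# subset() treats the NA in the condition as FALSE.
via_subset <- subset(x, cond)

# The form given in the documentation for ordinary vectors.
via_bracket <- x[cond & !is.na(cond)]

# Plain [ keeps an NA element where the condition is NA.
plain_bracket <- x[cond]

print(via_subset)
print(plain_bracket)
```

Here via_subset and via_bracket hold the same two values, while plain_bracket has an extra NA in the middle.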

From ?dplyr::filter:

Unlike base subsetting with [, rows where the condition evaluates to NA are dropped
