Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter
Question
I have a dataset df and I would like to remove all rows for which variable y has the value a. Variable y also contains some NAs:
df <- data.frame(x=1:3, y=c('a', NA, 'c'))
I can achieve this using R's indexing syntax like this:
df[df$y!='a',]
x y
2 <NA>
3 c
Note that this returns both the NA and the value c, which is what I want.
However, when I try the same thing using subset or dplyr::filter, the NA gets stripped out:
subset(df, y!='a')
x y
3 c
dplyr::filter(df, y!='a')
x y
3 c
Why do subset and dplyr::filter work like this? It seems illogical to me: an NA is not the same as a, so why strip out the NA when I specify that I want all rows except those where variable y equals a?
And is there some way to change the behaviour of these functions, other than explicitly asking for the NAs to be returned, i.e.
subset(df, y!='a' | is.na(y))
Thanks
Recommended answer
Your example of the "expected" behavior doesn't actually return what you display in your question. I get:
> df[df$y != 'a',]
x y
NA NA <NA>
3 3 c
This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns:
> df$y != 'a'
[1] FALSE NA TRUE
So R is being told that you definitely don't want the first row and you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
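A minimal sketch of this step by step, reusing the df from the question (the names cond and res are only illustrative):

```r
# Same toy data as in the question
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# Comparing against NA yields NA, not TRUE or FALSE
cond <- df$y != "a"
print(cond)

# An NA in a logical row index produces a row filled with NA values
res <- df[cond, ]
print(res)

# which() returns only the positions that are definitely TRUE
print(which(cond))
```

Indexing with which(cond) instead of cond is one way to make [ drop the unknown rows, much like subset does.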
Many people dislike this behavior, but it is what it is.
subset and dplyr::filter make a different default choice, which is to simply drop the rows where the condition is NA. That is arguably accurate-ish.
But really, the lesson here is that if your data has NAs, you need to code defensively around them at all points, either by using conditions like is.na(df$y) | df$y != 'a', or, as mentioned in the other answer, by using %in%, which is based on match.
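As a rough illustration of both defensive options on the question's data (the variable names keep_explicit, keep_in and keep_subset are made up for this sketch):

```r
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# Option 1: state explicitly that rows with NA in y should be kept
keep_explicit <- df[is.na(df$y) | df$y != "a", ]

# Option 2: %in% never returns NA, so negating it keeps the NA row
keep_in <- df[!(df$y %in% "a"), ]

# The same condition also works inside subset()
keep_subset <- subset(df, !(y %in% "a"))

print(keep_explicit)
```

All three should keep rows 2 and 3, with the NA in y preserved and no all-NA row introduced.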
From ?base::Extract:
When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA
From ?base::subset:
missing values are taken as false [...] For ordinary vectors, the result is simply x[subset & !is.na(subset)]
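The documented rule can be checked on a small vector; this is only a sketch, with x and cond chosen arbitrarily for illustration:

```r
x <- c(10, 20, 30)
cond <- c(FALSE, NA, TRUE)

# Plain indexing keeps an NA placeholder for the unknown element
plain <- x[cond]

# subset() treats NA as FALSE, matching the documented expression
via_subset <- subset(x, cond)
via_rule <- x[cond & !is.na(cond)]

print(plain)
print(via_subset)
```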
From ?dplyr::filter:
Unlike base subsetting with [, rows where the condition evaluates to NA are dropped
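A short sketch of the same behavior with dplyr, assuming the dplyr package is installed (the guard simply skips the example otherwise; dropped and kept are illustrative names):

```r
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

if (requireNamespace("dplyr", quietly = TRUE)) {
  # The condition is NA for row 2, so filter() drops it
  dropped <- dplyr::filter(df, y != "a")

  # Asking for the NA rows explicitly keeps row 2
  kept <- dplyr::filter(df, y != "a" | is.na(y))

  print(dropped)
  print(kept)
}
```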