Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter
Question
I have a dataset df and I would like to remove all rows for which variable y has the value a. Variable y also contains some NAs:
df <- data.frame(x=1:3, y=c('a', NA, 'c'))
I can achieve this using R's indexing syntax like this:
df[df$y!='a',]
x y
2 <NA>
3 c
Note that this returns both the NA and the value c, which is what I want.
However, when I try the same thing using subset or dplyr::filter, the NA gets stripped out:
subset(df, y!='a')
x y
3 c
dplyr::filter(df, y!='a')
x y
3 c
Why do subset and dplyr::filter work like this? It seems illogical to me. An NA is not the same as a, so why strip out the NA when I specify that I want all rows except those where variable y equals a?
And is there some way to change the behaviour of these functions, other than explicitly asking for the NAs to be returned, i.e.
subset(df, y!='a' | is.na(y))
Thanks
Recommended answer
Your example of the "expected" behavior doesn't actually return what you display in your question. I get:
> df[df$y != 'a',]
x y
NA NA <NA>
3 3 c
This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA is intended to mean "unknown", so df$y != 'a' returns:
> df$y != 'a'
[1] FALSE NA TRUE
So R is being told that you definitely don't want the first row and that you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row consisting entirely of NAs.
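To make that concrete, here is a minimal base R sketch using the df from the question. The names keep and res are only illustrative. It shows that the comparison produces an NA in the logical index, and that the NA in turn produces a row in which every column is NA, including x, which was 2 in the original data.

```r
# Same data as in the question
df <- data.frame(x = 1:3, y = c('a', NA, 'c'))

# Comparing NA with 'a' gives NA, not TRUE or FALSE
keep <- df$y != 'a'
print(keep)

# An NA in a logical row index yields a row made up entirely of NAs
res <- df[keep, ]
print(res)

# The original second row had x = 2, but that value is lost here
print(is.na(res$x[1]))
```

So the NA row you get back is not the original second row of df. It is a placeholder row, which is why the x value does not survive.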
Many people dislike this behavior, but it is what it is.
subset and dplyr::filter make a different default choice, which is to simply drop the rows where the condition is NA. That is arguably accurate-ish.
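If you want that NA-dropping behavior with plain indexing, one option is to wrap the condition in which(), which returns only the positions where the condition is TRUE. The sketch below is base R only and the variable names are illustrative; dplyr::filter is left out so that the example does not depend on an installed package.

```r
# Same data as in the question
df <- data.frame(x = 1:3, y = c('a', NA, 'c'))

# subset() treats an NA condition as "do not keep"
by_subset <- subset(df, y != 'a')
print(by_subset)

# which() keeps only the positions where the condition is TRUE,
# so indexing with it drops the NA case in the same way
by_which <- df[which(df$y != 'a'), ]
print(by_which)
```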
But really, the lesson here is that if your data has NAs, you need to code defensively around them at all points, either by using conditions like is.na(df$y) | df$y != 'a' or, as mentioned in the other answer, by using %in%, which is based on match.
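As a rough sketch of both defensive approaches, again with the df from the question (r1, r2 and r3 are illustrative names): the first keeps the NA rows explicitly, and the second relies on %in% returning FALSE rather than NA when the left-hand value is NA.

```r
# Same data as in the question
df <- data.frame(x = 1:3, y = c('a', NA, 'c'))

# Option 1: explicitly keep rows where y is NA
r1 <- df[is.na(df$y) | df$y != 'a', ]

# Option 2: %in% returns FALSE for NA here, so the negation is TRUE
r2 <- df[!(df$y %in% 'a'), ]

# The same %in% condition also works inside subset()
r3 <- subset(df, !(y %in% 'a'))

print(r1)
print(r2)
print(r3)
```

All three keep the original second and third rows, with the real x value of 2 preserved for the NA row. The same !(y %in% 'a') condition should carry over to dplyr::filter as well, though that call is not run here.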