Inconsistency of na.action between xtabs and aggregate in R
Question
I have the following data.frame:
x <- data.frame(A = c("Y", "Y", "Z", NA),
B = c(NA, TRUE, FALSE, TRUE),
C = c(TRUE, TRUE, NA, FALSE))
I need to use xtabs to compute the following table:
A B C
Y 1 2
Z 0 0
<NA> 1 0
I was told to use na.action = NULL, which indeed returns the table I need:
xtabs(formula = cbind(B, C) ~ A,
data = x,
addNA = TRUE,
na.action = NULL)
A B C
Y 1 2
Z 0 0
<NA> 1 0
However, na.action = na.pass returns a different table:
xtabs(formula = cbind(B, C) ~ A,
data = x,
addNA = TRUE,
na.action = na.pass)
A      B C
  Y      2
  Z    0
  <NA> 1 0
But the documentation of xtabs says:
na.action
When it is na.pass and formula has a left hand side (with counts), sum(*, na.rm = TRUE) is used instead of sum(*) for the counts.
With aggregate, na.action = na.pass returns the expected result (as does na.action = NULL):
aggregate(formula = cbind(B, C) ~ addNA(A),
data = x,
FUN = sum,
na.rm = TRUE,
na.action = na.pass) # same result with na.action = NULL
addNA(A) B C
1 Y 1 2
2 Z 0 0
3 <NA> 1 0
Although I get the table I need with xtabs, I do not understand the behavior of na.action in xtabs from the documentation. So my questions are:
- Is the behavior of na.action in xtabs consistent with the documentation? Unless I am missing something, na.action = na.pass does not result in sum(*, na.rm = TRUE).
- Is na.action = NULL documented somewhere?
- In the xtabs source code there is na.rm <- identical(naAct, quote(na.omit)) || identical(naAct, na.omit) || identical(naAct, "na.omit"). But I saw nothing for na.action = na.pass and na.action = NULL. How do na.action = na.pass and na.action = NULL work?
Recommended answer
It's difficult to give a canonical answer without describing how xtabs works. If we step through the main points of its source code, we'll see clearly what's going on.
After some basic type checking, xtabs works internally by first creating a data frame of all the variables in your formula using stats::model.frame, and it is to this call that the na.action argument is passed.
The way it does this is quite clever. xtabs first captures the call you made to it via match.call, like this:
m <- match.call(expand.dots = FALSE)
Then it strips out the arguments that don't need to be passed to stats::model.frame, like this:
m$... <- m$exclude <- m$drop.unused.levels <- m$sparse <- m$addNA <- NULL
As promised in the help file, if addNA is TRUE and na.action is missing, it now defaults to na.pass:
if (addNA && missing(na.action))
m$na.action <- quote(na.pass)
Then it changes the function to be called from xtabs to stats::model.frame, like this:
m[[1L]] <- quote(stats::model.frame)
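The call-rewriting steps above can be tried in isolation. The following is only a minimal sketch: frame_like_xtabs is a hypothetical wrapper written for illustration (it is not part of xtabs), and it assumes the question's data with A stored explicitly as a factor.

```r
# Hypothetical wrapper that imitates how the quoted xtabs code builds its
# model.frame call; it is an illustration, not the real implementation.
frame_like_xtabs <- function(formula, data, na.action, addNA = FALSE) {
  m <- match.call(expand.dots = FALSE)   # capture the call as it was written
  m$addNA <- NULL                        # drop an argument model.frame does not take
  if (addNA && missing(na.action))
    m$na.action <- quote(na.pass)        # default described in the help file
  m[[1L]] <- quote(stats::model.frame)   # swap the function being called
  eval(m, parent.frame())                # evaluate the rewritten call in the caller
}

x <- data.frame(A = factor(c("Y", "Y", "Z", NA)),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

# Force the usual default so the second call below is reproducible.
old <- options(na.action = "na.omit")
mf_pass <- frame_like_xtabs(cbind(B, C) ~ A, data = x, addNA = TRUE)
mf_omit <- frame_like_xtabs(cbind(B, C) ~ A, data = x)
options(old)

nrow(mf_pass)  # all rows kept, because na.pass was filled in
nrow(mf_omit)  # only complete rows kept, because the na.omit option applied
```

Rewriting the captured call, rather than calling model.frame directly, means the formula and data expressions are evaluated exactly as the user wrote them, in the user's environment.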
So the object m is a call (and is also a standalone reprex), which in your case looks like this:
stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L,
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE),
C = c(TRUE, TRUE, NA, FALSE)), na.action = NULL)
Note that your na.action = NULL has been passed to this call. This has the effect of keeping all NA values in the frame. When the above call is evaluated, it gives this data frame:
eval(m)
#> cbind(B, C).B cbind(B, C).C A
#> 1 NA TRUE Y
#> 2 TRUE TRUE Y
#> 3 FALSE NA Z
#> 4 TRUE FALSE <NA>
Note that this is the same result you would get if you passed na.action = na.pass:
stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L,
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE),
C = c(TRUE, TRUE, NA, FALSE)), na.action = na.pass)
#> cbind(B, C).B cbind(B, C).C A
#> 1 NA TRUE Y
#> 2 TRUE TRUE Y
#> 3 FALSE NA Z
#> 4 TRUE FALSE <NA>
However, if you passed na.action = na.omit, you would be left with a single row, since only row 2 has no NA values.
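To see the three settings side by side, the row counts of the model frame can be compared directly. This is a small sketch that assumes the question's data with A stored as a factor; rows_kept is just an illustrative name.

```r
x <- data.frame(A = factor(c("Y", "Y", "Z", NA)),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

# Build the model frame once per na.action setting and count the rows kept.
actions <- list(null = NULL, na.pass = na.pass, na.omit = na.omit)
rows_kept <- vapply(actions, function(act) {
  nrow(stats::model.frame(cbind(B, C) ~ A, data = x, na.action = act))
}, integer(1))

rows_kept  # NULL and na.pass keep every row; na.omit keeps only complete rows
```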
In any case, the "model frame" result is stored in the variable mf. This is then split into the independent variable(s) (in your case, column A) and the response variable (in your case, cbind(B, C)).
The response is stored in y and the independent variable(s) in by:
i <- attr(attr(mf, "terms"), "response")
by <- mf[-i]
y <- mf[[i]]
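As a quick illustration of this split, the same three lines can be run on a model frame built by hand. This sketch assumes the question's data with A stored as a factor and builds mf directly rather than through xtabs.

```r
x <- data.frame(A = factor(c("Y", "Y", "Z", NA)),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

mf <- stats::model.frame(cbind(B, C) ~ A, data = x, na.action = NULL)

i <- attr(attr(mf, "terms"), "response")  # position of the response in mf
by <- mf[-i]                              # data frame of grouping variables
y <- mf[[i]]                              # the cbind(B, C) response matrix

names(by)  # the grouping variable(s)
dim(y)     # one row per observation, one column per response
```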
Now, by is processed to ensure each independent variable is a factor, and that any NA values are converted into a factor level if you have specified addNA = TRUE:
by <- lapply(by, function(u) {
  if (!is.factor(u))
    u <- factor(u, exclude = exclude)
  else if (has.exclude)
    u <- factor(as.character(u), levels = setdiff(levels(u), exclude),
                exclude = NULL)
  if (addNA)
    u <- addNA(u, ifany = TRUE)
  u[, drop = drop.unused.levels]
})
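The effect of the addNA step on a grouping variable can be seen on its own. A minimal sketch using the values of column A from the question (u and u_na are illustrative names):

```r
u <- factor(c("Y", "Y", "Z", NA))
levels(u)                        # NA is not a level yet

u_na <- addNA(u, ifany = TRUE)   # promote NA to a level, only if NAs are present
levels(u_na)                     # NA now appears as a third level

# With NA as a level, the missing group gets its own cell when tabulating.
counts <- table(u_na)
counts
```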
Now we come to the crux. The na.action is used again to determine how NA values in the response variable are handled when summing. In your case, since you passed na.action = NULL, m$na.action is NULL, so naAct gets the value stored in getOption("na.action"), which, if you have never changed it, is "na.omit". This in turn causes the variable na.rm to be TRUE:
naAct <- if (!is.null(m$na.action)) {
  m$na.action
} else {
  getOption("na.action", default = quote(na.omit))
}
na.rm <- identical(naAct, quote(na.omit)) ||
  identical(naAct, na.omit) ||
  identical(naAct, "na.omit")
Note that if you had passed na.action = na.pass, then na.rm would be FALSE, as you can verify by tracing this piece of code.
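One way to trace it is to wrap the two quoted statements in a small function and feed it the unevaluated na.action from the call. This is a sketch only: na_rm_for is a hypothetical helper, and the na.action option is set explicitly to its usual default so the result does not depend on the session.

```r
# Hypothetical helper reproducing the naAct / na.rm logic quoted above.
# 'na_action_expr' stands for m$na.action, i.e. the unevaluated argument.
na_rm_for <- function(na_action_expr) {
  naAct <- if (!is.null(na_action_expr)) {
    na_action_expr
  } else {
    getOption("na.action", default = quote(na.omit))
  }
  identical(naAct, quote(na.omit)) ||
    identical(naAct, na.omit) ||
    identical(naAct, "na.omit")
}

old <- options(na.action = "na.omit")    # assume the usual default option
rm_null <- na_rm_for(NULL)               # na.action = NULL falls back to the option
rm_pass <- na_rm_for(quote(na.pass))     # na.action = na.pass matches no branch
rm_omit <- na_rm_for(quote(na.omit))     # na.action = na.omit matches directly
options(old)

c(null = rm_null, na.pass = rm_pass, na.omit = rm_omit)
```

Note that the NULL case only yields TRUE because of the session option; with a different getOption("na.action"), the same call would give FALSE.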
Finally, we come to the section where your xtabs table is built using sum inside a tapply, which is itself inside an lapply:
lapply(as.data.frame(y), tapply, by, sum, na.rm = na.rm, default = 0L)
You can see that the na.rm variable determines whether NAs are removed from each column before it is summed. The result of this lapply is then coerced into the final cross tab.
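The summing step can be reproduced by hand to see both tables from the question emerge. This sketch rebuilds by and y manually from the question's data instead of going through xtabs; sum_by is a hypothetical helper, and the default argument of tapply assumes R 3.4.0 or later.

```r
# Grouping factor with NA promoted to a level, and the response matrix.
by <- list(A = addNA(factor(c("Y", "Y", "Z", NA)), ifany = TRUE))
y <- cbind(B = c(NA, TRUE, FALSE, TRUE),
           C = c(TRUE, TRUE, NA, FALSE))

# Hypothetical helper: the lapply/tapply/sum line quoted above, with na.rm exposed.
sum_by <- function(na.rm) {
  lapply(as.data.frame(y), tapply, by, sum, na.rm = na.rm, default = 0L)
}

with_rm <- sum_by(TRUE)      # the case reached with na.action = NULL
without_rm <- sum_by(FALSE)  # the case reached with na.action = na.pass

with_rm$B      # group sums with NAs dropped first
without_rm$B   # the Y group becomes NA because one of its values is NA
without_rm$C   # the Z group becomes NA for the same reason
```

The NA sums in the second case are what print as empty cells in the table returned with na.action = na.pass.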
So how does this answer your question?
The documentation is correct that, with addNA = TRUE, na.action defaults to na.pass if you don't pass one. However, na.action is used in two places: once in the call to model.frame and once to determine the value of na.rm. It is clear from the source code that if na.action is na.pass, then na.rm will be FALSE, so the sum for any response group containing an NA value will itself be NA. This is the opposite of what is written in the help file.
The only way round this is to pass na.action = NULL, since this lets model.frame keep the NA values while also causing sum to be called with na.rm = TRUE (as long as getOption("na.action") is still at its default of "na.omit").
TL;DR The documentation for xtabs is wrong on this point.