Inconsistency of na.action between xtabs and aggregate in R

Question

I have the following data.frame:

x <- data.frame(A = c("Y", "Y", "Z", NA),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

I need to compute the following table using xtabs:

A      B C
  Y    1 2
  Z    0 0
  <NA> 1 0

I was told to use na.action = NULL, which indeed returns the table I need:

xtabs(formula = cbind(B, C) ~ A,
      data = x,
      addNA = TRUE,
      na.action = NULL)

A      B C
  Y    1 2
  Z    0 0
  <NA> 1 0

However, na.action = na.pass returns a different table:

xtabs(formula = cbind(B, C) ~ A,
      data = x,
      addNA = TRUE,
      na.action = na.pass)

A       B  C
  Y        2
  Z     0   
  <NA>  1  0

But the documentation for xtabs says:

na.action
When it is na.pass and formula has a left hand side (with counts), sum(*, na.rm = TRUE) is used instead of sum(*) for the counts.

With aggregate, na.action = na.pass returns the expected result (and also na.action = NULL):

aggregate(formula = cbind(B, C) ~ addNA(A),
          data = x,
          FUN = sum,
          na.rm = TRUE,
          na.action = na.pass) # same result with na.action = NULL

  addNA(A) B C
1            Y 1 2
2            Z 0 0
3         <NA> 1 0

Although I get the table I need with xtabs, I do not understand the behavior of na.action in xtabs from the documentation. So my questions are:

  • Is the behavior of na.action in xtabs consistent with the documentation? Unless I am missing something, na.action = na.pass does not result in sum(*, na.rm = TRUE).
  • Is na.action = NULL documented somewhere?
  • In xtabs source code there is na.rm <- identical(naAct, quote(na.omit)) || identical(naAct, na.omit) || identical(naAct, "na.omit"). But I saw nothing for na.action = na.pass and na.action = NULL. How do na.action = na.pass and na.action = NULL work?

Recommended answer

It's difficult to give a canonical answer without describing how xtabs works. If we step through the main points of its source code, we'll see clearly what's going on.

After some basic type checking, the call to xtabs works internally by first creating a data frame of all the variables contained in your formula using stats::model.frame, and it is to this that the na.action parameter is passed.

The way it does this is quite clever. xtabs first copies the call you made to it via match.call, like this:

m <- match.call(expand.dots = FALSE)

Then it strips out the arguments that don't need to be passed to stats::model.frame, like this:

m$... <- m$exclude <- m$drop.unused.levels <- m$sparse <- m$addNA <- NULL

As promised in the help file, if addNA is TRUE and na.action is missing, it will now default to na.pass:

    if (addNA && missing(na.action)) 
        m$na.action <- quote(na.pass)

Then it changes the function to be called from xtabs to stats::model.frame like this:

m[[1L]] <- quote(stats::model.frame)
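
The call rewriting in the steps above can be tried in isolation. What follows is a minimal sketch rather than the xtabs source: build_mf_call is a hypothetical helper, invented here for illustration, that applies the same kind of match.call edits and returns the rewritten call instead of evaluating it.

```r
# Hypothetical helper (not part of stats): mimics how xtabs turns its own
# call into a call to stats::model.frame.
build_mf_call <- function(formula, data, na.action, addNA = FALSE, ...) {
  m <- match.call(expand.dots = FALSE)   # copy of the call as the user wrote it
  m$... <- m$addNA <- NULL               # drop arguments model.frame does not take
  if (addNA && missing(na.action))
    m$na.action <- quote(na.pass)        # default only when na.action was not supplied
  m[[1L]] <- quote(stats::model.frame)   # swap the function being called
  m
}

x <- data.frame(A = c("Y", "Y", "Z", NA),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

# na.action = NULL is kept in the call; omitting it yields na.pass instead
m_null    <- build_mf_call(cbind(B, C) ~ A, data = x, addNA = TRUE, na.action = NULL)
m_default <- build_mf_call(cbind(B, C) ~ A, data = x, addNA = TRUE)

print(m_null)
print(m_default)
```

Printing the two calls shows the difference that matters later: an explicit na.action = NULL survives the rewrite, whereas a missing na.action is replaced by na.pass.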

So the object m is a call (and is also a standalone reprex), which in your case looks like this:

stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L, 
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE), 
C = c(TRUE, TRUE, NA, FALSE)), na.action = NULL)

Note that your na.action = NULL has been passed to this call. This has the effect of keeping all NA values in the frame. When the above call is evaluated, it gives this data frame:

eval(m)
#>   cbind(B, C).B cbind(B, C).C    A
#> 1            NA          TRUE    Y
#> 2          TRUE          TRUE    Y
#> 3         FALSE            NA    Z
#> 4          TRUE         FALSE <NA>

Note that this is the same result you would get if you passed na.action = na.pass:

stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L, 
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE), 
C = c(TRUE, TRUE, NA, FALSE)), na.action = na.pass)
#>   cbind(B, C).B cbind(B, C).C    A
#> 1            NA          TRUE    Y
#> 2          TRUE          TRUE    Y
#> 3         FALSE            NA    Z
#> 4          TRUE         FALSE <NA>

However, if you passed na.action = na.omit, you would only be left with a single row, since only row 2 has no NA values.
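
The three na.action choices can be compared side by side by calling model.frame directly. This is a small sketch using the data frame from the question; the object names mf_null, mf_pass and mf_omit are made up for the example.

```r
x <- data.frame(A = c("Y", "Y", "Z", NA),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

f <- cbind(B, C) ~ A

mf_null <- model.frame(f, data = x, na.action = NULL)     # no NA handling applied
mf_pass <- model.frame(f, data = x, na.action = na.pass)  # NAs passed through
mf_omit <- model.frame(f, data = x, na.action = na.omit)  # rows with any NA dropped

# Number of rows kept by each choice
print(c(null = nrow(mf_null), pass = nrow(mf_pass), omit = nrow(mf_omit)))
```

At this stage NULL and na.pass are indistinguishable in their effect: both keep all four rows, and only na.omit shrinks the frame.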

In any case, the "model frame" result is stored in the variable mf. This is then split into the independent variable(s), in your case column A, and the response variable, in your case cbind(B, C).

The response is stored in y and the independent variable(s) in by:

        i <- attr(attr(mf, "terms"), "response")
        by <- mf[-i]
        y <- mf[[i]]

Now, by is processed to ensure each independent variable is a factor, and that any NA values are converted into factor levels if you have specified addNA = TRUE:

    by <- lapply(by, function(u) {
        if (!is.factor(u)) 
            u <- factor(u, exclude = exclude)
        else if (has.exclude) 
            u <- factor(as.character(u), levels = setdiff(levels(u), 
                exclude), exclude = NULL)
        if (addNA) 
            u <- addNA(u, ifany = TRUE)
        u[, drop = drop.unused.levels]
    })
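
To see why the addNA step matters for the final table, here is a small sketch, separate from the xtabs source, of how grouping behaves with and without an explicit NA level. The vectors u and vals are invented for the example.

```r
u    <- factor(c("Y", "Y", "Z", NA))  # NA is a missing value, not a level
u_na <- addNA(u, ifany = TRUE)        # NA becomes an explicit factor level

vals <- c(1, 2, 3, 4)

without_level <- tapply(vals, u, sum)     # the element with NA group is dropped
with_level    <- tapply(vals, u_na, sum)  # the NA group gets its own cell

print(without_level)
print(with_level)
```

Without the extra level the fourth value never reaches any group, which is why addNA = TRUE is needed to obtain the <NA> row at all.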

Now we come to the crux. The na.action is used a second time, to decide how NA values in the response variable are handled when summing. In your case, since you passed na.action = NULL, naAct takes the value stored in getOption("na.action"), which, if you have never changed it, is "na.omit". This in turn causes the variable na.rm to be TRUE:

    naAct <- if (!is.null(m$na.action)) {
        m$na.action
    } else {
        getOption("na.action", default = quote(na.omit))
    }
    na.rm <- identical(naAct, quote(na.omit)) ||
        identical(naAct, na.omit) ||
        identical(naAct, "na.omit")

Note that if you had passed na.action = na.pass, then na.rm would be FALSE if you trace this piece of code.
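
The trace can be reproduced outside xtabs. The sketch below wraps the quoted logic in a hypothetical function, derive_na_rm, whose argument stands in for m$na.action (the unevaluated expression from the call). It assumes the snippet quoted above matches your installed R version, and it sets options(na.action = "na.omit") explicitly so the result does not depend on the session.

```r
# Hypothetical helper: same three identical() tests as the quoted snippet,
# applied to the expression that m$na.action would hold.
derive_na_rm <- function(na_action_expr) {
  naAct <- if (!is.null(na_action_expr)) {
    na_action_expr
  } else {
    getOption("na.action", default = quote(na.omit))
  }
  identical(naAct, quote(na.omit)) ||
    identical(naAct, na.omit) ||
    identical(naAct, "na.omit")
}

old <- options(na.action = "na.omit")    # R's usual default, set explicitly here

rm_null <- derive_na_rm(NULL)            # falls back to the option
rm_pass <- derive_na_rm(quote(na.pass))  # matches none of the three tests
rm_omit <- derive_na_rm(quote(na.omit))  # matches the first test

options(old)                             # restore the previous setting

print(c(null = rm_null, pass = rm_pass, omit = rm_omit))
```

Only NULL and na.omit lead to na.rm being TRUE; na.pass falls through all three comparisons.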

Finally, we come to the section where your xtabs table is built using sum inside a tapply, which is itself inside an lapply.

lapply(as.data.frame(y), tapply, by, sum, na.rm = na.rm, default = 0L)

You can see that the na.rm variable is used to determine whether to remove NAs from the columns before attempting to sum them. The result of this lapply is then coerced into the final cross tab.
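
The effect of that single flag can be checked by running the same lapply/tapply line by hand on the data from the question. This is a sketch under the assumption that by and y look the way the walkthrough describes; they are rebuilt here directly rather than extracted from a model frame.

```r
x <- data.frame(A = c("Y", "Y", "Z", NA),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

# Grouping factor with NA as an explicit level, and the two-column response
by <- list(A = addNA(factor(x$A), ifany = TRUE))
y  <- cbind(B = x$B, C = x$C)

# Same line as above, once for each possible value of na.rm
with_rm    <- lapply(as.data.frame(y), tapply, by, sum, na.rm = TRUE,  default = 0L)
without_rm <- lapply(as.data.frame(y), tapply, by, sum, na.rm = FALSE, default = 0L)

print(with_rm)     # sums ignore NA, as with na.action = NULL
print(without_rm)  # any group containing NA sums to NA, as with na.pass
```

The first result corresponds to the table the question asks for; the second has NA exactly where the na.pass table prints blank cells.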

So how does this answer your question?

The documentation is correct that, with addNA = TRUE, na.action defaults to na.pass if you don't supply one. However, na.action is used in two places: once in the call to model.frame, and once to determine the value of na.rm. The source code makes clear that if na.action is na.pass, then na.rm will be FALSE, so the sum for any group whose response contains an NA is itself NA rather than a count. This is the opposite of what is written in the help file.

The only way around this is to pass na.action = NULL: model.frame then keeps the NA values, and, as long as getOption("na.action") is still "na.omit", sum is called with na.rm = TRUE.

TL;DR The documentation for xtabs is wrong on this point.
