How to debug "contrasts can be applied only to factors with 2 or more levels" error?

Problem description

Here are all the variables I'm working with:

str(ad.train)
$ Date                : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24 29 31 34 ...
 $ Team                : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Season              : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Round               : Factor w/ 28 levels "EF","GF","PF",..: 5 16 21 22 23 24 25 26 27 6 ...
 $ Score               : int  137 82 84 96 110 99 122 124 49 111 ...
 $ Margin              : int  69 18 -56 46 19 5 50 69 -26 29 ...
 $ WinLoss             : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 1 2 ...
 $ Opposition          : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 8 18 10 9 13 16 7 3 4 6 ...
 $ Venue               : Factor w/ 19 levels "Adelaide Oval",..: 4 7 10 7 7 13 7 6 7 15 ...
 $ Disposals           : int  406 360 304 370 359 362 365 345 324 351 ...
 $ Kicks               : int  252 215 170 225 221 218 224 230 205 215 ...
 $ Marks               : int  109 102 52 41 95 78 93 110 69 85 ...
 $ Handballs           : int  154 145 134 145 138 144 141 115 119 136 ...
 $ Goals               : int  19 11 12 13 16 15 19 19 6 17 ...
 $ Behinds             : int  19 14 9 16 11 6 7 9 12 6 ...
 $ Hitouts             : int  42 41 34 47 45 70 48 54 46 34 ...
 $ Tackles             : int  73 53 51 76 65 63 65 67 77 58 ...
 $ Rebound50s          : int  28 34 23 24 32 48 39 31 34 29 ...
 $ Inside50s           : int  73 49 49 56 61 45 47 50 49 48 ...
 $ Clearances          : int  39 33 38 52 37 43 43 48 37 52 ...
 $ Clangers            : int  47 38 44 62 49 46 32 24 31 41 ...
 $ FreesFor            : int  15 14 15 18 17 15 19 14 18 20 ...
 $ ContendedPossessions: int  152 141 149 192 138 164 148 151 160 155 ...
 $ ContestedMarks      : int  10 16 11 3 12 12 17 14 15 11 ...
 $ MarksInside50       : int  16 13 10 8 12 9 14 13 6 12 ...
 $ OnePercenters       : int  42 54 30 58 24 56 32 53 50 57 ...
 $ Bounces             : int  1 6 4 4 1 7 11 14 0 4 ...
 $ GoalAssists         : int  15 6 9 10 9 12 13 14 5 14 ...

Here's the glm I'm trying to fit:

ad.glm.all <- glm(WinLoss ~ factor(Team) + Season  + Round + Score  + Margin + Opposition + Venue + Disposals + Kicks + Marks + Handballs + Goals + Behinds + Hitouts + Tackles + Rebound50s + Inside50s+ Clearances+ Clangers+ FreesFor + ContendedPossessions + ContestedMarks + MarksInside50 + OnePercenters + Bounces+GoalAssists, 
                  data = ad.train, family = binomial(logit))

I know it's a lot of variables (the plan is to reduce them via forward variable selection). But even though there are a lot of variables, they're either int or Factor, which as I understand things should just work with a glm. However, every time I try to fit this model I get:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Which sort of looks to me as if R isn't treating my Factor variables as Factor variables for some reason?

Even something as simple as:

ad.glm.test <- glm(WinLoss ~ factor(Team), data = ad.train, family = binomial(logit))

isn't working! (same error message)

Whereas:

ad.glm.test <- glm(WinLoss ~ Clearances, data = ad.train, family = binomial(logit))

will work!

Anyone know what's going on here? Why can't I fit these Factor variables to my glm??

Thanks in advance!

-Troy

Recommended answer

Introduction

What a "contrasts error" is has been well explained: you have a factor that has only one level (or fewer). But in reality this simple fact can be easily obscured, because the data actually used for model fitting can be very different from what you've passed in. This happens when you have NA in your data, you've subsetted your data, a factor has unused levels, or you've transformed your variables and get NaN somewhere. You are rarely in the ideal situation where a single-level factor can be spotted from str(your_data_frame) directly. Many questions on StackOverflow regarding this error are not reproducible, so suggestions by people may or may not work. Therefore, although there are by now 118 posts regarding this issue, users still can't find an adaptive solution, and the question is raised again and again. This answer is my attempt to solve this matter "once and for all", or at least to provide a reasonable guide.

This answer has rich information, so let me first make a quick summary.

I defined 3 helper functions for you: debug_contr_error, debug_contr_error2, NA_preproc.

I recommend you use them in the following way.

  1. run NA_preproc to get more complete cases;
  2. run your model, and if you get a "contrasts error", use debug_contr_error2 for debugging.
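
The "run your model, then debug on failure" part of this workflow can be sketched with base R alone. This is only an illustration on made-up data: `dat`, `res`, `failed` and `mf` are hypothetical names, and the level count is read straight from the model frame rather than from the helper functions of this answer.

```r
## hypothetical toy data: the only "b" row has NA in `x`
dat <- data.frame(y = 1:4,
                  x = c(1, 2, 3, NA),
                  f = factor(c("a", "a", "a", "b")))

## run the model, but capture an error instead of stopping
res <- tryCatch(lm(y ~ x + f, data = dat), error = function(e) e)
failed <- inherits(res, "error")
failed
if (failed) conditionMessage(res)

## on failure, look at the factor levels that survive NA removal
mf <- model.frame(y ~ x + f, data = dat, drop.unused.levels = TRUE)
nlevels(mf$f)
```

If the fit fails and a factor in the model frame is left with fewer than two levels, that factor is the one to investigate.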

Most of the answer shows you step by step how & why these functions are defined. There is probably no harm in skipping the development process, but don't skip the sections from "Reproducible case studies and Discussions".

The original answer works perfectly for OP, and has successfully helped some others. But it has failed elsewhere for lack of adaptiveness. Look at the output of str(ad.train) in the question. OP's variables are numeric or factors; there are no characters. The original answer was for this situation. If you have character variables, although they will be coerced to factors during lm and glm fitting, they won't be reported by the code: they were not provided as factors, so is.factor will miss them. In this expansion I will make the original answer more adaptive.

Let dat be your dataset passed to lm or glm. If you don't readily have such a data frame, that is, all your variables are scattered in the global environment, you need to gather them into a data frame. The following may not be the best way but it works.

## `form` is your model formula, here is an example
y <- x1 <- x2 <- x3 <- 1:4
x4 <- matrix(1:8, 4)
form <- y ~ bs(x1) + poly(x2) + I(1 / x3) + x4

## to gather variables `model.frame.default(form)` is the easiest way 
## but it does too much: it drops `NA` and transforms variables
## we want something more primitive

## first get variable names
vn <- all.vars(form)
#[1] "y"  "x1" "x2" "x3" "x4"

## `get_all_vars(form)` gets you a data frame
## but it is buggy for matrix variables so don't use it
## instead, first use `mget` to gather variables into a list
lst <- mget(vn)

## don't do `data.frame(lst)`; it is buggy with matrix variables
## need to first protect matrix variables by `I()` then do `data.frame`
lst_protect <- lapply(lst, function (x) if (is.matrix(x)) I(x) else x)
dat <- data.frame(lst_protect)
str(dat)
#'data.frame':  4 obs. of  5 variables:
# $ y : int  1 2 3 4
# $ x1: int  1 2 3 4
# $ x2: int  1 2 3 4
# $ x3: int  1 2 3 4
# $ x4: 'AsIs' int [1:4, 1:2] 1 2 3 4 5 6 7 8

## note the 'AsIs' for matrix variable `x4`
## in comparison, try the following buggy ones yourself
str(get_all_vars(form))
str(data.frame(lst))

Step 0: explicit subsetting

If you've used the subset argument of lm or glm, start by an explicit subsetting:

## `subset_vec` is what you pass to `lm` via `subset` argument
## it can either be a logical vector of length `nrow(dat)`
## or a shorter positive integer vector giving position index
## note however, `base::subset` expects logical vector for `subset` argument
## so a rigorous check is necessary here
if (mode(subset_vec) == "logical") {
  if (length(subset_vec) != nrow(dat)) {
    stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
    }
  subset_log_vec <- subset_vec
  } else if (mode(subset_vec) == "numeric") {
  ## check range
  ran <- range(subset_vec)
  if (ran[1] < 1 || ran[2] > nrow(dat)) {
    stop("'numeric' `subset_vec` provided but values are out of bound")
    } else {
    subset_log_vec <- logical(nrow(dat))
    subset_log_vec[as.integer(subset_vec)] <- TRUE
    } 
  } else {
  stop("`subset_vec` must be either 'logical' or 'numeric'")
  }
dat <- base::subset(dat, subset = subset_log_vec)

Step 1: remove incomplete cases

dat <- na.omit(dat)

Do not skip this step even if you've gone through step 0: subset only drops rows where the subsetting condition itself is NA; it does not remove rows that have missing values in other columns.
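
How `subset` and `na.omit` treat missing values can be checked directly. The sketch below uses made-up data, and `kept` and `cleaned` are illustrative names. In it, `subset` keeps a row with `NA` in a column as long as the subsetting condition for that row is `TRUE`, and the row only disappears once `na.omit` is applied.

```r
## hypothetical toy data: row 2 has a missing value in `x`
dat <- data.frame(y = 1:3, x = c(1, NA, 3))

## explicit subsetting with a logical vector that keeps every row
kept <- base::subset(dat, subset = c(TRUE, TRUE, TRUE))
nrow(kept)     # the incomplete row is still present
anyNA(kept)

## `na.omit` is what removes incomplete cases
cleaned <- na.omit(kept)
nrow(cleaned)
```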

Step 2: mode checking and conversion

A data frame column is usually an atomic vector, with a mode from the following: "logical", "numeric", "complex", "character", "raw". For regression, variables of different modes are handled differently.

"logical",   it depends
"numeric",   nothing to do
"complex",   not allowed by `model.matrix`, though allowed by `model.frame`
"character", converted to "numeric" with "factor" class by `model.matrix`
"raw",       not allowed by `model.matrix`, though allowed by `model.frame`

A logical variable is tricky. It can either be treated as a dummy variable (1 for TRUE; 0 for FALSE) hence a "numeric", or it can be coerced to a two-level factor. It all depends on whether model.matrix thinks a "to-factor" coercion is necessary from the specification of your model formula. For simplicity we can understand it as such: it is always coerced to a factor, but the result of applying contrasts may end up with the same model matrix as if it were handled as a dummy directly.

Some people may wonder why "integer" is not included. That is because an integer vector, like 1:4, has "numeric" mode (try mode(1:4)).

A data frame column may also be a matrix with "AsIs" class, but such a matrix must have "numeric" mode.
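
The mode of each kind of column can be inspected directly. The following is a small sketch on made-up vectors; `modes` and `X` are illustrative names only.

```r
## mode of some typical data frame columns
modes <- vapply(list(integer   = 1:4,
                     double    = c(1.5, 2),
                     logical   = c(TRUE, FALSE),
                     character = c("a", "b"),
                     factor    = factor(letters[1:4])),
                mode, character(1))
modes   # integer, double and factor all report "numeric"

## a matrix protected by `I()` carries the "AsIs" class
X <- I(matrix(1:4, 2))
class(X)
mode(X)
```

A factor reports "numeric" because it is stored as integer codes plus a "levels" attribute, which is why a mode check alone cannot separate factors from genuinely numeric columns.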

Our check raises an error when

  • a "complex" or "raw" variable is found;
  • a "logical" or "character" matrix variable is found;

and otherwise proceeds to convert "logical" and "character" variables to "numeric" variables of "factor" class.

## get mode of all vars
var_mode <- sapply(dat, mode)

## produce error if complex or raw is found
if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")

## get class of all vars
var_class <- sapply(dat, class)

## produce error if an "AsIs" object has "logical" or "character" mode
if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
  stop("matrix variables with 'AsIs' class must be 'numeric'")
  }

## identify columns that needs be coerced to factors
ind1 <- which(var_mode %in% c("logical", "character"))

## coerce logical / character to factor with `as.factor`
dat[ind1] <- lapply(dat[ind1], as.factor)

Note that if a data frame column is already a factor variable, it will not be included in ind1, as a factor variable has "numeric" mode (try mode(factor(letters[1:4]))).

Step 3: drop unused factor levels

We won't have unused factor levels for factor variables converted from step 2, i.e., those indexed by ind1. However, factor variables that come with dat might have unused levels (often as the result of step 0 and step 1). We need to drop any possible unused levels from them.

## index of factor columns
fctr <- which(sapply(dat, is.factor))

## factor variables that have skipped explicit conversion in step 2
## don't do `ind2 <- fctr[-ind1]`: it removes by position within `fctr`
## rather than by column index, and is buggy if `ind1` is `integer(0)`
ind2 <- setdiff(fctr, ind1)

## drop unused levels
dat[ind2] <- lapply(dat[ind2], droplevels)

Step 4: summarize factor variables

Now we are ready to see what and how many factor levels are actually used by lm or glm:

## export factor levels actually used by `lm` and `glm`
lev <- lapply(dat[fctr], levels)

## count number of levels
nl <- lengths(lev)

To make your life easier, I wrap up those steps into a function debug_contr_error.

Input:

  • dat is your data frame passed to lm or glm via data argument;
  • subset_vec is the index vector passed to lm or glm via subset argument.

Output: a list with

  • nlevels (a vector) gives the number of factor levels for all factor variables;
  • levels (a list) gives levels for all factor variables.

The function produces a warning, if there are no complete cases or no factor variables to summarize.

debug_contr_error <- function (dat, subset_vec = NULL) {
  if (!is.null(subset_vec)) {
    ## step 0
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' `subset_vec` provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("`subset_vec` must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    }
  ## step 1: always run, since `subset` does not drop rows with NA in other columns
  dat <- stats::na.omit(dat)
  if (nrow(dat) == 0L) warning("no complete cases")
  ## step 2
  var_mode <- sapply(dat, mode)
  if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
  var_class <- sapply(dat, class)
  if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
    stop("matrix variables with 'AsIs' class must be 'numeric'")
    }
  ind1 <- which(var_mode %in% c("logical", "character"))
  dat[ind1] <- lapply(dat[ind1], as.factor)
  ## step 3
  fctr <- which(sapply(dat, is.factor))
  if (length(fctr) == 0L) warning("no factor variables to summarize")
  ind2 <- setdiff(fctr, ind1)
  dat[ind2] <- lapply(dat[ind2], base::droplevels.factor)
  ## step 4
  lev <- lapply(dat[fctr], base::levels.default)
  nl <- lengths(lev)
  ## return
  list(nlevels = nl, levels = lev)
  }

Here is a constructed tiny example.

dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

#  y  x f1 f2
#1 1  1  a  A
#2 2  2  a  A
#3 3  3  b  A
#4 4 NA  b  B

str(dat)
#'data.frame':  4 obs. of  4 variables:
# $ y : int  1 2 3 4
# $ x : int  1 2 3 NA
# $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
# $ f2: chr  "A" "A" "A" "B"

lm(y ~ x + f1 + f2, dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

Good, we see an error. Now my debug_contr_error exposes that f2 ends up with a single level.

debug_contr_error(dat)
#$nlevels
#f1 f2 
# 2  1 
#
#$levels
#$levels$f1
#[1] "a" "b"
#
#$levels$f2
#[1] "A"

Note that the original short answer is hopeless here, as f2 is provided as a character variable not a factor variable.

## old answer
tmp <- na.omit(dat)
fctr <- lapply(tmp[sapply(tmp, is.factor)], droplevels)
sapply(fctr, nlevels)
#f1 
# 2 
rm(tmp, fctr)

Now let's see an example with a matrix variable X.

dat <- data.frame(X = I(rbind(matrix(1:6, 3), NA)),
                  f = c("a", "a", "a", "b"),
                  y = 1:4)

dat
#  X.1 X.2 f y
#1   1   4 a 1
#2   2   5 a 2
#3   3   6 a 3
#4  NA  NA b 4

str(dat)
#'data.frame':  4 obs. of  3 variables:
# $ X: 'AsIs' int [1:4, 1:2] 1 2 3 NA 4 5 6 NA
# $ f: Factor w/ 2 levels "a","b": 1 1 1 2
# $ y: int  1 2 3 4

lm(y ~ X + f, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

debug_contr_error(dat)$nlevels
#f 
#1

Note that a factor variable with no levels can cause a "contrasts error", too. You may wonder how a 0-level factor is possible. Well, it is legitimate: nlevels(factor(character(0))). Here you will end up with 0-level factors if you have no complete cases.

dat <- data.frame(y = 1:4,
                  x = rep(NA_real_, 4),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

lm(y ~ x + f1 + f2, dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

debug_contr_error(dat)$nlevels
#f1 f2 
# 0  0    ## all values are 0
#Warning message:
#In debug_contr_error(dat) : no complete cases

Finally, let's see a situation where f2 is a logical variable.

dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c(TRUE, TRUE, TRUE, FALSE))

dat
#  y  x f1    f2
#1 1  1  a  TRUE
#2 2  2  a  TRUE
#3 3  3  b  TRUE
#4 4 NA  b FALSE

str(dat)
#'data.frame':  4 obs. of  4 variables:
# $ y : int  1 2 3 4
# $ x : int  1 2 3 NA
# $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
# $ f2: logi  TRUE TRUE TRUE FALSE

Our debugger will predict a "contrasts error", but will it really happen?

debug_contr_error(dat)$nlevels
#f1 f2 
# 2  1 

No, at least this one does not fail (the NA coefficient is due to the rank-deficiency of the model; don't worry):

lm(y ~ x + f1 + f2, data = dat)
#Coefficients:
#(Intercept)            x          f1b       f2TRUE  
#          0            1            0           NA

It is difficult for me to come up with an example giving an error, but there is also no need. In practice, we don't use the debugger for prediction; we use it when we really get an error; and in that case, the debugger can locate the offending factor variable.

Perhaps some may argue that a logical variable is no different from a dummy. But try the simple example below: it does depend on your formula.

u <- c(TRUE, TRUE, FALSE, FALSE)
v <- c(1, 1, 0, 0)  ## "numeric" dummy of `u`

model.matrix(~ u)
#  (Intercept) uTRUE
#1           1     1
#2           1     1
#3           1     0
#4           1     0

model.matrix(~ v)
#  (Intercept) v
#1           1 1
#2           1 1
#3           1 0
#4           1 0

model.matrix(~ u - 1)
#  uFALSE uTRUE
#1      0     1
#2      0     1
#3      1     0
#4      1     0

model.matrix(~ v - 1)
#  v
#1 1
#2 1
#3 0
#4 0

More flexible implementation using "model.frame" method of lm

You are also advised to go through R: how to debug "factor has new levels" error for linear model and prediction, which explains what lm and glm do under the hood on your dataset. You will understand that steps 0 to 4 listed above are just trying to mimic such internal process. Remember, the data that are actually used for model fitting can be very different from what you've passed in.

Our steps are not completely consistent with such internal processing. For a comparison, you can retrieve the result of the internal processing by using method = "model.frame" in lm and glm. Try this on the previously constructed tiny example dat where f2 is a character variable.

dat_internal <- lm(y ~ x + f1 + f2, dat, method = "model.frame")

dat_internal
#  y x f1 f2
#1 1 1  a  A
#2 2 2  a  A
#3 3 3  b  A

str(dat_internal)
#'data.frame':  3 obs. of  4 variables:
# $ y : int  1 2 3
# $ x : int  1 2 3
# $ f1: Factor w/ 2 levels "a","b": 1 1 2
# $ f2: chr  "A" "A" "A"
## [.."terms" attribute is truncated..]

In practice, model.frame will only perform step 0 and step 1. It also drops variables provided in your dataset but not in your model formula. So a model frame may have both fewer rows and columns than what you feed lm and glm. Type coercing as done in our step 2 is done by the later model.matrix where a "contrasts error" may be produced.
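
The two stages can be separated by hand to see where the error arises. This is a sketch on made-up data, calling `model.frame` and `model.matrix` directly rather than going through `lm`; `mf` and `res` are illustrative names.

```r
## hypothetical toy data: the only "B" row has NA in `x`
dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

## stage 1: the model frame is built without complaint;
## the incomplete row is dropped and `f2` is still a character column
mf <- model.frame(y ~ x + f2, data = dat)
nrow(mf)
class(mf$f2)

## stage 2: the model matrix coerces `f2` to a factor and applies contrasts;
## with a single level left, this is where the error is raised
res <- tryCatch(model.matrix(attr(mf, "terms"), mf),
                error = function(e) "error in model.matrix")
res
```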

There are a few advantages to first get this internal model frame, then pass it to debug_contr_error (so that it only essentially performs steps 2 to 4).

Advantage 1: variables not used in the model formula are ignored

## no variable `f1` in formula
dat_internal <- lm(y ~ x + f2, dat, method = "model.frame")

## compare the following
debug_contr_error(dat)$nlevels
#f1 f2 
# 2  1 

debug_contr_error(dat_internal)$nlevels
#f2 
# 1 

Advantage 2: able to cope with transformed variables

It is valid to transform variables in the model formula, and model.frame will record the transformed ones instead of the original ones. Note that, even if your original variable has no NA, the transformed one can have some.

dat <- data.frame(y = 1:4, x = c(1:3, -1), f = rep(letters[1:2], c(3, 1)))
#  y  x f
#1 1  1 a
#2 2  2 a
#3 3  3 a
#4 4 -1 b

lm(y ~ log(x) + f, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels
#In addition: Warning message:
#In log(x) : NaNs produced

# directly using `debug_contr_error` is hopeless here
debug_contr_error(dat)$nlevels
#f 
#2 

## this works
dat_internal <- lm(y ~ log(x) + f, data = dat, method = "model.frame")
#  y    log(x) f
#1 1 0.0000000 a
#2 2 0.6931472 a
#3 3 1.0986123 a

debug_contr_error(dat_internal)$nlevels
#f 
#1

Given these benefits, I write another function wrapping up model.frame and debug_contr_error.

Input:

  • form is your model formula;
  • dat is the dataset passed to lm or glm via data argument;
  • subset_vec is the index vector passed to lm or glm via subset argument.

Output: a list with

  • mf (a data frame) gives the model frame (with "terms" attribute dropped);
  • nlevels (a vector) gives the number of factor levels for all factor variables;
  • levels (a list) gives levels for all factor variables.

## note: this function relies on `debug_contr_error`
debug_contr_error2 <- function (form, dat, subset_vec = NULL) {
  ## step 0
  if (!is.null(subset_vec)) {
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' `subset_vec` provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("`subset_vec` must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    }
  ## step 0 and 1
  dat_internal <- stats::lm(form, data = dat, method = "model.frame")
  attr(dat_internal, "terms") <- NULL
  ## rely on `debug_contr_error` for steps 2 to 4
  c(list(mf = dat_internal), debug_contr_error(dat_internal, NULL))
  }

Try the previous log transform example.

debug_contr_error2(y ~ log(x) + f, dat)
#$mf
#  y    log(x) f
#1 1 0.0000000 a
#2 2 0.6931472 a
#3 3 1.0986123 a
#
#$nlevels
#f 
#1 
#
#$levels
#$levels$f
#[1] "a"
#
#
#Warning message:
#In log(x) : NaNs produced

Also try subset_vec.

## or: debug_contr_error2(y ~ log(x) + f, dat, c(T, F, T, T))
debug_contr_error2(y ~ log(x) + f, dat, c(1,3,4))
#$mf
#  y   log(x) f
#1 1 0.000000 a
#3 3 1.098612 a
#
#$nlevels
#f 
#1 
#
#$levels
#$levels$f
#[1] "a"
#
#
#Warning message:
#In log(x) : NaNs produced

Model fitting per group, NA as factor levels

If you are fitting models by group, you are more likely to get a "contrasts error". You need to

  1. split your data frame by the grouping variable (see ?split.data.frame);
  2. work through those data frames one by one, applying debug_contr_error2 (the lapply function can be helpful for this loop).
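
The split-then-loop pattern might look like the sketch below. To keep it self-contained, it uses a small stand-in helper, `count_levels` (a hypothetical name, not one of the helper functions of this answer), which only counts the distinct values of each factor or character column in the model frame. The data are made up.

```r
## hypothetical data: group "g2" only ever sees level "a" of `f`
dat <- data.frame(g = rep(c("g1", "g2"), each = 4),
                  y = c(1.2, 2.3, 2.9, 4.1, 1.1, 2.2, 3.1, 3.8),
                  f = factor(c("a", "b", "a", "b", "a", "a", "a", "a")))

## stand-in helper: number of distinct values per factor / character
## variable that actually reaches the model frame
count_levels <- function(form, d) {
  mf <- model.frame(form, data = d, drop.unused.levels = TRUE)
  is_f <- vapply(mf, function(v) is.factor(v) || is.character(v), logical(1))
  vapply(mf[is_f], function(v) length(unique(v)), integer(1))
}

## 1. split by the grouping variable; 2. loop over the pieces
by_group <- lapply(split(dat, dat$g), count_levels, form = y ~ f)
by_group
```

Any group whose count is below 2 for some variable is a group where fitting `y ~ f` would raise the "contrasts error".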

Some also told me that they cannot use na.omit on their data, because it would leave too few rows to do anything sensible. This can be relaxed. In practice it is the NA_integer_ and NA_real_ that have to be omitted, but NA_character_ can be retained: just add NA as a factor level. To achieve this, you need to loop through the variables in your data frame:

  • if a variable x is already a factor and anyNA(x) is TRUE, do x <- addNA(x). The "and" is important. If x has no NA, addNA(x) will add an unused <NA> level.
  • if a variable x is a character, do x <- factor(x, exclude = NULL) to coerce it to a factor. exclude = NULL will retain <NA> as a level.
  • if x is "logical", "numeric", "raw" or "complex", nothing should be changed. NA is just NA.

<NA> factor level will not be dropped by droplevels or na.omit, and it is valid for building a model matrix. Check the following examples.

## x is a factor with NA

x <- factor(c(letters[1:4], NA))  ## default: `exclude = NA`
#[1] a    b    c    d    <NA>     ## there is an NA value
#Levels: a b c d                  ## but NA is not a level

na.omit(x)  ## NA is gone
#[1] a b c d
#[.. attributes truncated..]
#Levels: a b c d

x <- addNA(x)  ## now add NA into a valid level
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>  ## it appears here

droplevels(x)    ## it can not be dropped
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>

na.omit(x)  ## it is not omitted
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>

model.matrix(~ x)   ## and it is valid to be in a design matrix
#  (Intercept) xb xc xd xNA
#1           1  0  0  0   0
#2           1  1  0  0   0
#3           1  0  1  0   0
#4           1  0  0  1   0
#5           1  0  0  0   1

## x is a character with NA

x <- c(letters[1:4], NA)
#[1] "a" "b" "c" "d" NA 

as.factor(x)  ## this calls `factor(x)` with default `exclude = NA`
#[1] a    b    c    d    <NA>     ## there is an NA value
#Levels: a b c d                  ## but NA is not a level

factor(x, exclude = NULL)      ## we want `exclude = NULL`
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>          ## now NA is a level

Once you add NA as a level in a factor / character, your dataset might suddenly have more complete cases. Then you can run your model. If you still get a "contrasts error", use debug_contr_error2 to see what has happened.
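
The gain in complete cases can be seen on a tiny made-up data frame. In this sketch `n_before` and `n_after` are illustrative names, and `addNA` is applied to a single column by hand.

```r
## hypothetical toy data: `f` has one missing value
dat <- data.frame(y = 1:4,
                  f = factor(c("a", "b", NA, "a")))

n_before <- sum(complete.cases(dat))   # the row with NA in `f` is incomplete

## turn NA into a proper factor level
dat$f <- addNA(dat$f)

n_after <- sum(complete.cases(dat))    # that row now counts as complete
c(before = n_before, after = n_after)
levels(dat$f)
```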

For your convenience, I write a function for this NA preprocessing.

Input:

  • dat is your full dataset.

Output:

  • a data frame, with NA added as a level for factor / character.

NA_preproc <- function (dat) {
  for (j in seq_len(ncol(dat))) {
    x <- dat[[j]]
    if (is.factor(x) && anyNA(x)) dat[[j]] <- base::addNA(x)
    if (is.character(x)) dat[[j]] <- factor(x, exclude = NULL)
    }
  dat
  }

Reproducible case studies and Discussions

The following are specially selected for reproducible case studies, as I just answered them with the three helper functions created here.

There are also a few other good-quality threads solved by other StackOverflow users:

  • Factors not being recognised in a lm using map() (this is about model fitting by group)
  • How to drop NA observation of factors conditionally when doing linear regression in R? (this is similar to case 1 in the previous list)
  • Factor/level error in mixed model (another post about model fitting by group)

This answer aims to debug the "contrasts error" during model fitting. However, this error can also turn up when using predict for prediction. Such behavior is not with predict.lm or predict.glm, but with predict methods from some packages. Here are a few related threads on StackOverflow.

Also note that the philosophy of this answer is based on that of lm and glm. These two functions are a coding standard for many model fitting routines, but maybe not all model fitting routines behave similarly. For example, in the following case it is not clear to me whether my helper functions would actually be helpful.

Although a bit off-topic, it is still useful to know that sometimes a "contrasts error" merely comes from writing a wrong piece of code. In the following examples, OP passed the name of their variables rather than their values to lm. Since a name is a single value character, it is later coerced to a single-level factor and causes the error.

In practice people want to know how to resolve this matter, either at a statistical level or a programming level.

If you are fitting models on your complete dataset, then there is probably no statistical solution, unless you can impute missing values or collect more data. Thus you may simply turn to a coding solution to drop the offending variable. debug_contr_error2 returns nlevels which helps you easily locate them. If you don't want to drop them, replace them by a vector of 1 (as explained in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?) and let lm or glm deal with the resulting rank-deficiency.
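
Dropping the offending variable can be automated. The following is a minimal sketch, assuming the offending terms are plain factor columns named in the formula (not transformed terms or interactions); `bad` and `form_new` are illustrative names, and the level counts are computed from the model frame directly.

```r
## hypothetical toy data: `f` collapses to one level once the NA row is dropped
dat <- data.frame(y = c(1.0, 2.1, 2.9, 4.2),
                  x = c(1, 2, 3, NA),
                  f = factor(c("a", "a", "a", "b")))
form <- y ~ x + f

## count the factor levels that survive in the model frame
mf <- model.frame(form, data = dat, drop.unused.levels = TRUE)
is_f <- vapply(mf, is.factor, logical(1))
nl <- vapply(mf[is_f], nlevels, integer(1))
bad <- names(nl)[nl < 2L]

## remove the offending terms from the formula, then fit
form_new <- form
if (length(bad) > 0L) {
  form_new <- update(form, paste(". ~ . -", paste(bad, collapse = " - ")))
}
form_new
fit <- lm(form_new, data = dat)
coef(fit)
```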

If you are fitting models on a subset, there can be statistical solutions.

Fitting models by group does not necessarily require splitting your dataset by group and fitting independent models. The following may give you a rough idea:

If you do split your data explicitly, you can easily get a "contrasts error", and thus have to adjust your model formula per group (that is, you need to dynamically generate model formulae). A simpler solution is to skip building a model for this group.
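
Skipping a group can be as simple as catching the error and discarding that group's result. This is a sketch on made-up data; `fit_or_null` is a hypothetical helper name. Note that it swallows every error, not only the "contrasts error", which is an assumption worth tightening in real code.

```r
## hypothetical data: group "g2" only ever sees level "a" of `f`
dat <- data.frame(g = rep(c("g1", "g2"), each = 4),
                  y = c(1.2, 2.3, 2.9, 4.1, 1.1, 2.2, 3.1, 3.8),
                  f = factor(c("a", "b", "a", "b", "a", "a", "a", "a")))

## fit one model per group; return NULL when the fit raises an error
fit_or_null <- function(d) {
  tryCatch(lm(y ~ f, data = d), error = function(e) NULL)
}
fits <- lapply(split(dat, dat$g), fit_or_null)

## keep only the groups where a model could be built
fits <- Filter(Negate(is.null), fits)
names(fits)
```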

You may also randomly partition your dataset into a training subset and a testing subset so that you can do cross-validation. R: how to debug "factor has new levels" error for linear model and prediction briefly mentions this, and you'd better do a stratified sampling to ensure the success of both model estimation on the training part and prediction on the testing part.
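
One simple way to stratify, sketched below in base R on made-up data, is to sample row indices within each factor level so that every level lands in both parts. The half-and-half proportion and the names `idx_by_level`, `train_idx`, `train` and `test` are assumptions for illustration; a level with a single observation cannot appear in both parts.

```r
set.seed(1)
## hypothetical data with an unbalanced factor
dat <- data.frame(y = rnorm(12),
                  f = factor(rep(c("a", "b", "c"), times = c(6, 4, 2))))

## row indices grouped by factor level
idx_by_level <- split(seq_len(nrow(dat)), dat$f)

## within each level, send about half of the rows (at least one) to training;
## indexing via `sample.int` avoids the `sample(i, k)` pitfall when length(i) is 1
train_idx <- unlist(lapply(idx_by_level, function(i) {
  i[sample.int(length(i), max(1L, floor(length(i) / 2)))]
}), use.names = FALSE)

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

table(train$f)
table(test$f)
```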
