How to debug "contrasts can be applied only to factors with 2 or more levels" error?

Problem description

Here are all the variables I'm working with:

str(ad.train)
 $ Date                : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24 29 31 34 ...
 $ Team                : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Season              : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Round               : Factor w/ 28 levels "EF","GF","PF",..: 5 16 21 22 23 24 25 26 27 6 ...
 $ Score               : int  137 82 84 96 110 99 122 124 49 111 ...
 $ Margin              : int  69 18 -56 46 19 5 50 69 -26 29 ...
 $ WinLoss             : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 1 2 ...
 $ Opposition          : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 8 18 10 9 13 16 7 3 4 6 ...
 $ Venue               : Factor w/ 19 levels "Adelaide Oval",..: 4 7 10 7 7 13 7 6 7 15 ...
 $ Disposals           : int  406 360 304 370 359 362 365 345 324 351 ...
 $ Kicks               : int  252 215 170 225 221 218 224 230 205 215 ...
 $ Marks               : int  109 102 52 41 95 78 93 110 69 85 ...
 $ Handballs           : int  154 145 134 145 138 144 141 115 119 136 ...
 $ Goals               : int  19 11 12 13 16 15 19 19 6 17 ...
 $ Behinds             : int  19 14 9 16 11 6 7 9 12 6 ...
 $ Hitouts             : int  42 41 34 47 45 70 48 54 46 34 ...
 $ Tackles             : int  73 53 51 76 65 63 65 67 77 58 ...
 $ Rebound50s          : int  28 34 23 24 32 48 39 31 34 29 ...
 $ Inside50s           : int  73 49 49 56 61 45 47 50 49 48 ...
 $ Clearances          : int  39 33 38 52 37 43 43 48 37 52 ...
 $ Clangers            : int  47 38 44 62 49 46 32 24 31 41 ...
 $ FreesFor            : int  15 14 15 18 17 15 19 14 18 20 ...
 $ ContendedPossessions: int  152 141 149 192 138 164 148 151 160 155 ...
 $ ContestedMarks      : int  10 16 11 3 12 12 17 14 15 11 ...
 $ MarksInside50       : int  16 13 10 8 12 9 14 13 6 12 ...
 $ OnePercenters       : int  42 54 30 58 24 56 32 53 50 57 ...
 $ Bounces             : int  1 6 4 4 1 7 11 14 0 4 ...
 $ GoalAssists         : int  15 6 9 10 9 12 13 14 5 14 ...

Here is the glm I'm trying to fit:

ad.glm.all <- glm(WinLoss ~ factor(Team) + Season  + Round + Score  + Margin + Opposition + Venue + Disposals + Kicks + Marks + Handballs + Goals + Behinds + Hitouts + Tackles + Rebound50s + Inside50s+ Clearances+ Clangers+ FreesFor + ContendedPossessions + ContestedMarks + MarksInside50 + OnePercenters + Bounces+GoalAssists, 
                  data = ad.train, family = binomial(logit))

I know it's a lot of variables (the plan is to reduce them via forward variable selection). But even though there are a lot of variables, they're all either int or Factor, which as I understand things should just work with a glm. However, every time I try to fit this model I get:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Which sort of looks to me as if R isn't treating my Factor variables as Factor variables for some reason?

Even something as simple as:

ad.glm.test <- glm(WinLoss ~ factor(Team), data = ad.train, family = binomial(logit))

isn't working! (same error message)

Whereas:

ad.glm.test <- glm(WinLoss ~ Clearances, data = ad.train, family = binomial(logit))

works!

Anyone know what's going on here? Why can't I fit these Factor variables to my glm??

Thanks in advance!

Troy

Recommended answer

Introduction

What a "contrasts error" is has been well explained: you have a factor with only one level (or none). In practice, though, this simple fact is easily obscured, because the data actually used for model fitting can be very different from what you passed in. This happens when you have NA in your data, you've subsetted your data, a factor has unused levels, or you've transformed your variables and got NaN somewhere. You are rarely in the ideal situation where a single-level factor can be spotted directly from str(your_data_frame). Many questions on StackOverflow about this error are not reproducible, so the suggestions people give may or may not work. Therefore, although there are by now 118 posts on this issue, users still can't find an adaptive solution, and the question is raised again and again. This answer is my attempt to solve the matter "once and for all", or at least to provide a reasonable guide.

This answer has rich information, so let me first make a quick summary.

I defined 3 helper functions for you: debug_contr_error, debug_contr_error2, NA_preproc.

I recommend you use them in the following way.

  1. run NA_preproc to get more complete cases;
  2. run your model, and if you get a "contrasts error", use debug_contr_error2 for debugging.

Most of the answer shows you, step by step, how and why these functions are defined. There is probably no harm in skipping that development process, but don't skip the sections from "Reproducible case studies and Discussions" onwards.

The original answer works perfectly for the OP and has helped some others, but it has failed elsewhere for lack of adaptiveness. Look at the output of str(ad.train) in the question: the OP's variables are numeric or factors; there are no characters. The original answer was written for this situation. If you have character variables, they will be coerced to factors during lm and glm fitting, but the original code won't report them: they were not provided as factors, so is.factor misses them. In this expansion I make the original answer more adaptive.

Let dat be your dataset passed to lm or glm. If you don't readily have such a data frame, that is, all your variables are scattered in the global environment, you need to gather them into a data frame. The following may not be the best way but it works.

## `form` is your model formula, here is an example
y <- x1 <- x2 <- x3 <- 1:4
x4 <- matrix(1:8, 4)
form <- y ~ bs(x1) + poly(x2) + I(1 / x3) + x4

## to gather variables `model.frame.default(form)` is the easiest way 
## but it does too much: it drops `NA` and transforms variables
## we want something more primitive

## first get variable names
vn <- all.vars(form)
#[1] "y"  "x1" "x2" "x3" "x4"

## `get_all_vars(form)` gets you a data frame
## but it is buggy for matrix variables so don't use it
## instead, first use `mget` to gather variables into a list
lst <- mget(vn)

## don't do `data.frame(lst)`; it is buggy with matrix variables
## need to first protect matrix variables by `I()` then do `data.frame`
lst_protect <- lapply(lst, function (x) if (is.matrix(x)) I(x) else x)
dat <- data.frame(lst_protect)
str(dat)
#'data.frame':  4 obs. of  5 variables:
# $ y : int  1 2 3 4
# $ x1: int  1 2 3 4
# $ x2: int  1 2 3 4
# $ x3: int  1 2 3 4
# $ x4: 'AsIs' int [1:4, 1:2] 1 2 3 4 5 6 7 8

## note the 'AsIs' for matrix variable `x4`
## in comparison, try the following buggy ones yourself
str(get_all_vars(form))
str(data.frame(lst))

Step 0: explicit subsetting

If you've used the subset argument of lm or glm, start with an explicit subsetting:

## `subset_vec` is what you pass to `lm` via `subset` argument
## it can either be a logical vector of length `nrow(dat)`
## or a shorter positive integer vector giving position index
## note however, `base::subset` expects logical vector for `subset` argument
## so a rigorous check is necessary here
if (mode(subset_vec) == "logical") {
  if (length(subset_vec) != nrow(dat)) {
    stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
    }
  subset_log_vec <- subset_vec
  } else if (mode(subset_vec) == "numeric") {
  ## check range
  ran <- range(subset_vec)
  if (ran[1] < 1 || ran[2] > nrow(dat)) {
    stop("'numeric' `subset_vec` provided but values are out of bound")
    } else {
    subset_log_vec <- logical(nrow(dat))
    subset_log_vec[as.integer(subset_vec)] <- TRUE
    } 
  } else {
  stop("`subset_vec` must be either 'logical' or 'numeric'")
  }
dat <- base::subset(dat, subset = subset_log_vec)

Step 1: remove incomplete cases

dat <- na.omit(dat)

Don't skip this step even if you've gone through step 0: base::subset treats NA in the subsetting condition as FALSE, but it does not remove rows that have NA in other columns.
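It may be worth checking how steps 0 and 1 interact on missing values. The sketch below uses toy data (dat_demo and kept are hypothetical names, not part of the original answer) to compare what base::subset and na.omit each keep:

```r
## toy data: row 2 has a missing `x`
dat_demo <- data.frame(y = 1:4, x = c(1, NA, 3, 4))

## explicit subsetting as in step 0: keep rows 1 to 3
kept <- base::subset(dat_demo, subset = c(TRUE, TRUE, TRUE, FALSE))
n_after_subset <- nrow(kept)   ## the row with NA in `x` is still present
has_na <- anyNA(kept)

## step 1 is what removes the incomplete row
n_after_omit <- nrow(na.omit(kept))
```

In this toy case subsetting leaves three rows, one of them incomplete, and na.omit reduces them to two.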

Step 2: mode checking and conversion

A data frame column is usually an atomic vector, with a mode from the following: "logical", "numeric", "complex", "character", "raw". For regression, variables of different modes are handled differently.

"logical",   it depends
"numeric",   nothing to do
"complex",   not allowed by `model.matrix`, though allowed by `model.frame`
"character", converted to "numeric" with "factor" class by `model.matrix`
"raw",       not allowed by `model.matrix`, though allowed by `model.frame`

A logical variable is tricky. It can either be treated as a dummy variable (1 for TRUE; 0 for FALSE) hence a "numeric", or it can be coerced to a two-level factor. It all depends on whether model.matrix thinks a "to-factor" coercion is necessary from the specification of your model formula. For simplicity we can understand it as such: it is always coerced to a factor, but the result of applying contrasts may end up with the same model matrix as if it were handled as a dummy directly.

Some people may wonder why "integer" is not included. The reason is that an integer vector, like 1:4, has "numeric" mode (try mode(1:4)).

A data frame column may also be a matrix with "AsIs" class, but such a matrix must have "numeric" mode.
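To make the distinction between mode and class concrete, here is a small sketch on a few toy vectors (cols, col_mode and col_class are illustrative names only):

```r
## one toy vector per common column type
cols <- list(int = 1:4,
             dbl = c(1.5, 2.5),
             fac = factor(letters[1:4]),
             chr = letters[1:4],
             lgl = c(TRUE, FALSE))

## mode: integer, double and factor all report "numeric"
col_mode <- vapply(cols, mode, character(1))

## first element of the class vector, for comparison
col_class <- vapply(cols, function(v) class(v)[1], character(1))
```

Only the character and logical vectors have a non-"numeric" mode, which is why the check on mode picks out exactly the columns that still need coercion to factor.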

Our check is to produce an error when

  • a "complex" or "raw" variable is found;
  • a "logical" or "character" matrix variable is found;

and proceed to convert "logical" and "character" to "numeric" of "factor" class.

## get mode of all vars
var_mode <- sapply(dat, mode)

## produce error if complex or raw is found
if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")

## get class of all vars
var_class <- sapply(dat, class)

## produce error if an "AsIs" object has "logical" or "character" mode
if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
  stop("matrix variables with 'AsIs' class must be 'numeric'")
  }

## identify columns that need to be coerced to factors
ind1 <- which(var_mode %in% c("logical", "character"))

## coerce logical / character to factor with `as.factor`
dat[ind1] <- lapply(dat[ind1], as.factor)

Note that if a data frame column is already a factor variable, it will not be included in ind1, as a factor variable has "numeric" mode (try mode(factor(letters[1:4]))).

Step 3: drop unused factor levels

We won't have unused factor levels for factor variables converted in step 2, i.e., those indexed by ind1. However, factor variables that come with dat might have unused levels (often as the result of step 0 and step 1). We need to drop any possible unused levels from them.

## index of factor columns
fctr <- which(sapply(dat, is.factor))

## factor variables that have skipped explicit conversion in step 2
## don't do `ind2 <- fctr[-ind1]`: it indexes `fctr` by position rather than
## by column number, and it is buggy if `ind1` is `integer(0)`
ind2 <- setdiff(fctr, ind1)

## drop unused levels
dat[ind2] <- lapply(dat[ind2], droplevels)

Step 4: summarize factor variables

Now we are ready to see what and how many factor levels are actually used by lm or glm:

## export factor levels actually used by `lm` and `glm`
lev <- lapply(dat[fctr], levels)

## count number of levels
nl <- lengths(lev)


To make your life easier, I've wrapped those steps into a function, debug_contr_error.

Input:

  • dat is your data frame passed to lm or glm via the data argument;
  • subset_vec is the index vector passed to lm or glm via the subset argument.

Output: a list with

  • nlevels (a named integer vector) gives the number of factor levels for all factor variables;
  • levels (a list) gives the levels of all factor variables.

The function produces a warning if there are no complete cases or no factor variables to summarize.

debug_contr_error <- function (dat, subset_vec = NULL) {
  if (!is.null(subset_vec)) {
    ## step 0
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' `subset_vec` provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("`subset_vec` must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    }
  ## step 1 (always needed: `subset` does not drop rows with NA in other columns)
  dat <- stats::na.omit(dat)
  if (nrow(dat) == 0L) warning("no complete cases")
  ## step 2
  var_mode <- sapply(dat, mode)
  if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
  var_class <- sapply(dat, class)
  if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
    stop("matrix variables with 'AsIs' class must be 'numeric'")
    }
  ind1 <- which(var_mode %in% c("logical", "character"))
  dat[ind1] <- lapply(dat[ind1], as.factor)
  ## step 3
  fctr <- which(sapply(dat, is.factor))
  if (length(fctr) == 0L) warning("no factor variables to summarize")
  ## `setdiff` compares column numbers; `fctr[-ind1]` would index by position
  ind2 <- setdiff(fctr, ind1)
  dat[ind2] <- lapply(dat[ind2], base::droplevels.factor)
  ## step 4
  lev <- lapply(dat[fctr], base::levels.default)
  nl <- lengths(lev)
  ## return
  list(nlevels = nl, levels = lev)
  }

Here is a constructed tiny example.

dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

#  y  x f1 f2
#1 1  1  a  A
#2 2  2  a  A
#3 3  3  b  A
#4 4 NA  b  B

str(dat)
#'data.frame':  4 obs. of  4 variables:
# $ y : int  1 2 3 4
# $ x : int  1 2 3 NA
# $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
# $ f2: chr  "A" "A" "A" "B"

lm(y ~ x + f1 + f2, dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

Good, we see an error. Now my debug_contr_error exposes that f2 ends up with a single level.

debug_contr_error(dat)
#$nlevels
#f1 f2 
# 2  1 
#
#$levels
#$levels$f1
#[1] "a" "b"
#
#$levels$f2
#[1] "A"

Note that the original short answer is hopeless here, as f2 is provided as a character variable not a factor variable.

## old answer
tmp <- na.omit(dat)
fctr <- lapply(tmp[sapply(tmp, is.factor)], droplevels)
sapply(fctr, nlevels)
#f1 
# 2 
rm(tmp, fctr)

Now let's see an example with a matrix variable X.

dat <- data.frame(X = I(rbind(matrix(1:6, 3), NA)),
                  f = c("a", "a", "a", "b"),
                  y = 1:4,
                  stringsAsFactors = TRUE)  ## the default before R 4.0.0

dat
#  X.1 X.2 f y
#1   1   4 a 1
#2   2   5 a 2
#3   3   6 a 3
#4  NA  NA b 4

str(dat)
#'data.frame':  4 obs. of  3 variables:
# $ X: 'AsIs' int [1:4, 1:2] 1 2 3 NA 4 5 6 NA
# $ f: Factor w/ 2 levels "a","b": 1 1 1 2
# $ y: int  1 2 3 4

lm(y ~ X + f, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

debug_contr_error(dat)$nlevels
#f 
#1

Note that a factor variable with no levels can cause a "contrasts error", too. You may wonder how a 0-level factor is possible. Well, it is legitimate: nlevels(factor(character(0))). Here you will end up with 0-level factors if you have no complete cases.

dat <- data.frame(y = 1:4,
                  x = rep(NA_real_, 4),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

lm(y ~ x + f1 + f2, dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

debug_contr_error(dat)$nlevels
#f1 f2 
# 0  0    ## all values are 0
#Warning message:
#In debug_contr_error(dat) : no complete cases

Finally, let's look at a situation where f2 is a logical variable.

dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c(TRUE, TRUE, TRUE, FALSE))

dat
#  y  x f1    f2
#1 1  1  a  TRUE
#2 2  2  a  TRUE
#3 3  3  b  TRUE
#4 4 NA  b FALSE

str(dat)
#'data.frame':  4 obs. of  4 variables:
# $ y : int  1 2 3 4
# $ x : int  1 2 3 NA
# $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
# $ f2: logi  TRUE TRUE TRUE FALSE

Our debugger will predict a "contrasts error", but will it really happen?

debug_contr_error(dat)$nlevels
#f1 f2 
# 2  1 

No, at least this one does not fail (the NA coefficient is due to the rank-deficiency of the model; don't worry):

lm(y ~ x + f1 + f2, data = dat)
#Coefficients:
#(Intercept)            x          f1b       f2TRUE  
#          0            1            0           NA

It is difficult for me to come up with an example giving an error, but there is also no need. In practice, we don't use the debugger for prediction; we use it when we really get an error; and in that case, the debugger can locate the offending factor variable.

Perhaps some may argue that a logical variable is no different from a dummy. But try the simple example below: it does depend on your formula.

u <- c(TRUE, TRUE, FALSE, FALSE)
v <- c(1, 1, 0, 0)  ## "numeric" dummy of `u`

model.matrix(~ u)
#  (Intercept) uTRUE
#1           1     1
#2           1     1
#3           1     0
#4           1     0

model.matrix(~ v)
#  (Intercept) v
#1           1 1
#2           1 1
#3           1 0
#4           1 0

model.matrix(~ u - 1)
#  uFALSE uTRUE
#1      0     1
#2      0     1
#3      1     0
#4      1     0

model.matrix(~ v - 1)
#  v
#1 1
#2 1
#3 0
#4 0


More flexible implementation using "model.frame" method of lm

You are also advised to go through R: how to debug "factor has new levels" error for linear model and prediction, which explains what lm and glm do under the hood on your dataset. You will understand that steps 0 to 4 listed above are just trying to mimic such internal process. Remember, the data that are actually used for model fitting can be very different from what you've passed in.

Our steps are not completely consistent with such internal processing. For a comparison, you can retrieve the result of the internal processing by using method = "model.frame" in lm and glm. Try this on the previously constructed tiny example dat where f2 is a character variable.

dat_internal <- lm(y ~ x + f1 + f2, dat, method = "model.frame")

dat_internal
#  y x f1 f2
#1 1 1  a  A
#2 2 2  a  A
#3 3 3  b  A

str(dat_internal)
#'data.frame':  3 obs. of  4 variables:
# $ y : int  1 2 3
# $ x : int  1 2 3
# $ f1: Factor w/ 2 levels "a","b": 1 1 2
# $ f2: chr  "A" "A" "A"
## [.."terms" attribute is truncated..]

In practice, model.frame will only perform step 0 and step 1. It also drops variables provided in your dataset but not in your model formula. So a model frame may have both fewer rows and columns than what you feed lm and glm. Type coercing as done in our step 2 is done by the later model.matrix where a "contrasts error" may be produced.
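The division of labour between the model frame and the model matrix can be checked directly. This is a minimal sketch on toy data (dat_demo, mf and res are hypothetical names); it builds the model frame first and then hands it to model.matrix, catching the error instead of stopping:

```r
## toy data: after dropping the incomplete row 4, `f` only takes the value "A"
dat_demo <- data.frame(y = 1:4,
                       x = c(1:3, NA),
                       f = c("A", "A", "A", "B"),
                       stringsAsFactors = FALSE)

## building the model frame succeeds: row 4 is dropped, `f` is still character
mf <- lm(y ~ x + f, data = dat_demo, method = "model.frame")

## building the model matrix is where the coercion to factor happens,
## so this is the step that fails
res <- tryCatch(model.matrix(attr(mf, "terms"), mf),
                error = function(e) e)
failed <- inherits(res, "error")
```

Here mf has three rows and a character column f, and failed is TRUE, which is consistent with the error being raised at the model matrix stage rather than by model.frame.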

There are a few advantages to first get this internal model frame, then pass it to debug_contr_error (so that it only essentially performs steps 2 to 4).

Advantage 1: variables not used in the model formula are ignored

## no variable `f1` in formula
dat_internal <- lm(y ~ x + f2, dat, method = "model.frame")

## compare the following
debug_contr_error(dat)$nlevels
#f1 f2 
# 2  1 

debug_contr_error(dat_internal)$nlevels
#f2 
# 1 

Advantage 2: able to cope with transformed variables

It is valid to transform variables in the model formula, and model.frame will record the transformed ones instead of the original ones. Note that, even if your original variable has no NA, the transformed one can have some.

dat <- data.frame(y = 1:4, x = c(1:3, -1), f = rep(letters[1:2], c(3, 1)))
#  y  x f
#1 1  1 a
#2 2  2 a
#3 3  3 a
#4 4 -1 b

lm(y ~ log(x) + f, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels
#In addition: Warning message:
#In log(x) : NaNs produced

# directly using `debug_contr_error` is hopeless here
debug_contr_error(dat)$nlevels
#f 
#2 

## this works
dat_internal <- lm(y ~ log(x) + f, data = dat, method = "model.frame")
#  y    log(x) f
#1 1 0.0000000 a
#2 2 0.6931472 a
#3 3 1.0986123 a

debug_contr_error(dat_internal)$nlevels
#f 
#1

Given these benefits, I write another function wrapping up model.frame and debug_contr_error.

Input:

  • form is your model formula;
  • dat is the dataset passed to lm or glm via the data argument;
  • subset_vec is the index vector passed to lm or glm via the subset argument.

Output: a list with

  • mf (a data frame) gives the model frame (with the "terms" attribute dropped);
  • nlevels (a named integer vector) gives the number of factor levels for all factor variables;
  • levels (a list) gives the levels of all factor variables.

## note: this function relies on `debug_contr_error`
debug_contr_error2 <- function (form, dat, subset_vec = NULL) {
  ## step 0
  if (!is.null(subset_vec)) {
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' `subset_vec` provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("`subset_vec` must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    }
  ## step 0 and 1
  dat_internal <- stats::lm(form, data = dat, method = "model.frame")
  attr(dat_internal, "terms") <- NULL
  ## rely on `debug_contr_error` for steps 2 to 4
  c(list(mf = dat_internal), debug_contr_error(dat_internal, NULL))
  }

Try the previous log transform example.

debug_contr_error2(y ~ log(x) + f, dat)
#$mf
#  y    log(x) f
#1 1 0.0000000 a
#2 2 0.6931472 a
#3 3 1.0986123 a
#
#$nlevels
#f 
#1 
#
#$levels
#$levels$f
#[1] "a"
#
#
#Warning message:
#In log(x) : NaNs produced

Try subset_vec as well.

## or: debug_contr_error2(y ~ log(x) + f, dat, c(T, F, T, T))
debug_contr_error2(y ~ log(x) + f, dat, c(1,3,4))
#$mf
#  y   log(x) f
#1 1 0.000000 a
#3 3 1.098612 a
#
#$nlevels
#f 
#1 
#
#$levels
#$levels$f
#[1] "a"
#
#
#Warning message:
#In log(x) : NaNs produced


Model fitting per group and NA as factor levels

If you are fitting models by group, you are more likely to get a "contrasts error". You need to

  1. split your data frame by the grouping variable (see ?split.data.frame);
  2. work through those data frames one by one, applying debug_contr_error2 (the lapply function can be helpful for this loop).
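The two steps above can be sketched as follows. To keep the block self-contained it does not call debug_contr_error2; it retrieves the model frame with lm(method = "model.frame") directly and counts levels, and the data and names (dat_demo, by_group, levels_used) are made up for illustration:

```r
## toy data: in group "g2" the factor `f` only ever takes the value "a"
dat_demo <- data.frame(
  g = rep(c("g1", "g2"), each = 4),
  y = c(1.2, 2.1, 2.9, 4.3, 0.8, 2.2, 3.1, 3.9),
  f = factor(c("a", "b", "a", "b", "a", "a", "a", "a"))
)

## 1. split the data frame by the grouping variable
by_group <- split(dat_demo, dat_demo$g)

## 2. for each group, get the model frame `lm` would use
##    and count the factor levels that remain in it
levels_used <- vapply(by_group, function(d) {
  mf <- lm(y ~ f, data = d, method = "model.frame")
  nlevels(mf$f)
}, integer(1))
```

In this toy case levels_used is 2 for g1 and 1 for g2, so g2 is the group where fitting y ~ f would raise the error.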

Some people have also told me that they cannot use na.omit on their data, because it would leave too few rows to do anything sensible. This can be relaxed. In practice it is NA_integer_ and NA_real_ that have to be omitted, whereas NA_character_ can be retained: just add NA as a factor level. To achieve this, you need to loop through the variables in your data frame:

  • if a variable x is already a factor and anyNA(x) is TRUE, do x <- addNA(x). The "and" is important. If x has no NA, addNA(x) will add an unused <NA> level.
  • if a variable x is a character, do x <- factor(x, exclude = NULL) to coerce it to a factor. exclude = NULL will retain <NA> as a level.
  • if x is "logical", "numeric", "raw" or "complex", nothing should be changed. NA is just NA.

An <NA> factor level will not be dropped by droplevels or na.omit, and it is valid for building a model matrix. Check the following examples.

## x is a factor with NA

x <- factor(c(letters[1:4], NA))  ## default: `exclude = NA`
#[1] a    b    c    d    <NA>     ## there is an NA value
#Levels: a b c d                  ## but NA is not a level

na.omit(x)  ## NA is gone
#[1] a b c d
#[.. attributes truncated..]
#Levels: a b c d

x <- addNA(x)  ## now add NA into a valid level
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>  ## it appears here

droplevels(x)    ## it can not be dropped
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>

na.omit(x)  ## it is not omitted
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>

model.matrix(~ x)   ## and it is valid to be in a design matrix
#  (Intercept) xb xc xd xNA
#1           1  0  0  0   0
#2           1  1  0  0   0
#3           1  0  1  0   0
#4           1  0  0  1   0
#5           1  0  0  0   1

## x is a character with NA

x <- c(letters[1:4], NA)
#[1] "a" "b" "c" "d" NA 

as.factor(x)  ## this calls `factor(x)` with default `exclude = NA`
#[1] a    b    c    d    <NA>     ## there is an NA value
#Levels: a b c d                  ## but NA is not a level

factor(x, exclude = NULL)      ## we want `exclude = NULL`
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>          ## now NA is a level

Once you add NA as a level in a factor / character, your dataset might suddenly have more complete cases. Then you can run your model. If you still get a "contrasts error", use debug_contr_error2 to see what has happened.
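The gain in complete cases can be checked with complete.cases. A small sketch on toy data (dat_demo, n_before and n_after are illustrative names):

```r
## toy data: the character column `f` has one missing value
dat_demo <- data.frame(y = 1:4,
                       f = c("a", NA, "b", "b"),
                       stringsAsFactors = FALSE)

## before: the row with the missing `f` is incomplete
n_before <- sum(complete.cases(dat_demo))

## keep NA as a factor level instead of a missing value
dat_demo$f <- factor(dat_demo$f, exclude = NULL)

## after: every row counts as complete
n_after <- sum(complete.cases(dat_demo))
```

Here the number of complete cases goes from 3 to 4, because the former missing value is now an ordinary level of f.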

For your convenience, I write a function for this NA preprocessing.

Input:

  • dat is your full dataset.

Output:

  • a data frame, with NA added as a level for factor / character variables.

NA_preproc <- function (dat) {
  for (j in seq_len(ncol(dat))) {  ## `seq_len` is safe when `dat` has no columns
    x <- dat[[j]]
    if (is.factor(x) && anyNA(x)) dat[[j]] <- base::addNA(x)
    if (is.character(x)) dat[[j]] <- factor(x, exclude = NULL)
    }
  dat
  }


Reproducible case studies and Discussions

The following threads were specially selected as reproducible case studies, as I have just answered them with the three helper functions created here.

  • How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?
  • R: Error in contrasts when fitting linear models with `lm`

There are also a few other good-quality threads solved by other StackOverflow users:

  • Factors not being recognised in a lm using map() (this is about model fitting by group)
  • How to drop NA observation of factors conditionally when doing linear regression in R? (this is similar to case 1 in the previous list)
  • Factor/level error in mixed model (another post about model fitting by group)

This answer aims to debug the "contrasts error" during model fitting. However, this error can also turn up when using predict for prediction. Such behavior is not with predict.lm or predict.glm, but with predict methods from some packages. Here are a few related threads on StackOverflow.

  • Prediction in R - GLMM
  • Error in `contrasts' Error
  • SVM predict on dataframe with different factor levels
  • Using predict with svyglm
  • must a dataset contain all factors in SVM in R
  • Probability predictions with cumulative link mixed models

Also note that the philosophy of this answer is based on that of lm and glm. These two functions are a coding standard for many model fitting routines, but perhaps not all model fitting routines behave similarly. For example, it is not clear to me whether my helper functions would actually help in the following cases.

  • Error with svychisq - 'contrast can be applied to factors with 2 or more levels'
  • R packages effects & plm : "error in contrasts" when trying to plot marginal effects
  • Contrasts can be applied only to factor
  • R: lawstat::levene.test fails while Fligner Killeen works, as well as car::leveneTest
  • R - geeglm Error: contrasts can be applied only to factors with 2 or more levels

Although a bit off-topic, it is still useful to know that sometimes a "contrasts error" merely comes from writing a wrong piece of code. In the following examples, OP passed the name of their variables rather than their values to lm. Since a name is a single value character, it is later coerced to a single-level factor and causes the error.

  • Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
  • Loop through a character vector to use in a function
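One way to avoid this class of mistake when looping over variable names is to build each formula from the name, for instance with stats::reformulate. The sketch below is not taken from the linked threads; the data and the names dat_demo, var_names and fits are assumptions made for illustration:

```r
## toy data with two candidate predictors
dat_demo <- data.frame(y  = c(1.0, 2.1, 2.9, 4.2),
                       x1 = c(1, 2, 3, 4),
                       x2 = c(2, 1, 4, 3))
var_names <- c("x1", "x2")

## build `y ~ x1`, `y ~ x2` from the names, so that `lm` looks the
## variables up in `data` instead of receiving a length-one string
fits <- lapply(var_names, function(v) {
  lm(reformulate(v, response = "y"), data = dat_demo)
})
names(fits) <- var_names
```

Each element of fits is then an ordinary lm fit whose coefficient is named after the predictor, for example names(coef(fits[["x2"]])).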

In practice people want to know how to resolve this matter, either at a statistical level or a programming level.

If you are fitting models on your complete dataset, then there is probably no statistical solution, unless you can impute missing values or collect more data. Thus you may simply turn to a coding solution to drop the offending variable. debug_contr_error2 returns nlevels which helps you easily locate them. If you don't want to drop them, replace them by a vector of 1 (as explained in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?) and let lm or glm deal with the resulting rank-deficiency.
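A rough sketch of the "drop the offending variable" route follows. It assumes every term in the formula is a plain variable name (no transformations or interactions), and dat_demo, form, bad, keep and form_ok are hypothetical names; to stay self-contained it counts levels on the model frame directly instead of calling debug_contr_error2:

```r
## toy data: the factor `f` has a single level
dat_demo <- data.frame(y = c(1.1, 1.9, 3.2, 3.8),
                       x = c(1, 2, 3, 4),
                       f = factor(c("a", "a", "a", "a")))
form <- y ~ x + f

## locate factors with fewer than 2 levels in the model frame
mf <- lm(form, data = dat_demo, method = "model.frame")
is_fac <- vapply(mf, is.factor, logical(1))
n_lev <- vapply(mf[is_fac], nlevels, integer(1))
bad <- names(n_lev)[n_lev < 2L]

## rebuild the formula without them and refit
keep <- setdiff(attr(terms(form), "term.labels"), bad)
form_ok <- reformulate(keep, response = "y")
fit <- lm(form_ok, data = dat_demo)
```

With this toy data bad is "f" and the refitted model is y ~ x. If every predictor turns out to be offending, keep is empty and the formula needs separate handling (for example an intercept-only model).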

If you are fitting models on subsets, there can be statistical solutions.

Fitting models by group does not necessarily require you to split your dataset by group and fit independent models. The following may give you a rough idea:

  • R regression analysis: analyzing data for a certain ethnicity
  • Finding the slope for multiple points in selected columns
  • R: build separate models for each category

If you do split your data explicitly, you can easily get a "contrasts error", and then have to adjust your model formula per group (that is, you need to dynamically generate model formulae). A simpler solution is to skip building a model for the affected group.
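Skipping a group can be done by catching the error. The following is a minimal sketch on made-up data (dat_demo, fits and fits_ok are illustrative names); note that it silently skips a group on any fitting error, not only on a "contrasts error":

```r
## toy data: in group "g2" the factor `f` has only one level in use
dat_demo <- data.frame(
  g = rep(c("g1", "g2"), each = 4),
  y = c(1.2, 2.1, 2.9, 4.3, 0.8, 2.2, 3.1, 3.9),
  f = factor(c("a", "b", "a", "b", "a", "a", "a", "a"))
)

## fit per group; return NULL for a group where fitting fails
fits <- lapply(split(dat_demo, dat_demo$g), function(d) {
  tryCatch(lm(y ~ f, data = d), error = function(e) NULL)
})

## keep only the groups that produced a model
fits_ok <- Filter(Negate(is.null), fits)
```

Here fits_ok contains a model for g1 only; g2 is skipped.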

You may also randomly partition your dataset into a training subset and a testing subset so that you can do cross-validation. R: how to debug "factor has new levels" error for linear model and prediction briefly mentions this; it is better to use stratified sampling, to ensure that both model estimation on the training part and prediction on the testing part succeed.
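A stratified split can be sketched as below, under the assumption that a single factor f defines the strata and that about half of each stratum goes to training; all names (dat_demo, train_idx, train, test) and the 50% fraction are illustrative choices:

```r
set.seed(1)

## toy data: the levels of `f` are unbalanced (6, 4 and 2 rows)
dat_demo <- data.frame(y = rnorm(12),
                       f = factor(rep(c("a", "b", "c"), times = c(6, 4, 2))))

## sample row numbers within each level of `f`,
## taking at least one row per level for the training part
train_idx <- unlist(lapply(split(seq_len(nrow(dat_demo)), dat_demo$f),
                           function(idx) {
  n_take <- max(1L, floor(0.5 * length(idx)))
  idx[sample.int(length(idx), n_take)]
}))

train <- dat_demo[train_idx, ]
test  <- dat_demo[-train_idx, ]
```

Every level of f then appears in train, so a model fitted on train will not meet a level in test that it has never seen. With several factors, the strata would have to be defined on their combination, which this sketch does not attempt.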
