R:将线性模型与`lm`拟合时的对比误差 [英] R: Error in contrasts when fitting linear models with `lm`

查看:92
本文介绍了R:将线性模型与`lm`拟合时的对比误差的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我发现定义时形成对比的错误R 中的线性模型并遵循了那里的建议,但是我的因子变量都没有一个值,并且仍然遇到相同的问题.

这是我正在使用的数据集: https://www.dropbox.com/s/em7xphbeaxykgla/train.csv?dl=0 .

这是我要运行的代码:

simplelm <- lm(log_SalePrice ~ ., data = train)

#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels

出了什么问题?

解决方案

感谢您提供数据集(我希望该链接永远有效,以便每个人都可以访问).我将其读入数据框train.

使用如何调试对比度"提供的debug_contr_errordebug_contr_error2NA_preproc辅助功能.到具有2个或更多水平的因子"错误?,我们可以轻松地分析问题.

info <- debug_contr_error2(log_SalePrice ~ ., train)

## the data frame that is actually used by `lm`
dat <- info$mf

## number of cases in your dataset
nrow(train)
#[1] 1460

## number of complete cases used by `lm`
nrow(dat)
#[1] 1112

## number of levels for all factor variables in `dat`
info$nlevels
#     MSZoning        Street         Alley      LotShape   LandContour 
#            4             2             3             4             4 
#    Utilities     LotConfig     LandSlope  Neighborhood    Condition1 
#            1             5             3            25             9 
#   Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl 
#            6             5             8             5             7 
#  Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond 
#           14            16             4             4             4 
#   Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1 
#            6             5             5             5             7 
# BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical 
#            7             5             5             2             5 
#  KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish 
#            4             6             6             6             3 
#   GarageQual    GarageCond    PavedDrive        PoolQC         Fence 
#            5             5             3             4             5 
#  MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold 
#            4             9             6             2            55 

如您所见,Utilities是有问题的变量,因为它只有1级.

由于train中有许多字符/因子变量,所以我想知道您是否为它们提供NA.如果将NA添加为有效级别,则可能会得到更完整的案例.

new_train <- NA_preproc(train)

new_info <- debug_contr_error2(log_SalePrice ~ ., new_train)

new_dat <- new_info$mf

nrow(new_dat)
#[1] 1121

new_info$nlevels
#     MSZoning        Street         Alley      LotShape   LandContour 
#            5             2             3             4             4 
#    Utilities     LotConfig     LandSlope  Neighborhood    Condition1 
#            1             5             3            25             9 
#   Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl 
#            6             5             8             5             7 
#  Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond 
#           14            16             4             4             4 
#   Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1 
#            6             5             5             5             7 
# BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical 
#            7             5             5             2             6 
#  KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish 
#            4             6             6             6             3 
#   GarageQual    GarageCond    PavedDrive        PoolQC         Fence 
#            5             5             3             4             5 
#  MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold 
#            4             9             6             2            55

我们确实可以获得更完整的案例,但是Utilities仍然只有一个级别.这意味着大多数不完整的情况实际上是由您的数值变量中的NA引起的,我们无法采取任何措施(除非您有统计上有效的方法来估算那些缺失值).

由于您只有一个单级因子变量,因此使用与如何在对比度可以达到"的情况下执行GLM相同的方法仅适用于具有2个或更多水平的因子"?.

new_dat$Utilities <- 1

simplelm <- lm(log_SalePrice ~ 0 + ., data = new_dat)

模型现在成功运行.但是,它是排名不足.您可能想采取一些措施来解决该问题,但可以将其保留为好.

b <- coef(simplelm)

length(b)
#[1] 301

sum(is.na(b))
#[1] 9

simplelm$rank
#[1] 292

I've found Error in contrasts when defining a linear model in R and have followed the suggestions there, but none of my factor variables take on only one value and I am still experiencing the same issue.

This is the dataset I'm using: https://www.dropbox.com/s/em7xphbeaxykgla/train.csv?dl=0.

This is the code I'm trying to run:

simplelm <- lm(log_SalePrice ~ ., data = train)

#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels

What is the issue?

解决方案

Thanks for providing your dataset (I hope that link will forever be valid so that everyone can access). I read it into a data frame train.

Using the debug_contr_error, debug_contr_error2 and NA_preproc helper functions provided by How to debug "contrasts can be applied only to factors with 2 or more levels" error?, we can easily analyze the problem.

info <- debug_contr_error2(log_SalePrice ~ ., train)

## the data frame that is actually used by `lm`
dat <- info$mf

## number of cases in your dataset
nrow(train)
#[1] 1460

## number of complete cases used by `lm`
nrow(dat)
#[1] 1112

## number of levels for all factor variables in `dat`
info$nlevels
#     MSZoning        Street         Alley      LotShape   LandContour 
#            4             2             3             4             4 
#    Utilities     LotConfig     LandSlope  Neighborhood    Condition1 
#            1             5             3            25             9 
#   Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl 
#            6             5             8             5             7 
#  Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond 
#           14            16             4             4             4 
#   Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1 
#            6             5             5             5             7 
# BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical 
#            7             5             5             2             5 
#  KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish 
#            4             6             6             6             3 
#   GarageQual    GarageCond    PavedDrive        PoolQC         Fence 
#            5             5             3             4             5 
#  MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold 
#            4             9             6             2            55 

As you can see, Utilities is the offending variable here as it has only 1 level.

Since you have many character / factor variables in train, I wonder whether you have NA for them. If we add NA as a valid level, we could possibly get more complete cases.

new_train <- NA_preproc(train)

new_info <- debug_contr_error2(log_SalePrice ~ ., new_train)

new_dat <- new_info$mf

nrow(new_dat)
#[1] 1121

new_info$nlevels
#     MSZoning        Street         Alley      LotShape   LandContour 
#            5             2             3             4             4 
#    Utilities     LotConfig     LandSlope  Neighborhood    Condition1 
#            1             5             3            25             9 
#   Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl 
#            6             5             8             5             7 
#  Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond 
#           14            16             4             4             4 
#   Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1 
#            6             5             5             5             7 
# BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical 
#            7             5             5             2             6 
#  KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish 
#            4             6             6             6             3 
#   GarageQual    GarageCond    PavedDrive        PoolQC         Fence 
#            5             5             3             4             5 
#  MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold 
#            4             9             6             2            55

We do get more complete cases, but Utilities still has one level. This means that most incomplete cases are actually caused by NA in your numerical variables, which we can do nothing (unless you have a statistically valid way to impute those missing values).

As you only have one single-level factor variable, the same method as given in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"? will work.

new_dat$Utilities <- 1

simplelm <- lm(log_SalePrice ~ 0 + ., data = new_dat)

The model now runs successfully. However, it is rank-deficient. You probably want to do something to address it, but leaving it as it is is fine.

b <- coef(simplelm)

length(b)
#[1] 301

sum(is.na(b))
#[1] 9

simplelm$rank
#[1] 292

这篇关于R:将线性模型与`lm`拟合时的对比误差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆