Does R always return NA as a coefficient as a result of linear regression with unnecessary variables?


Question

My question is about unnecessary predictors, namely variables that provide no new linear information because they are linear combinations of the other predictors. As you can see, the swiss dataset has six variables.

data(swiss)  # swiss is a dataset in the datasets package, not a package
names(swiss)
# "Fertility"        "Agriculture"      "Examination"      "Education"        
# "Catholic"      "Infant.Mortality"

Now I introduce a new variable ec. It is a linear combination of Examination and Catholic.

ec <- swiss$Examination + swiss$Catholic

When we run a linear regression with unnecessary variables, R drops terms that are linear combinations of other terms and returns NA as their coefficients. The command below illustrates the point.

lm(Fertility ~ . + ec, swiss)

Coefficients:
     (Intercept)       Agriculture       Examination         Education
         66.9152           -0.1721           -0.2580           -0.8709
        Catholic  Infant.Mortality                ec
          0.1041            1.0770                NA

However, when we put ec first, ahead of all the other regressors, as shown below,

lm(Fertility ~ ec + ., swiss)

Coefficients:
     (Intercept)                ec       Agriculture       Examination
         66.9152            0.1041           -0.1721           -0.3621
       Education          Catholic  Infant.Mortality
         -0.8709                NA            1.0770

I would expect the coefficients of both Catholic and Examination to be NA. The variable ec is a linear combination of the two, but in the end the coefficient of Examination is not NA whereas that of Catholic is.

Can someone explain why?

Recommended answer

Will there be NA?

Yes. Adding these columns does not enlarge the column space, so the resulting model matrix is rank-deficient.
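One way to see the rank deficiency directly is to compare the number of columns of the model matrix with its numerical rank. This is a minimal sketch that rebuilds the question's setup; the names `d`, `mm`, `n_coef` and `mm_rank` are introduced here only for illustration.

```r
# swiss ships with R's datasets package, so no library() call is needed
d <- swiss
d$ec <- d$Examination + d$Catholic      # the redundant column from the question

# Model matrix: intercept + 5 original predictors + ec
mm <- model.matrix(Fertility ~ ., data = d)

n_coef  <- ncol(mm)       # coefficients the model asks for
mm_rank <- qr(mm)$rank    # numerical rank from R's default QR

c(coefficients = n_coef, rank = mm_rank)
```

The rank falls one short of the column count, which is what "rank-deficient" means here.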

How many NA?

It depends on the numerical rank.

number of NA = number of coefficients - rank of model matrix

In your example, after introducing ec, there will be one NA. Changing the order in which covariates are specified in the model formula amounts to shuffling the columns of the model matrix. This does not change the matrix rank, so you will always get exactly one NA regardless of the specification order.
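The counting rule can be checked on the question's two orderings. In this sketch the formulas are written out term by term so that the column order is explicit; `count_na` is a hypothetical helper, not part of any package.

```r
d <- swiss
d$ec <- d$Examination + d$Catholic

# Same seven columns, two different orders
fit_last  <- lm(Fertility ~ Agriculture + Examination + Education +
                  Catholic + Infant.Mortality + ec, data = d)
fit_first <- lm(Fertility ~ ec + Agriculture + Examination + Education +
                  Catholic + Infant.Mortality, data = d)

# number of NA = number of coefficients - rank of model matrix
count_na <- function(fit) {
  c(n_coef = length(coef(fit)),
    rank   = fit$rank,
    n_na   = sum(is.na(coef(fit))))
}

rbind(ec_last = count_na(fit_last), ec_first = count_na(fit_first))
```

Both rows should show the same rank and a single NA, since only the column order differs.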

OK, but which one is NA?

lm does LINPACK QR factorization with restricted column pivoting. The order of covariates affects which one is NA. Generally, a "first come, first served" principle holds, so the position of NA is quite predictable. Take your examples. In the first specification, the collinear terms appear in the order Examination, Catholic, ec, so the third one, ec, gets the NA coefficient. In the second specification, they appear in the order ec, Examination, Catholic, so the third one, Catholic, gets the NA coefficient. Note that the coefficient estimates are not invariant to specification order, although the fitted values are.
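The sketch below checks both claims on the swiss example: which term is dropped in each ordering, and that the fitted values agree even though the coefficients differ. It also shows where the changed Examination coefficient comes from: since ec = Examination + Catholic, dropping Catholic makes ec carry Catholic's coefficient and Examination absorb the difference. All object names are illustrative.

```r
d <- swiss
d$ec <- d$Examination + d$Catholic

fit_last  <- lm(Fertility ~ Agriculture + Examination + Education +
                  Catholic + Infant.Mortality + ec, data = d)
fit_first <- lm(Fertility ~ ec + Agriculture + Examination + Education +
                  Catholic + Infant.Mortality, data = d)

# Term that receives NA in each ordering
na_last  <- names(which(is.na(coef(fit_last))))
na_first <- names(which(is.na(coef(fit_first))))

# Fitted values are compared up to floating-point tolerance
same_fit <- isTRUE(all.equal(fitted(fit_last), fitted(fit_first)))

# Examination(ec first) should equal Examination(ec last) - Catholic(ec last)
b_last  <- coef(fit_last)
b_first <- coef(fit_first)
exam_shift <- unname(b_last["Examination"] - b_last["Catholic"])

list(na_last = na_last, na_first = na_first, same_fit = same_fit,
     exam_first = unname(b_first["Examination"]), exam_shift = exam_shift)
```

With the coefficients printed in the question, the shift works out as -0.2580 - 0.1041 = -0.3621, matching the second output.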

If LAPACK QR factorization with complete column pivoting were used, the coefficient estimates would be invariant to specification order. However, the position of NA would not be as predictable as in the LINPACK case; it is decided purely numerically.
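Base R exposes a LAPACK QR with column pivoting through `qr(x, LAPACK = TRUE)`, which makes it possible to look at the two pivoting schemes side by side. This is only a sketch of the factorizations themselves; lm does not use the LAPACK path. As I read the `qr` help page, the `rank` component is always reported as full in the LAPACK case, so the deficiency is read off the diagonal of R instead.

```r
d <- swiss
d$ec <- d$Examination + d$Catholic
mm <- model.matrix(Fertility ~ ., data = d)   # ec is the last column

q_linpack <- qr(mm)                  # default: limited pivoting, as in lm
q_lapack  <- qr(mm, LAPACK = TRUE)   # LAPACK QR with column pivoting

# Column order chosen by each method
colnames(mm)[q_linpack$pivot]
colnames(mm)[q_lapack$pivot]

# Relative size of the diagonal of R under LAPACK; a value near zero in
# the last position signals the rank deficiency
r_diag <- abs(diag(qr.R(q_lapack)))
r_diag / r_diag[1]
```

The default QR leaves the columns in the order given and only pushes a deficient column to the end, while the LAPACK pivot order is chosen from the data.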

LAPACK-based QR factorization is implemented in the mgcv package. The numerical rank is detected when REML estimation is used, and unidentifiable coefficients are reported as 0 (not NA). So we can compare lm with gam / bam in linear model estimation. Let's first construct a toy dataset.

set.seed(0)

# an initial full rank matrix
X <- matrix(runif(500 * 10), 500)
# make the last column as a random linear combination of previous 9 columns
X[, 10] <- X[, -10] %*% runif(9)

# a random response
Y <- rnorm(500)
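As a sanity check, the toy design should have exactly one redundant direction once the intercept is added, so a single NA (or a single 0) is expected in each fit. The sketch rebuilds X with the same seed; `mm` is an illustrative name.

```r
set.seed(0)
X <- matrix(runif(500 * 10), 500)
X[, 10] <- X[, -10] %*% runif(9)

mm <- cbind(1, X)   # intercept column plus the 10 predictors
c(columns = ncol(mm), rank = qr(mm)$rank)
```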

Now we shuffle the columns of X to see whether NA changes its position under lm estimation, or whether 0 changes its position under gam and bam estimation.

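# Refit the model on 10 random column shuffles of X and report the
# coefficients mapped back to the original column order.
test <- function (fun = lm, seed = 0, ...) {
  shuffleFit <- function (fun) {
    shuffle <- sample.int(ncol(X))        # random column permutation
    Xs <- X[, shuffle]
    b <- unname(coef(fun(Y ~ Xs, ...)))   # intercept first, then slopes
    back <- order(shuffle)                # inverse permutation
    c(b[1], b[-1][back])                  # slopes back in original order
  }
  set.seed(seed)
  oo <- t(replicate(10, shuffleFit(fun)))
  colnames(oo) <- c("intercept", paste0("X", 1:ncol(X)))
  oo
}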


First we check with lm:

test(fun = lm)

We see that NA changes its position with column shuffling of X. Estimated coefficients vary, too.
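In this toy dataset all ten columns take part in the single linear dependency, so the "first come, first served" rule suggests that the NA should land on whichever original column happens to be placed last by the shuffle. The sketch below checks that reading; `na_vs_last` is a hypothetical helper, and the seed used for shuffling is arbitrary.

```r
set.seed(0)
X <- matrix(runif(500 * 10), 500)
X[, 10] <- X[, -10] %*% runif(9)
Y <- rnorm(500)

# For one random shuffle, return the original index of the column that lm
# reports as NA together with the original index of the column placed last
na_vs_last <- function() {
  shuffle <- sample.int(ncol(X))
  Xs <- X[, shuffle]
  b <- unname(coef(lm(Y ~ Xs)))[-1]          # drop the intercept
  c(na_col = shuffle[which(is.na(b))], last_col = shuffle[ncol(X)])
}

set.seed(1)
res <- t(replicate(10, na_vs_last()))
res
```

If the two columns of `res` agree row by row, the position of NA under lm is determined by column order alone.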

Now we check with gam:

library(mgcv)
test(fun = gam, method = "REML")

We see that the estimates are invariant to column shuffling of X, and the coefficient for X5 is always 0.

Finally we check bam. (bam is slow for a small dataset like this one; it is designed for large or very large datasets, so the following is noticeably slower.)

test(fun = bam, gc.level = -1)

The result is the same as what we saw for gam.
