Does R always return NA as a coefficient as a result of linear regression with unnecessary variables?
Question
My question is about unnecessary predictors, namely variables that provide no new linear information because they are linear combinations of the other predictors. As you can see, the swiss dataset has six variables.
data(swiss)  # swiss is a built-in dataset (package datasets), not a package to load
names(swiss)
# "Fertility" "Agriculture" "Examination" "Education"
# "Catholic" "Infant.Mortality"
Now I introduce a new variable ec. It is a linear combination of Examination and Catholic.
ec <- swiss$Examination + swiss$Catholic
When we run a linear regression with unnecessary variables, R drops terms that are linear combinations of other terms and returns NA as their coefficients. The command below illustrates the point.
lm(Fertility ~ . + ec, swiss)
Coefficients:
     (Intercept)       Agriculture       Examination         Education
         66.9152           -0.1721           -0.2580           -0.8709
        Catholic  Infant.Mortality                ec
          0.1041            1.0770                NA
However, when we put ec first and then all the other regressors, as shown below,
lm(Fertility ~ ec + ., swiss)
Coefficients:
     (Intercept)                ec       Agriculture       Examination
         66.9152            0.1041           -0.1721           -0.3621
       Education          Catholic  Infant.Mortality
         -0.8709                NA            1.0770
I would expect the coefficients of both Catholic and Examination to be NA. The variable ec is a linear combination of the two, but in the end the coefficient of Examination is not NA, whereas that of Catholic is NA.
Could anyone explain why?
Recommended answer
Will there be NA?
Yes. Adding such columns does not enlarge the column space, so the resulting model matrix is rank-deficient.
How many NA?
That depends on the numerical rank.
number of NA = number of coefficients - rank of model matrix
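The identity above can be checked directly on the question's example. This is a minimal sketch in base R; the names fit, X_design, n_coef, rank_X and n_na are illustrative and not from the original post.

```r
# swiss is a built-in dataset, so no library() call is needed
ec <- swiss$Examination + swiss$Catholic

fit <- lm(Fertility ~ . + ec, data = swiss)

# model matrix used by lm: intercept + 5 predictors + ec
X_design <- model.matrix(fit)

n_coef <- ncol(X_design)          # number of coefficients
rank_X <- fit$rank                # numerical rank found by lm's pivoted QR
n_na   <- sum(is.na(coef(fit)))   # number of NA coefficients

c(coefficients = n_coef, rank = rank_X, n_NA = n_na)
```

The three numbers should satisfy n_NA = coefficients - rank, with a single NA for this model.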
In your example, after introducing ec, there will be one NA. Changing the order in which covariates are specified in the model formula amounts to shuffling the columns of the model matrix. This does not change the matrix rank, so you will always get exactly one NA regardless of the specification order.
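As a quick check of this claim, the sketch below fits both specifications from the question and compares rank and NA counts. The object names fit_last and fit_first are made up for illustration.

```r
ec <- swiss$Examination + swiss$Catholic

fit_last  <- lm(Fertility ~ . + ec, data = swiss)  # ec specified last
fit_first <- lm(Fertility ~ ec + ., data = swiss)  # ec specified first

# same rank, hence the same number of NA coefficients
c(fit_last$rank, fit_first$rank)
c(sum(is.na(coef(fit_last))), sum(is.na(coef(fit_first))))

# only the position of the NA differs between the two fits
names(which(is.na(coef(fit_last))))
names(which(is.na(coef(fit_first))))
```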
OK, but which one is NA?
lm does a LINPACK QR factorization with restricted column pivoting, so the order of covariates affects which one is NA. Generally, a "first come, first served" principle holds, and the position of NA is quite predictable. Take your examples. In the first specification, the collinear terms appear in the order Examination, Catholic, ec, so the third one, ec, gets the NA coefficient. In the second specification, they appear in the order ec, Examination, Catholic, so the third one, Catholic, gets the NA coefficient. Note that the coefficient estimates are not invariant to specification order, although the fitted values are.
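The last remark can be illustrated on the same data. Since Catholic = ec - Examination, a fit with coefficients b_E on Examination and b_C on Catholic can be rewritten with (b_E - b_C) on Examination and b_C on ec, which is the pattern in the question's two outputs. The sketch below checks this, and also looks at the pivot component that lm stores in its QR object; the names fit_last, fit_first, b_last and b_first are illustrative.

```r
ec <- swiss$Examination + swiss$Catholic
fit_last  <- lm(Fertility ~ . + ec, data = swiss)
fit_first <- lm(Fertility ~ ec + ., data = swiss)

# fitted values agree up to floating-point error
all.equal(fitted(fit_last), fitted(fit_first))

# coefficients are re-expressed, not re-estimated
b_last  <- coef(fit_last)
b_first <- coef(fit_first)
all.equal(unname(b_first["Examination"]),
          unname(b_last["Examination"] - b_last["Catholic"]))
all.equal(unname(b_first["ec"]), unname(b_last["Catholic"]))

# column order after pivoting: the aliased column is moved to the end
colnames(model.matrix(fit_last))[fit_last$qr$pivot]
colnames(model.matrix(fit_first))[fit_first$qr$pivot]
```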
If a LAPACK QR factorization with complete column pivoting is used instead, the coefficient estimates are invariant to specification order. However, the position of NA is not as predictable as in the LINPACK case; it is decided purely numerically.
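Base R exposes both routines through qr(), so the two pivoting strategies can be compared without fitting a model. This is only a sketch on the question's design matrix, not what lm or mgcv do internally; X_design, qr_linpack, qr_lapack and d are illustrative names.

```r
ec <- swiss$Examination + swiss$Catholic
X_design <- model.matrix(Fertility ~ . + ec, data = swiss)

qr_linpack <- qr(X_design)                 # default: LINPACK, limited pivoting
qr_lapack  <- qr(X_design, LAPACK = TRUE)  # LAPACK, column pivoting

# column order chosen by each routine
colnames(X_design)[qr_linpack$pivot]
colnames(X_design)[qr_lapack$pivot]

# With LAPACK = TRUE, qr() does not estimate the rank itself, so inspect the
# diagonal of R: an entry that is tiny relative to the largest one signals
# rank deficiency.
d <- abs(diag(qr.R(qr_lapack)))
d / max(d)
```

In the LINPACK result the deficient column is simply pushed to the end of the specified order, whereas the LAPACK ordering is driven by column norms.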
LAPACK-based QR factorization is implemented in the mgcv package. Numerical rank is detected when REML estimation is used, and unidentifiable coefficients are reported as 0 (not NA). So we can compare lm with gam / bam in linear model estimation. Let's first construct a toy dataset.
set.seed(0)
# an initial full rank matrix
X <- matrix(runif(500 * 10), 500)
# make the last column as a random linear combination of previous 9 columns
X[, 10] <- X[, -10] %*% runif(9)
# a random response
Y <- rnorm(500)
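Before shuffling anything, it may help to confirm that this toy design is exactly one short of full column rank once the intercept is included. The sketch rebuilds the same data so that it runs on its own; X_int is an illustrative name.

```r
set.seed(0)
X <- matrix(runif(500 * 10), 500)
X[, 10] <- X[, -10] %*% runif(9)
Y <- rnorm(500)

# intercept + 10 columns = 11 coefficients, but column 10 adds no new direction
X_int <- cbind(1, X)
c(columns = ncol(X_int), rank = qr(X_int)$rank)

# so lm should report exactly one NA coefficient
sum(is.na(coef(lm(Y ~ X))))
```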
Now we shuffle the columns of X to see whether NA changes its position under lm estimation, or whether 0 changes its position under gam and bam estimation.
test <- function (fun = lm, seed = 0, ...) {
  # fit once on a random column permutation of X
  shuffleFit <- function (fun) {
    shuffle <- sample.int(ncol(X))
    Xs <- X[, shuffle]
    b <- unname(coef(fun(Y ~ Xs, ...)))
    # map coefficients back to the original column order
    back <- order(shuffle)
    c(b[1], b[-1][back])
  }
  set.seed(seed)
  # 10 shuffled fits, one row per fit
  oo <- t(replicate(10, shuffleFit(fun)))
  colnames(oo) <- c("intercept", paste0("X", 1:ncol(X)))
  oo
}
First we check with lm:
test(fun = lm)
We see that NA changes its position when the columns of X are shuffled. The estimated coefficients vary, too.
Now we check with gam:
library(mgcv)
test(fun = gam, method = "REML")
We see that the estimates are invariant to column shuffling of X, and the coefficient for X5 is always 0.
Finally we check bam. (bam is designed for large or very large datasets, so on a small dataset like this one the following is noticeably slower.)
test(fun = bam, gc.level = -1)
The result is the same as what we see for gam.